developmentseed / pearl-ml-pipeline

12 stars 3 forks source link

Optimize GPU/CPU cluster usage #7

Open srmsoumya opened 2 years ago

srmsoumya commented 2 years ago

Training Pipeline takes ~2 Hours to run the training script for a sample space, for eg: Fort Collins region.

We are creating a CPU Cluster & GPU Cluster with 4 nodes each. From the experiment logs, it looks like we are using a single GPU node for training. Is it because of CPU bottleneck or we don't have distributed code in place?

cc/ @geohacker

srmsoumya commented 2 years ago

I am meeting with Nana today, may be I can check this with her.

geohacker commented 2 years ago

@srmsoumya worth checking with Nana. I also think it's worth just reducing the cluster capacity and seeing what happens.