Status: Closed. Eric-Zhang1990 closed this issue 5 years ago.
@Tomcli I am also confused about the relationship between `gpus` and `learners`. My understanding is that each learner gets the number of GPUs I set, e.g. with `gpus: 2` and `learners: 2`, each learner uses 2 GPUs, so the 2 learners use 4 GPUs in total. Is that right? If so, does each learner run the same training code independently and save its own model, meaning 2 learners produce 2 different models? Or, even though there are 2 learners with 2 GPUs each, does the job save just one model, i.e. the training job uses all 4 GPUs for a single distributed training run? I don't know which one is right, or whether both are wrong. Can you explain the relationship between `gpus` and `learners`? Thank you.
Hi @Eric-Zhang1990, the bottleneck of your Caffe job is the CPU allocation. With 0.5 CPU, preprocessing will take significantly more time than training the actual model. For our Caffe example running on GPU, I would recommend allocating at least 2 CPUs.
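As a generic illustration of this kind of diagnosis (not FfDL- or Caffe-specific), one way to confirm an input-pipeline bottleneck is to time the data-loading stage separately from the compute stage. The `load_batch` and `train_step` functions below are stand-ins with simulated delays, not real Caffe calls:

```python
import time

def timed(fn, *args):
    """Return (result, seconds elapsed) for one call."""
    start = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - start

# Stand-ins for a real pipeline: load_batch plays the role of Caffe's
# transform/prefetch stage, train_step the GPU forward/backward pass.
def load_batch():
    time.sleep(0.02)   # simulated CPU-bound preprocessing
    return [0] * 32

def train_step(batch):
    time.sleep(0.005)  # simulated GPU compute
    return len(batch)

load_t = train_t = 0.0
for _ in range(5):
    batch, dt = timed(load_batch)
    load_t += dt
    _, dt = timed(train_step, batch)
    train_t += dt

# If load_t dominates train_t, the job is input-bound: giving the
# learner more CPUs (or overlapping prefetch with compute) helps more
# than adding GPUs.
print(f"load: {load_t:.3f}s  train: {train_t:.3f}s")
```

With the simulated delays above, loading dominates, which is the same symptom an under-provisioned 0.5-CPU learner would show in its logs.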
For the manifest specs, you are correct that the amounts of gpus, cpus, and memory in the manifest are per learner. You can find the details of the manifest specs at https://github.com/IBM/FfDL/blob/master/docs/user-guide.md#24-creating-manifest-file
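To make the per-learner semantics concrete, here is a minimal sketch of the resource section of a manifest. Field names follow the linked user guide; the values are illustrative, not a recommended configuration:

```yaml
# Illustrative resource section of an FfDL manifest (values are examples).
# gpus/cpus/memory are allocated PER LEARNER, so this job consumes
# 2 learners x 2 GPUs = 4 GPUs in total.
name: caffe-sample
version: "1.0"
gpus: 2
cpus: 2.0
memory: 2Gb
learners: 2
```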
Thanks.
@Tomcli Thank you for your kind reply. I saw the info at https://github.com/IBM/FfDL/blob/master/docs/user-guide.md#24-creating-manifest-file ("learners: Number of learners to use in training. As FfDL supports distributed learning, you can have more than one learner for your training job."), and I also run multiple learners for one training job. What confuses me is this: if I use multiple learners for one job, is the number of models I get equal to the number of learners, or do I still get just one model? Thank you.
@Tomcli I want to run the maskrcnn-benchmark project (https://github.com/facebookresearch/maskrcnn-benchmark), but I noticed the note at https://github.com/IBM/FfDL/blob/master/docs/user-guide.md#24-creating-manifest-file ("Note that all model definition files has to be in the first level of the zip file and there are no nested directories in the zip file."), which seems to mean I can't run a job whose code spans multiple directories, right? However, all our projects have multiple directories. How can we run them on FfDL? Thank you.
@Tomcli Another question: when I use my own image and data (via host-mount) for Caffe training, it takes a long time for just 100 iterations, and from the log I can see most of the time is spent on the data transform, prefetch batch, etc. I don't understand how this works. Thank you. Caffe gpu-manifest.yml:
Caffe training logs: