IBM / FfDL

Fabric for Deep Learning (FfDL, pronounced fiddle) is a Deep Learning Platform offering TensorFlow, Caffe, PyTorch etc. as a Service on Kubernetes
https://developer.ibm.com/code/patterns/deploy-and-use-a-multi-framework-deep-learning-platform-on-kubernetes/
Apache License 2.0

caffe training speed is very slow #166

Closed · Eric-Zhang1990 closed 5 years ago

Eric-Zhang1990 commented 5 years ago

@Tomcli Another question: when I use my own image and data (via host-mount) for Caffe training, it takes a long time to run just 100 iterations. From the log, most of the time is spent on data transform, prefetch batch, etc. I don't understand why this happens. Thank you.

caffe gpu-manifest.yml: [screenshot of manifest]

Caffe training logs: [three log screenshots]

Eric-Zhang1990 commented 5 years ago

@Tomcli I am also confused about the relationship between the number of GPUs and the number of learners. My understanding is that each learner can use the number of GPUs I set. E.g. [screenshot]: with gpus set to 2 and learners also set to 2, each learner uses 2 GPUs, so 2 learners use 4 GPUs. Is that right? If so, what I want to know is whether each learner runs the same training code independently and saves its own model, so that 2 learners produce 2 different models; or whether, even though there are 2 learners with 2 GPUs each, the job saves only one model, i.e. the training job uses all 4 GPUs for distributed training. I don't know which one is right, or whether both are wrong. Can you explain the relationship between gpus and learners? Thank you.

Tomcli commented 5 years ago

Hi @Eric-Zhang1990, the bottleneck of your Caffe job is the CPU spec. With 0.5 CPU, preprocessing will take significantly more time than training the actual model. For our Caffe example running on GPU, I would recommend allocating at least 2 CPUs.

For the manifest specs, you are correct that the gpus, cpus, and memory values in the manifest apply to each learner. You can find the details of the manifest specs at https://github.com/IBM/FfDL/blob/master/docs/user-guide.md#24-creating-manifest-file
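For example, a minimal manifest sketch along these lines (the job name, framework version tag, and solver file are illustrative placeholders, and the data_stores section is omitted for brevity):

```yaml
name: caffe_example
description: Caffe training job with two learners
version: "1.0"

# These resource values apply to EACH learner: with learners: 2,
# the job uses 4 GPUs and 4 CPUs in total.
gpus: 2
cpus: 2        # at least 2 CPUs per learner so data transform/prefetch keeps up with the GPUs
memory: 8Gb
learners: 2

framework:
  name: caffe
  version: "1.0-py2"                              # illustrative image tag
  command: caffe train -solver lenet_solver.prototxt
```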

Thanks.

Eric-Zhang1990 commented 5 years ago

@Tomcli Thank you for your kind reply. I saw the info at https://github.com/IBM/FfDL/blob/master/docs/user-guide.md#24-creating-manifest-file ("learners: Number of learners to use in training. As FfDL supports distributed learning, you can have more than one learner for your training job."), and I also run multiple learners for one training job. What confuses me is this: if I use multiple learners for one job, is the number of models I get equal to the number of learners, or do I get just one model even though I used multiple learners? Thank you.

Eric-Zhang1990 commented 5 years ago

@Tomcli I want to run the maskrcnn-benchmark project (https://github.com/facebookresearch/maskrcnn-benchmark) [screenshot], but I found the NOTE at https://github.com/IBM/FfDL/blob/master/docs/user-guide.md#24-creating-manifest-file ("Note that all model definition files have to be in the first level of the zip file and there are no nested directories in the zip file."), which means I can't run a job that has multiple directories, right? However, all our projects have multiple directories, so how can I run them on FfDL? Thank you.
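If I understand the note correctly, FfDL expects a flat layout like the sketch below (the file names and the framework section are hypothetical, just to illustrate the constraint):

```yaml
# Hypothetical flat model.zip layout implied by the note above
# (file names are made up for illustration):
#
#   model.zip
#   ├── train.py
#   ├── model.py
#   └── config.yaml
#
# The manifest's framework command then refers to those top-level files:
framework:
  name: pytorch
  version: "1.0"          # illustrative version tag
  command: python3 train.py
```

But maskrcnn-benchmark is a full Python package with nested modules, so it can't simply be flattened into a single directory.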