facebookarchive / caffe2

Caffe2 is a lightweight, modular, and scalable deep learning framework.
https://caffe2.ai
Apache License 2.0

Distributed training with network defined in protobuf files #348

Open sergey-serebryakov opened 7 years ago

sergey-serebryakov commented 7 years ago

Hi! I've been experimenting with Caffe2. I could convert our original Caffe models ([net.deploy.prototxt, net.deploy.caffemodel] -> [net.init.pb, net.deploy.pb]) and train them in single-GPU/CPU mode in Caffe2. This was easy and smooth, and everything worked as expected; the guides and source code helped a lot! The question I have is whether there is an easy way to train these translated networks in a distributed setup. The data_parallel_model.Parallelize_GPU helper accepts a cnn.CNNModelHelper, which does not seem to support serialization. Concretely, what would be the best way to go from the *.pb files (weights and definition) to distributed training? Thanks!
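For reference, here is roughly how I understand Parallelize_GPU is meant to be driven (patterned after resnet50_trainer.py; the toy network, file names, and hyperparameters below are just illustrative):

```python
from caffe2.python import cnn, data_parallel_model, workspace

# Parallelize_GPU is driven by Python builder functions, not serialized nets.
train_model = cnn.CNNModelHelper(order="NCHW", name="translated_net")

def add_inputs(model):
    # Illustrative reader: batches of flat 64-dim features plus integer labels.
    data, label = model.TensorProtosDBInput(
        [], ["data", "label"], batch_size=32,
        db="train_data.minidb", db_type="minidb")

def forward_pass(model, loss_scale):
    # A tiny stand-in for the ops that would mirror net.deploy.pb.
    fc = model.FC("data", "fc", dim_in=64, dim_out=10)
    softmax, loss = model.SoftmaxWithLoss([fc, "label"], ["softmax", "loss"])
    loss = model.Scale(loss, loss, scale=loss_scale)
    return [loss]

def param_update(model):
    # Plain SGD: param += (-base_lr) * grad, per the MNIST tutorial pattern.
    ITER = model.Iter("iter")
    LR = model.LearningRate(ITER, "lr", base_lr=-0.01, policy="fixed")
    ONE = model.param_init_net.ConstantFill([], "one", shape=[1], value=1.0)
    for param in model.GetParams():
        grad = model.param_to_grad[param]
        model.WeightedSum([param, ONE, grad, LR], param)

data_parallel_model.Parallelize_GPU(
    train_model,
    input_builder_fun=add_inputs,
    forward_pass_builder_fun=forward_pass,
    param_update_builder_fun=param_update,
    devices=[0, 1],
)

workspace.RunNetOnce(train_model.param_init_net)
workspace.CreateNet(train_model.net)
```

Everything here is built from Python code, which is exactly the part I don't have for the translated protobuf models.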

Yangqing commented 7 years ago

/cc @akyrola

Aapo, what do you think is the best way for us to do finetuning, especially with an existing model?

My gut feeling is that one can manually feed the parameters into the workspace (with device scopes) and then skip the init net. The Caffe and Caffe2 training mechanisms are indeed sufficiently different that training from a predefined proto is a bit tricky.
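A rough sketch of that idea, assuming the translated net.init.pb carries the pretrained weights and that the per-GPU blobs follow the "gpu_0/..." naming that data_parallel_model uses (both assumptions, not a tested recipe):

```python
from caffe2.proto import caffe2_pb2
from caffe2.python import core, workspace

# Run the translated init net once (on CPU) to materialize the pretrained
# parameter blobs in the workspace.
init_net = caffe2_pb2.NetDef()
with open("net.init.pb", "rb") as f:
    init_net.ParseFromString(f.read())
workspace.RunNetOnce(init_net)

# Re-feed each parameter under a GPU device scope so the training net sees the
# pretrained values and the generated param_init_net can be skipped.
device_opt = core.DeviceOption(caffe2_pb2.CUDA, 0)
for op in init_net.op:
    for blob in op.output:
        value = workspace.FetchBlob(blob)
        # "gpu_0/" mirrors data_parallel_model's default blob naming (assumption).
        workspace.FeedBlob("gpu_0/{}".format(blob), value, device_option=device_opt)
```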

Another thing I can think of is to have some script that generates a multi-gpu training plan from a single-gpu net.

akyrola commented 7 years ago

If you have model building code in Python, you should just do everything the Caffe2 way, and then load the parameters from a pretrained model to replace the ones that were populated by the param_init_net. Then call data_parallel_model.FinalizeAfterCheckpoint() so that the parameters are distributed to all hosts. Unfortunately I don't think we have any example of this in the open source tree... I plan to modify resnet50_trainer.py at some point to include checkpointing, though.
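A minimal sketch of the tail end of that flow, reusing the train_model and the blob-feeding idea from the sketches above; depending on the version, FinalizeAfterCheckpoint may also want the list of parameter blob names:

```python
from caffe2.python import data_parallel_model, workspace

# train_model: the model built via data_parallel_model.Parallelize_GPU above,
# with its param_init_net already run and the "gpu_0/..." parameter blobs
# overwritten by the pretrained weights (see the earlier feed-blob sketch).
data_parallel_model.FinalizeAfterCheckpoint(train_model)

# Training then proceeds as usual.
workspace.CreateNet(train_model.net)
workspace.RunNet(train_model.net.Proto().name)
```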

But if you work just with the protobuf models, we don't have a solution for parallelizing the training. It would not be, at least in principle, too difficult to take a converted protobuf, recognize the gradient and parameter-update operators, and duplicate the model for all GPUs, but we have no plans to support this. Happy to help someone implement it, though.
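For anyone who wants to take a stab at it, a naive sketch of the duplication step; it pins a copy of every operator to each GPU and renames the blobs, but leaves out the cross-GPU gradient reduction (e.g. NCCLAllreduce) and the per-GPU init net that real support would need:

```python
from caffe2.proto import caffe2_pb2

def duplicate_net_for_gpus(net_def, num_gpus):
    """Clone a single-GPU training NetDef once per GPU: prefix blob names with
    "gpu_<i>/" and pin every operator to its device. Gradient reduction and
    parameter synchronization across GPUs are deliberately omitted."""
    multi = caffe2_pb2.NetDef()
    multi.name = net_def.name + "_multigpu"
    for gpu in range(num_gpus):
        for op in net_def.op:
            new_op = multi.op.add()
            new_op.CopyFrom(op)
            new_op.device_option.device_type = caffe2_pb2.CUDA
            new_op.device_option.cuda_gpu_id = gpu
            new_op.input[:] = ["gpu_{}/{}".format(gpu, b) for b in op.input]
            new_op.output[:] = ["gpu_{}/{}".format(gpu, b) for b in op.output]
    return multi
```

The blob renaming mirrors what data_parallel_model does internally; the reduction and synchronization ops are the part that makes this more than a copy job.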