NVIDIA / DIGITS

Deep Learning GPU Training System
https://developer.nvidia.com/digits
BSD 3-Clause "New" or "Revised" License

Cluster backend #108

Open cancan101 opened 9 years ago

cancan101 commented 9 years ago

Support for using SGE for the job execution.

lukeyeager commented 9 years ago

We're exploring using DIGITS as a frontend for an EC2 cluster. Presumably we'll design it to work with SGE (now OGE?) or whatever else you want to use.

cancan101 commented 9 years ago

I use SGE on EC2 via StarCluster now. There are tools like http://drmaa-python.readthedocs.org/en/latest/ which abstract away which cluster engine is used.

SGE is aka OGE but also aka "Son of Grid Engine" :smile:
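
For reference, a minimal sketch of what job submission through that DRMAA abstraction looks like (assuming a working SGE installation and the `drmaa` Python package; the script path and arguments are hypothetical):

```python
import drmaa

# Open a session against whatever DRM the drmaa library is configured for (SGE here).
s = drmaa.Session()
s.initialize()

jt = s.createJobTemplate()
jt.remoteCommand = '/home/user/run_training.sh'  # hypothetical training script
jt.args = ['--epochs', '30']
jt.joinFiles = True  # merge stdout and stderr

job_id = s.runJob(jt)
print('Submitted job %s' % job_id)

# Block until the job finishes, then report its exit status.
info = s.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
print('Job %s finished with status %s' % (info.jobId, info.exitStatus))

s.deleteJobTemplate(jt)
s.exit()
```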

pmkulkarni commented 9 years ago

Use DIGITS to manage inter-node training. For example, this could be used to train a single model across multiple GPU instances on EC2.

loneae commented 9 years ago

I have a dataset of 40 million pics and I want to speed up the training. If I use multiple EC2 GPU servers, will it work and make maximum use of each EC2 GPU? Any limitations? Can I use 10 EC2 instances?

lukeyeager commented 9 years ago

Caffe is currently multi-GPU but not multi-node. Until DIGITS supports a multi-node framework, having multiple instances will not improve the training time of any one training job.

DIGITS is also single-node for now. So you would have to run one DIGITS instance on each node.

loneae commented 9 years ago

I have just realised I can create a GPU cluster and combine multiple nodes into one powerful machine using StarCluster. http://hpc.nomad-labs.com/archives/139

loneae commented 9 years ago

And then run my Caffe training on multiple GPUs as if they were all in the same server.

thatguymike commented 9 years ago

That isn't going to do what you want. Even if you got that dusty code to work, the interconnect between the nodes is nowhere near PCIe speeds, so your scaling performance will be quite poor with the current multi-GPU methodology. To scale across nodes, the algorithms would need to change to an asynchronous method (EASGD, for example), which has different numerical behavior and convergence properties.
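
For reference, the EASGD idea mentioned here couples each worker's weights x^i to a shared center variable x̃ rather than synchronizing every gradient step; a sketch of the per-step updates with learning rate η and coupling strength ρ (exact constants and communication schedules vary by variant):

```latex
% Worker i: local gradient step plus an elastic pull toward the center variable
x^i_{t+1} = x^i_t - \eta \left( \nabla f_i(x^i_t) + \rho \, (x^i_t - \tilde{x}_t) \right)
% Center variable: drifts toward the workers' (possibly stale) parameters
\tilde{x}_{t+1} = \tilde{x}_t + \eta \rho \sum_i \left( x^i_t - \tilde{x}_t \right)
```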

loneae commented 9 years ago

OK, it sounds like you have experience in this field. Tell me, if I will not be able to run multi-GPU across servers, can I just train a batch on each server and afterwards merge the results into one trained model file?

loneae commented 9 years ago

Also, we can use internal AWS IPs, so the data transfer between servers is very fast.

thatguymike commented 9 years ago

PCIe Gen 3 is 12 GB/s. At best on AWS you get 10GbE interconnects on the general nodes, so about 1.2 GB/s on a good day. We are communication bound on PCIe Gen 3, so scaling is not going to be that great. Again, to go to multiple nodes you need to move away from synchronous stochastic gradient descent to a different solver, likely an asynchronous technique.
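
A quick back-of-the-envelope illustration of why that gap matters (the parameter count is an assumption for illustration, roughly AlexNet-sized, and ignores any overlap of compute and communication):

```python
params = 60e6                 # assumed model size, ~AlexNet-scale
grad_bytes = params * 4       # FP32 gradients exchanged per synchronous step

pcie_gen3 = 12e9              # ~12 GB/s, the figure quoted above
ten_gbe   = 1.2e9             # ~1.2 GB/s usable on a 10GbE link

print('PCIe Gen3: %.0f ms per exchange' % (grad_bytes / pcie_gen3 * 1e3))
print('10GbE:     %.0f ms per exchange' % (grad_bytes / ten_gbe * 1e3))
# ~20 ms vs ~200 ms: the network link is ~10x slower, so a job that is
# already communication bound over PCIe gets much worse across nodes.
```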

c4ssio commented 9 years ago

Hi @thatguymike, thanks very much for your time and help if you can enlighten me, and apologies for the wall of text. I have a database of ~10 million wine labels that I would like to train an image recognition system on, and moreover retrain on a regular basis as more people take pictures.

  1. Is it possible to train 1000 categories of 100 images each in DIGITS, then use the last epoch's model in a subsequent classification model with an entirely different set of 100k images to achieve a model with 2k categories? If yes, can I keep adding images to the same categories as more come in, so the model gets fine-tuned?
  2. (Similar to @loneae 's question above) Is it possible to train multiple sets of images on distinct EC2 g2.8xlarge instances, then use a script to combine their results into a composite model including all the categories and results? If I understand correctly, you said above "not with the SGD solver." Is there a solver you would recommend for this?

thatguymike commented 9 years ago

Let's step back a bit. 10 million wine labels isn't "that big". There are a few ways to attack the problem. I would start with a strong pretrained network, like GoogLeNet or VGG trained on ImageNet, and then fine-tune for your categories (e.g. slice off the last layer and redefine the number of outputs). It should in theory converge quite quickly. Then as you add images you can drop out parts of your older original data or just continue to grow. I assume you will need to add more categories, which will generally mean re-fine-tuning things.
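
A rough sketch of that fine-tuning recipe in plain Caffe with pycaffe (outside DIGITS), assuming you have prototxt files in which the last classifier layer has been renamed and its output count changed; all file and layer names below are placeholders:

```python
import caffe

caffe.set_mode_gpu()

# The solver/net prototxts are assumed to already exist, with the final
# InnerProduct layer renamed (e.g. "fc_wine") and num_output set to your
# category count so its weights are re-initialized from scratch.
solver = caffe.get_solver('wine_solver.prototxt')

# Copy weights for every layer whose name still matches the pretrained model;
# the renamed classifier layer keeps its fresh random initialization.
solver.net.copy_from('bvlc_googlenet.caffemodel')

solver.solve()  # fine-tune, typically with a reduced base learning rate
```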

As for 1: That isn't going to do what you want. You might train the upper layers the first round to build a good feature descriptor chain (like what I recommend above), but you just can't add more categories. When you fine-tune you need enough representatives in all of your categories, so your dataset is going to end up growing over time. You will want O(1k) examples per class to be robust, but it depends on your dataset and accuracy requirements.

As for 2: Yes and no. Assuming everyone starts with the same seed and roughly equal representation of the dataset, you can try to train independently and then average the weights. I don't think that will end up being stable. Better might be to train multiple models for different subsets of categories and then run them as an ensemble, e.g. train 5 different networks, each with 1/5th of the categories. Then when doing inference, run the input through each of the models and choose the one with the highest probability. Might work, but you can also get a category split across models where you have a model that gets very confused and ranks a label high because it can't discriminate properly.
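
A sketch of that ensemble idea with pycaffe, assuming five deploy/weights pairs each covering a disjoint slice of the label space, preprocessed input, and a softmax output blob named `prob` (blob and file names are assumptions):

```python
import caffe

# Each model covers a disjoint subset of categories (paths are placeholders).
models = [caffe.Net('deploy_%d.prototxt' % i, 'weights_%d.caffemodel' % i, caffe.TEST)
          for i in range(5)]

def classify(image_blob):
    """Run one preprocessed image through every model and keep the single
    most confident prediction across all of them."""
    best = (None, -1.0)  # ((model index, local label), probability)
    for i, net in enumerate(models):
        net.blobs['data'].data[...] = image_blob
        probs = net.forward()['prob'][0]
        local_label = int(probs.argmax())
        if probs[local_label] > best[1]:
            best = ((i, local_label), float(probs[local_label]))
    return best
```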

c4ssio commented 9 years ago

Thanks for the prompt response! I clearly need to read more about this.

The issue I have is that we get thousands of new wine label brands a day, so I can't possibly know all my categories ahead of time. I can do 100k photos in 8.5h at my current rate (using g2.8xlarge, GoogLeNet, 256x256, color images), so I'd be spending 35 days building out a massive model with my entire dataset which will end up being a month stale and not include the new hotness in the wine industry.

So either I need to come up with a way to run through my entire corpus 100x faster (diminish accuracy thru downsampling, prioritize the most relevant subset of data, or beast out my hardware), or I need a different kind of solver that allows me to incrementally add new categories and add new photos to existing ones.

I like your idea about splitting out my corpus into a few different networks, but yeah, it will compromise accuracy because one network will think it's 50% sure of a match and another will think it's 80% sure, and the 50% one will end up being correct because the 80% one just doesn't know enough.

OK the search continues, thanks for your help! Happy Labor Day weekend and I'll keep an eye out for exciting developments here and maybe pop on over to the BVLC folks and see if they have any ideas.

bhack commented 9 years ago

@thatguymike See some caffe experiments at http://arxiv.org/abs/1506.08272

vfdev-5 commented 7 years ago

@lukeyeager I wonder whether DIGITS has advanced to managing task execution on SGE clusters since this issue was created?

I would like to use DIGITS with Intel-flavoured Caffe on the Colfax Cluster. In this case it is possible to hack the Task class and configure it (choosing LocalTask or GridEngineTask) at DIGITS startup. Each task is then executed with qsub in interactive mode. See my fork for details. Some widgets are not updated during task execution, but that is fine for my use case. Anyway, I would like to get your feedback.
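
Without knowing the internals of that fork, here is only a rough illustration of the general pattern described above: wrapping the command a task would run locally with the cluster's submit command. The class and method names are hypothetical and not the real DIGITS Task API, and scheduler flags vary (the `-I` interactive flag is a Torque-style assumption; SGE uses qrsh/qlogin):

```python
import subprocess

class GridEngineTask(object):
    """Illustrative only: run a task's command through the cluster scheduler
    instead of on the DIGITS host. Real DIGITS Task internals differ."""

    def __init__(self, args, submit_cmd=('qsub', '-I')):  # assumed interactive submit
        self.args = list(args)
        self.submit_cmd = list(submit_cmd)

    def run(self):
        # Prefix the local command with the scheduler's submit command so the
        # job executes on a compute node.
        return subprocess.call(self.submit_cmd + self.args)

class LocalTask(object):
    """Illustrative counterpart: run the command directly on the local host."""

    def __init__(self, args):
        self.args = list(args)

    def run(self):
        return subprocess.call(self.args)
```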