AntreasAntoniou / HowToTrainYourMAMLPytorch

The original code for the paper "How to train your MAML", along with a replication of the original "Model Agnostic Meta Learning" (MAML) paper, in PyTorch.
https://arxiv.org/abs/1810.09502

Questions about MAML training time #10

Open ptkin opened 5 years ago

ptkin commented 5 years ago

Hi Antreas,

Thank you for your wonderful work MAML++!

I'm currently running some few-shot learning experiments based on your code, but I found that they take a very long time to converge, so I would like to ask a few questions.

I ran the experiments on a single V100 GPU: the 5-way 1-shot miniImagenet MAML experiment takes about 5 hours (500 × 100 iterations in total), and the Omniglot experiment takes about 15 hours. The training time seems a bit long to me, so I'm wondering how long the experiments you conducted took.

In addition, I found that training doesn't make good use of the GPU's compute (with higher-order derivatives enabled): GPU utilization fluctuates constantly and never exceeds 50%. I profiled the training process and found that the main time cost is in the loss.backward step (more than 70% of each iteration). So I think training takes so long because the support set is too small to exploit the GPU's parallelism, and the higher-order derivative computation is expensive. What do you think about this? Do you have any suggestions for speeding up the experiments?
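
For reference, a minimal sketch of how one might measure the forward/backward split on GPU (the model, loss_fn, and data here are placeholders, not the repo's code):

```python
import time
import torch

def time_fwd_bwd(model, loss_fn, x, y):
    """Roughly time one forward and one backward pass on the GPU."""
    torch.cuda.synchronize()
    t0 = time.time()
    loss = loss_fn(model(x), y)   # forward pass
    torch.cuda.synchronize()
    t1 = time.time()
    loss.backward()               # backward pass (the dominant cost observed above)
    torch.cuda.synchronize()
    t2 = time.time()
    return t1 - t0, t2 - t1       # (forward seconds, backward seconds)
```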

Thank you in advance for your time!

AntreasAntoniou commented 5 years ago

Your training times sound about right to me. To utilize GPUs better one has to fully parallelize the MAML computational graph, which is currently not easy to do in PyTorch. However, I have been working on that. The most important component required is a convolutional layer that can receive a batch of different parameters and a batch of batches of images, and apply each set of parameters to its corresponding image batch. Have a look at: https://github.com/pytorch/pytorch/issues/17983.
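
As a rough illustration of that idea (a sketch, not the repo's implementation; the function name and tensor layout are assumptions), such a per-task convolution can be emulated with a grouped convolution:

```python
import torch
import torch.nn.functional as F

def per_task_conv2d(x, weight, bias=None, stride=1, padding=1):
    """Apply a different set of conv filters to each task's image batch in one call.

    x:      (T, B, C_in, H, W)      one image batch per task
    weight: (T, C_out, C_in, k, k)  one set of filters per task
    bias:   (T, C_out) or None
    returns (T, B, C_out, H', W')
    """
    T, B, C_in, H, W = x.shape
    C_out = weight.shape[1]

    # Fold the task dimension into the channel dimension and use groups=T,
    # so group t convolves task t's channels with task t's filters only.
    x = x.permute(1, 0, 2, 3, 4).reshape(B, T * C_in, H, W)
    w = weight.reshape(T * C_out, C_in, *weight.shape[-2:])
    b = bias.reshape(T * C_out) if bias is not None else None

    out = F.conv2d(x, w, b, stride=stride, padding=padding, groups=T)
    return out.reshape(B, T, C_out, *out.shape[-2:]).permute(1, 0, 2, 3, 4)
```

This way the whole meta-batch shares a single conv2d launch per layer instead of looping over tasks, at the cost of requiring the same per-task batch size everywhere.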

Furthermore, the fact that the forward pass accounts for only about 10% of the time and the backward pass for around 70-80% shows that both operations are slow enough that you can actually watch the utilization oscillate. Normally the forward pass runs within a couple of milliseconds, while the backward pass dominates the computation, so the GPU appears close to 100% utilization most of the time because it is mostly doing backprop. To speed this up, you need to parallelize the inner-loop optimization process across tasks; GPU utilization should then improve.
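
As a minimal sketch of what a parallelized inner-loop step could look like, here is one possible approach using torch.func from more recent PyTorch releases (the function names, the fixed inner learning rate, and the assumption of a model without BatchNorm running statistics are all illustrative, not the repo's code):

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call, grad, vmap

def make_inner_step(model, inner_lr=0.01):
    """Build a function that runs one inner-loop SGD step for all tasks in parallel.

    Assumes `model` does not track BatchNorm running stats, since in-place
    buffer updates do not compose with vmap.
    """

    def support_loss(params, x, y):
        # Run the model functionally with the given parameter dict.
        logits = functional_call(model, params, (x,))
        return F.cross_entropy(logits, y)

    def inner_step(params, x, y):
        grads = grad(support_loss)(params, x, y)
        return {name: p - inner_lr * grads[name] for name, p in params.items()}

    # Map over the task dimension of the support data; share the initial params.
    return vmap(inner_step, in_dims=(None, 0, 0))

# Usage sketch (shapes are assumptions):
#   params  = dict(model.named_parameters())
#   adapted = make_inner_step(model)(params, support_x, support_y)
# where support_x has shape (num_tasks, ways * shots, C, H, W) and the
# returned dict holds one adapted parameter set per task, stacked along dim 0.
```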

ptkin commented 5 years ago

Thank you for your reply! I have looked at the link you gave, and I agree that parallelizing the inner-loop optimization is really the key.

Here I'd also like to share some of my own thoughts. I think this phenomenon is largely due to the particular constraints of few-shot scenarios. My considerations are as follows:

All in all, applying more parallelization is a very sensible decision, including parallelizing the inner-loop optimization across multiple tasks as you proposed in the link.

Thanks again for your reply, your suggestions, and for open-sourcing your work.