AntreasAntoniou / HowToTrainYourMAMLPytorch

The original code for the paper "How to train your MAML", along with a replication of the original "Model Agnostic Meta Learning" (MAML) paper, in PyTorch.
https://arxiv.org/abs/1810.09502

Questions about MAML training time #10

Open ptkin opened 5 years ago

ptkin commented 5 years ago

Hi Antreas,

Thank you for your wonderful work MAML++!

I'm currently running some few-shot learning experiments based on your code, but I found that they take a very long time to converge, so I would like to ask a few questions.

I ran the experiments on a single V100 GPU: the 5-way 1-shot miniImagenet MAML experiment takes about 5 hours (500 × 100 iterations in total), and the Omniglot experiment takes about 15 hours. The training time seems a bit long to me, so I'm wondering how long the experiments you conducted took.

In addition, I found that training doesn't make good use of the GPU's compute (with higher-order derivatives enabled): GPU utilization fluctuates constantly and never exceeds 50%. I profiled the training process and found that the main time cost is in the loss.backward step (more than 70% of each iteration). So I think training takes so long because the support set is too small to exploit the GPU's parallelism, and the higher-order derivative computation is expensive. What do you think about this? Do you have any suggestions for speeding up the experiments?
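
For reference, a minimal sketch of how one might measure the forward/backward split on GPU (the model, loss_fn, and data here are placeholders, not the repo's code):

```python
import time
import torch

def time_fwd_bwd(model, loss_fn, x, y):
    """Roughly time one forward and one backward pass on the GPU."""
    torch.cuda.synchronize()
    t0 = time.time()
    loss = loss_fn(model(x), y)   # forward pass
    torch.cuda.synchronize()
    t1 = time.time()
    loss.backward()               # backward pass (the dominant cost observed above)
    torch.cuda.synchronize()
    t2 = time.time()
    return t1 - t0, t2 - t1       # (forward seconds, backward seconds)
```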

Thank you in advance for your time!

AntreasAntoniou commented 5 years ago

Your training times sound about right to me. To utilize GPUs better one has to fully parallelize the MAML computational graph, which is currently not easy to do in PyTorch. However, I have been working on that. The most important component required is a convolutional layer that can receive a batch of different parameters and a batch of batches of images, and apply each set of parameters to its corresponding image batch. Have a look at: https://github.com/pytorch/pytorch/issues/17983.
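
As a rough illustration of that idea (a sketch, not the repo's implementation; the function name and tensor layout are assumptions), such a per-task convolution can be emulated with a grouped convolution:

```python
import torch
import torch.nn.functional as F

def per_task_conv2d(x, weight, bias=None, stride=1, padding=1):
    """Apply a different set of conv filters to each task's image batch in one call.

    x:      (T, B, C_in, H, W)      one image batch per task
    weight: (T, C_out, C_in, k, k)  one set of filters per task
    bias:   (T, C_out) or None
    returns (T, B, C_out, H', W')
    """
    T, B, C_in, H, W = x.shape
    C_out = weight.shape[1]

    # Fold the task dimension into the channel dimension and use groups=T,
    # so group t convolves task t's channels with task t's filters only.
    x = x.permute(1, 0, 2, 3, 4).reshape(B, T * C_in, H, W)
    w = weight.reshape(T * C_out, C_in, *weight.shape[-2:])
    b = bias.reshape(T * C_out) if bias is not None else None

    out = F.conv2d(x, w, b, stride=stride, padding=padding, groups=T)
    return out.reshape(B, T, C_out, *out.shape[-2:]).permute(1, 0, 2, 3, 4)
```

This way the whole meta-batch shares a single conv2d launch per layer instead of looping over tasks, at the cost of requiring the same per-task batch size everywhere.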

Furthermore, the fact that the forward pass accounts for only about 10% of the time and the backward pass for around 70-80% shows that both operations are slow enough that you can actually watch the utilization oscillate. Normally the forward pass runs within a couple of milliseconds, while the backward pass dominates the computation, so the GPU appears close to 100% utilization most of the time because it is mostly doing backprop. To speed this up, you need to parallelize the inner-loop optimization process across tasks; GPU utilization should then improve.
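
As a minimal sketch of what a parallelized inner-loop step could look like, here is one possible approach using torch.func from more recent PyTorch releases (the function names, the fixed inner learning rate, and the assumption of a model without BatchNorm running statistics are all illustrative, not the repo's code):

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call, grad, vmap

def make_inner_step(model, inner_lr=0.01):
    """Build a function that runs one inner-loop SGD step for all tasks in parallel.

    Assumes `model` does not track BatchNorm running stats, since in-place
    buffer updates do not compose with vmap.
    """

    def support_loss(params, x, y):
        # Run the model functionally with the given parameter dict.
        logits = functional_call(model, params, (x,))
        return F.cross_entropy(logits, y)

    def inner_step(params, x, y):
        grads = grad(support_loss)(params, x, y)
        return {name: p - inner_lr * grads[name] for name, p in params.items()}

    # Map over the task dimension of the support data; share the initial params.
    return vmap(inner_step, in_dims=(None, 0, 0))

# Usage sketch (shapes are assumptions):
#   params  = dict(model.named_parameters())
#   adapted = make_inner_step(model)(params, support_x, support_y)
# where support_x has shape (num_tasks, ways * shots, C, H, W) and the
# returned dict holds one adapted parameter set per task, stacked along dim 0.
```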

ptkin commented 5 years ago

Thank you for your reply! I have looked at the link you gave, and I agree that parallelizing the inner-loop optimization is really the key.

Here I'd also like to share some of my own thoughts. I think this phenomenon is largely due to the particular constraints of few-shot scenarios. My considerations are as follows:

All in all, applying more parallelization is a very sensible decision, including parallelizing the inner-loop optimization across multiple tasks as you proposed in the link.

Thanks again for your reply, your suggestions, and for open-sourcing your work.