facebookresearch / ParlAI

A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
https://parl.ai
MIT License

How to use another optimizer? #3453

Closed vinnik-dmitry07 closed 3 years ago

vinnik-dmitry07 commented 3 years ago

For example, how to use (if possible) the pytorch-optimizer library?

stephenroller commented 3 years ago

Oh wow, that's really nice. I wasn't aware of that library.

You can add in your own classes here (by either overriding optim_opts or just editing TorchAgent):

https://github.com/facebookresearch/ParlAI/blob/fb5c92741243756516fa50073d34e94ba0b6981e/parlai/core/torch_agent.py#L402
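
A rough sketch of that override, assuming the pytorch-optimizer library is installed as the `torch_optimizer` package and that you subclass an existing TorchAgent-based agent (the agent class and optimizer names here are just examples):

```python
# Hypothetical sketch: extend ParlAI's optimizer table via optim_opts().
import torch_optimizer as extra_optim  # jettify/pytorch-optimizer

from parlai.agents.transformer.transformer import TransformerGeneratorAgent


class MyGeneratorAgent(TransformerGeneratorAgent):
    @classmethod
    def optim_opts(cls):
        # Start from ParlAI's built-in optimizer table, then register
        # additional classes; the keys become valid --optimizer choices.
        optims = super().optim_opts()
        optims['adamp'] = extra_optim.AdamP
        optims['adahessian'] = extra_optim.Adahessian
        return optims
```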

If the optimizer has some hyperparameter names that are unusual, you might want to handle setting those arguments here:

https://github.com/facebookresearch/ParlAI/blob/fb5c92741243756516fa50073d34e94ba0b6981e/parlai/core/torch_agent.py#L924-L952
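
A standalone, hypothetical illustration of that second spot: `init_optim()` assembles a per-optimizer kwargs dict before constructing the optimizer, so unusual hyperparameter names (for example AdamP's `delta` and `wd_ratio`) would need their own branch there. The `adamp_*` option names below are invented for the example:

```python
# Hypothetical sketch of the kwargs-building pattern used in init_optim().
import torch
import torch_optimizer as extra_optim


def build_optimizer(params, opt):
    kwargs = {'lr': opt['learningrate']}
    if opt['optimizer'] == 'adamp':
        # AdamP's two extra hyperparameters beyond Adam's.
        kwargs['delta'] = opt.get('adamp_delta', 0.1)
        kwargs['wd_ratio'] = opt.get('adamp_wd_ratio', 0.1)
        return extra_optim.AdamP(params, **kwargs)
    return torch.optim.Adam(params, **kwargs)


model = torch.nn.Linear(4, 2)
optimizer = build_optimizer(
    model.parameters(), {'optimizer': 'adamp', 'learningrate': 1e-3}
)
```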

github-actions[bot] commented 3 years ago

This issue has not had activity in 30 days. Please feel free to reopen if you have more issues. You may apply the "never-stale" tag to prevent this from happening.

sjscotti commented 3 years ago

Has anyone tried to implement one of these PyTorch optimizers based on the comments above? @stephenroller's pointers are helpful, but it is a bit complicated in my case since I need to use FP16 for my small-memory GPU. I am intrigued by the AdaHessian optimizer, since it claims to be second-order, which should speed up slow convergence quite a bit.

sjscotti commented 3 years ago

So I believe I was successful in using the https://github.com/jettify/pytorch-optimizer optimizers in ParlAI, and even made a few of them fp16 memory-efficient. I've been trying them out to fine-tune BlenderBot2 (400M) on my domain-specific corpus, so I thought I would share which ones I found effective. AdamP seems to work significantly better than the existing Adam (both memory-efficient fp16): there is a small increase in time per iteration, it uses less GPU memory (I'm not sure why that is the case), and it reached lower perplexity than Adam for 95% of the iterations within epoch 11 under the same training conditions. So I think it is a good candidate for inclusion in an official release.

I also used AdaHessian and found it gave even better perplexity per iteration/epoch, but it is very costly. I could not get a memory-efficient fp16 version to work (for some reason, some of the gradients computed by backward were NaN or 0 when using the create_graph=True flag the optimizer requires, but they were computed fine without it), so I ran it in fp32. Comparing GPU memory, it needed up to 29GB (anything above the 8GB on my GPU uses CPU memory as GPU "virtual memory" on my Windows machine) vs 13.3GB for AdamP. So with the extra differentiation steps in AdaHessian, the higher precision needed, and a greater dependence on virtual GPU memory, it was about 6 times slower per iteration than AdamP. Maybe someone at FB Research could get it to work in fp16, but it would still probably be a lot slower than AdamP. However, I think it is worth considering for inclusion, because the significantly better optimization results per epoch could help get the best results in the last stages of fine-tuning.
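
For reference, a minimal standalone sketch (outside ParlAI; class and argument names assume the `torch_optimizer` package) of the training-step pattern AdaHessian needs. The create_graph=True backward pass is what makes it hard to combine with fp16 loss scaling and what drives the extra memory and time:

```python
# Minimal AdaHessian usage sketch with torch_optimizer (illustrative).
import torch
import torch_optimizer as extra_optim

model = torch.nn.Linear(16, 1)
optimizer = extra_optim.Adahessian(model.parameters(), lr=0.1)

x, y = torch.randn(8, 16), torch.randn(8, 1)
loss = torch.nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()
# create_graph=True keeps the graph so the optimizer can take the
# Hessian-vector products it uses to estimate second-order information.
loss.backward(create_graph=True)
optimizer.step()
```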

vinnik-dmitry07 commented 3 years ago

Consider RAdam and MADGRAD, as they have the same number of hyperparameters as Adam (AdamP has 2 extra, AdaHessian 1 extra).