Hi,
Thanks for reaching out. Yes, AdaHessian supports mini-batching; in fact, all of the examples we provide in this repository train with mini-batches.
Either the code in pytorch-optimizer or the code in this repo will work. If you want to use it for computer vision and/or NLP tasks, it may be easier to start from this repo, since we already have demos you can build on.
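For reference, here is a minimal sketch of a mini-batch training loop with the pytorch-optimizer version (`torch_optimizer.Adahessian`); `model`, `loss_fn`, `train_loader`, `num_epochs`, and the learning rate are placeholders, not values from this repo. The only difference from a standard Adam-style loop is the `create_graph=True` backward call, which AdaHessian needs in order to form its Hutchinson estimate of the Hessian diagonal:

```python
# Minimal sketch of mini-batch training with AdaHessian from pytorch-optimizer.
# `model`, `loss_fn`, `train_loader`, and `num_epochs` are placeholders.
import torch_optimizer as optim

optimizer = optim.Adahessian(model.parameters(), lr=1.0)  # lr is illustrative only

for epoch in range(num_epochs):
    for inputs, targets in train_loader:      # one mini-batch at a time
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        # Keep the graph so the optimizer can run a second backward pass
        # for its Hutchinson estimate of the Hessian diagonal.
        loss.backward(create_graph=True)
        optimizer.step()
```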
Best, -Amir
Thanks! BTW, as you suggested, I just looked at your examples in this repo and noticed that the adahessian/transformer/fairseq/optim/adahessian.py version of AdaHessian seems to use fp16. Is that correct? If so, I would like to use that version in my chatbot training for the memory efficiency and speed of fp16. It appears that I would only need two files from your repo: the above-mentioned adahessian.py and adahessian/transformer/fairseq/optim/fairseq_optimizer.py. Does that sound right? Do you have any lessons learned from using this version that you could pass along? Thanks for your help!
Hi, I recently started using the version of AdaHessian from https://github.com/jettify/pytorch-optimizer in the facebookresearch ParlAI system to see how it works for training chatbots. I am not very experienced in the discipline, so please excuse my clumsy use of the terminology here. It seems the training approach they use divides the training data into mini-batches. In a given training epoch, they cycle through the mini-batches: for each mini-batch they compute and backpropagate the loss to get the gradient of the loss with respect to the model parameters, and then take a gradient-descent step to update the parameters. I haven't seen any discussion of using batches with AdaHessian. Does that mean that AdaHessian doesn't work with this batching approach, and that all the training samples should be used to compute the loss and its gradient?
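To make the setup concrete, this is roughly the loop I mean (a generic PyTorch sketch, not ParlAI's actual code; all names are placeholders):

```python
# Rough sketch of the per-mini-batch loop described above
# (generic PyTorch, not ParlAI's actual code; names are placeholders).
for epoch in range(num_epochs):
    for batch in train_loader:               # one mini-batch per step
        optimizer.zero_grad()
        loss = compute_loss(model, batch)    # loss on this mini-batch only
        loss.backward()                      # gradient w.r.t. the model parameters
        optimizer.step()                     # parameter update from this mini-batch
```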
Also, can you please confirm that the version of AdaHessian in pytorch-optimizer is the most current version of the code?
Thanks!