Closed yikai-wu closed 4 years ago
Hi,
Yes, your observation is right. The CIFAR training set has 50k data points in total, so if you use only 128 of them to approximate the Hessian quantities (top eigenvalues, trace, ESD), the variance will be quite large.
On CIFAR-10, 4096 data points typically give a very stable result. In the paper, we always use the entire dataset to compute all Hessian information.
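To illustrate why the Hessian batch size matters, here is a minimal toy sketch, not the library's code. It assumes a linear model with squared loss, where the per-sample Hessian is simply `x x^T`, and estimates the top eigenvalue by power iteration on the Hessian averaged over a random subsample. All names (`make_dataset`, `subsampled_hessian`, `top_eigenvalue`, `spread`) and the data distribution are hypothetical choices for the illustration. Unlike a real network, the toy forms the small Hessian matrix explicitly.

```python
import random
import statistics

DIM = 4
SCALES = [3.0, 1.0, 1.0, 1.0]  # toy assumption: true top eigenvalue is 3^2 = 9


def make_dataset(n, rng):
    """Draw n toy samples with independent Gaussian coordinates."""
    return [[rng.gauss(0.0, s) for s in SCALES] for _ in range(n)]


def subsampled_hessian(samples):
    """Average of per-sample Hessians x x^T (linear model, squared loss)."""
    h = [[0.0] * DIM for _ in range(DIM)]
    for x in samples:
        for i in range(DIM):
            for j in range(DIM):
                h[i][j] += x[i] * x[j]
    n = len(samples)
    return [[val / n for val in row] for row in h]


def top_eigenvalue(h, iters=100):
    """Power iteration on a small symmetric positive semidefinite matrix."""
    v = [DIM ** -0.5] * DIM  # unit-norm starting vector
    eig = 0.0
    for _ in range(iters):
        hv = [sum(h[i][j] * v[j] for j in range(DIM)) for i in range(DIM)]
        eig = sum(c * c for c in hv) ** 0.5  # ||Hv|| with ||v|| = 1
        v = [c / eig for c in hv]
    return eig


def spread(hessian_batch_size, dataset, runs=10, seed=0):
    """Mean and standard deviation of the estimate over repeated subsamples."""
    rng = random.Random(seed)
    estimates = [
        top_eigenvalue(subsampled_hessian(rng.sample(dataset, hessian_batch_size)))
        for _ in range(runs)
    ]
    return statistics.mean(estimates), statistics.stdev(estimates)


dataset = make_dataset(50_000, random.Random(1234))
mean_small, std_small = spread(128, dataset)
mean_large, std_large = spread(4096, dataset)
print(f"batch  128: mean={mean_small:.3f} std={std_small:.3f}")
print(f"batch 4096: mean={mean_large:.3f} std={std_large:.3f}")
```

In this toy setting the run-to-run standard deviation should shrink roughly with the square root of the Hessian batch size. A deep network's Hessian is far less well behaved, so the numbers here say nothing quantitative about ResNet34; the sketch only shows the direction of the effect.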
Let me know if you have any further questions.
I tried this code with ResNet34 and ran it multiple times. Because of limited GPU memory, I have to use a mini-batch size of 32 with a Hessian batch size of 128. However, the top eigenvalue and the trace vary a lot between runs. For example, over 10 runs, the top eigenvalue ranges from 159 to 1587, and the trace from 1284 to 5054. I thought this might be due to the small batch size or too few iterations, so I changed the Hessian batch size to 512 and the max iterations to 1024. However, the results over 10 runs were roughly the same as before.
May I know whether this agrees with your results, and whether you have any thoughts on the potential cause of this issue?