NJU-LHRS / official-CMID

The official implementation of paper "Unified Self-Supervised Learning Framework for Remote Sensing Images".

Log file and Learning rate #12

Closed dbsdmlgus50 closed 1 year ago

dbsdmlgus50 commented 1 year ago

Thank you for your good research!

We would like to build on your research in several follow-up studies. However, with SSL it is difficult to check partway through whether training is heading in the right direction. Could you therefore share the log file from pre-training Swin on MillionAID, among the experiments you have conducted?

Additionally, I'm trying to run your code, but training is not going well because of the learning-rate setting. I plan to put a batch of 256 images on each of 4 GPUs. How should I set the learning rate in this case?

I look forward to hearing from you and thank you.


pUmpKin-Co commented 1 year ago

Hi~ Glad to see that our work will be helpful. The log files of Swin-B pre-trained on MillionAID can be found here. The effective batch size in your case is 256 x 4 = 1024. If the original setting does not work out, you can comment out line 131 in main_pretrain.py and set the learning rate to 0.0625 * sqrt(1024 / 256). This setting follows the recommendation of the Adan optimizer. If training is still unstable, you can try the AdamW optimizer first and then set the Adan learning rate to 5-10 times the AdamW one. If the problem persists, decrease the learning rate further. Hope my suggestions help you.
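For reference, the square-root scaling rule suggested above can be sketched as follows (the `scale_lr` helper is illustrative, not part of the CMID codebase):

```python
import math

# Sketch of the sqrt learning-rate scaling rule: scale a base LR tuned for a
# reference batch size by sqrt(total_batch / base_batch).
# (Hypothetical helper for illustration only.)
def scale_lr(base_lr: float, total_batch: int, base_batch: int = 256) -> float:
    return base_lr * math.sqrt(total_batch / base_batch)

# 4 GPUs x 256 images each = effective batch size of 1024
lr = scale_lr(0.0625, 4 * 256)
print(lr)  # 0.125
```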

dbsdmlgus50 commented 1 year ago

Thank you for your quick and kind reply.

Looking at the log file, you set the number of epochs to 400, but the paper says MillionAID was pre-trained for 200 epochs. Did you set it to 400 and then stop training at epoch 200?

pUmpKin-Co commented 1 year ago

Yes, I stopped training since it took far longer than I expected. Even so, I found the results satisfactory after evaluating on downstream tasks.

dbsdmlgus50 commented 1 year ago

Thank you so much. If I have any additional questions, I'll ask again :)

dbsdmlgus50 commented 1 year ago

Excuse me, could you provide me with a reconstruction image?

I would appreciate an image reconstructed at a later epoch (e.g., epoch90_iter514.png, etc.).

Thank you

pUmpKin-Co commented 1 year ago

Hi~ Here is the reconstructed image from the last epoch (epoch199_iter2999.png). (image: epoch199_iter2999)

dbsdmlgus50 commented 1 year ago

Thank you for your answer and for sharing the image! But why are so few patches masked?

If I run the code you provided, will a larger portion of the image be masked?

pUmpKin-Co commented 1 year ago

Hi~ The original experiments on MillionAID were run on our university's High Performance Computing Center, and I didn't save the reconstructed results.

The result above is from a recent experiment that uses SwinV2 as the backbone with the mask ratio set to 0.3. Moreover, a month ago we pre-trained a new vision backbone (we cannot provide details, as the backbone is still under experimentation) using the original CMID setting on the MillionAID dataset. The reconstructed image from the last epoch (again, epoch199_iter2999) is shown below. Hope our efforts help you. (image: epoch199_iter2999)
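The random patch masking being discussed (a fixed fraction of patches hidden per image) can be sketched as below; the function name and details are illustrative, not the actual CMID masking code:

```python
import numpy as np

# Illustrative sketch of random patch masking at a given mask ratio.
# A lower ratio (e.g. 0.3 instead of 0.6) leaves more of the image visible,
# which is why the reconstruction above shows fewer masked regions.
def random_mask(num_patches: int, mask_ratio: float, rng=None) -> np.ndarray:
    """Return a boolean mask with True at masked patch positions."""
    rng = rng or np.random.default_rng()
    num_masked = int(num_patches * mask_ratio)
    ids = rng.permutation(num_patches)      # random order of patch indices
    mask = np.zeros(num_patches, dtype=bool)
    mask[ids[:num_masked]] = True           # hide the first `num_masked`
    return mask

mask = random_mask(196, 0.3)                # 14x14 patches, 30% masked
print(mask.sum())                           # 58
```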

dbsdmlgus50 commented 1 year ago

Thank you for your kind response even while running new experiments!

Reconstruction looked good from epochs 1 to 10, but became strange from epoch 11 onward. Did you run into this problem too? The loss still drops, but the reconstructions look odd.

pUmpKin-Co commented 1 year ago

Hi~ I didn't notice the same problem. But I admit that neither the loss nor the reconstruction quality is a reliable indicator of model performance. In my recent experiments, a small model (Swin-B) is well trained after about 130-140 epochs on the MillionAID dataset; further training does not improve, and may even degrade, performance. However, larger models benefit more from longer pre-training. I encourage you to validate the pre-trained model every epoch with KNN classification to monitor the training process. A possible implementation of KNN classification can be found in the DINO repo, and you can run the whole KNN procedure via a hook at Exploring/hook/knn_eval_hook.py (the uploaded implementation is not perfect, and you may need to adapt it yourself).
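The KNN monitoring idea above can be sketched as follows. This is a minimal cosine-similarity KNN on frozen features in the spirit of the DINO repo's evaluation, not the actual hook from Exploring/hook/knn_eval_hook.py; all names here are illustrative:

```python
import numpy as np

# Minimal sketch of KNN evaluation on frozen backbone features.
def knn_predict(train_feats, train_labels, test_feats, k=20):
    """Majority-vote KNN on L2-normalized features (cosine similarity)."""
    train = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    test = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    sims = test @ train.T                        # cosine similarity matrix
    nn_idx = np.argsort(-sims, axis=1)[:, :k]    # top-k neighbors per query
    preds = [np.bincount(row).argmax()           # majority vote per query
             for row in train_labels[nn_idx]]
    return np.array(preds)

# Toy usage: two well-separated feature clusters
rng = np.random.default_rng(0)
train = np.vstack([rng.normal(0, 0.1, (50, 8)) + 1,
                   rng.normal(0, 0.1, (50, 8)) - 1])
labels = np.array([0] * 50 + [1] * 50)
test = np.array([[1.0] * 8, [-1.0] * 8])
print(knn_predict(train, labels, test, k=5))  # [0 1]
```

Tracking this KNN accuracy per epoch gives a training signal that is more faithful to downstream performance than the pre-training loss or reconstruction quality.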

dbsdmlgus50 commented 1 year ago

Thank you so much :)