birkhoffkiki / GPFM

The official implementation of GPFM

Questions regarding pretraining. #7

Open boqchen opened 1 month ago

boqchen commented 1 month ago

Hi,

Thanks for the great work. It seems like the pretraining takes a long time. I would like to run the pretraining but I cannot submit a job for such a long time. I was wondering if it is possible to resume the training from a checkpoint.

Thanks in advance!

birkhoffkiki commented 1 month ago

Do you mean that you want the training checkpoint rather than the pretrained model? Choice 1: you could load the pretrained weights to initialize both the teacher and the student; this may be helpful. Choice 2: we can share the checkpoint with you, but it was saved for 2 nodes and it is pretty huge.
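A minimal sketch of choice 1, assuming a DINOv2-style meta-architecture where `model` holds separate `student` and `teacher` module dicts with a `backbone` entry (the filename and attribute names here are assumptions, not necessarily GPFM's exact layout):

```python
import torch

# Hypothetical filename; use the released GPFM weight file.
pretrained = torch.load("vitl14_pretrained.pth", map_location="cpu")

# Initialize both branches from the same released weights so the
# EMA teacher does not start from random parameters.
model.student.backbone.load_state_dict(pretrained, strict=False)
model.teacher.backbone.load_state_dict(pretrained, strict=False)
```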

boqchen commented 1 week ago

Hi @birkhoffkiki. I am sorry for the late reply.

I have a single node with 4 H100 GPUs (96 GB). I was wondering if this is sufficient to run GPFM. I tried to run DINOv2 ViT-L/14 and it works for me. Do you also have a checkpoint for ViT-L/14?

Thanks for your time!

birkhoffkiki commented 1 week ago

The released pretrained weights are for ViT-L/14. You can try to load them. There may be a mismatch of keys; you can solve this by renaming the keys in the state dict.
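A hedged sketch of the key renaming, assuming the mismatch is a prefix difference (e.g. the released file stores plain backbone keys while the training code expects a `backbone.` prefix; the exact prefixes and the `"teacher"` nesting are assumptions to check against the actual error message):

```python
import torch

state = torch.load("vitl14_pretrained.pth", map_location="cpu")  # hypothetical filename
state = state.get("teacher", state)  # some releases nest weights under a key

# Remap keys until load_state_dict stops reporting missing/unexpected keys.
renamed = {}
for k, v in state.items():
    k = k.replace("module.", "")   # strip a DDP wrapper prefix, if present
    if not k.startswith("backbone."):
        k = "backbone." + k        # add the prefix the training code expects
    renamed[k] = v

missing, unexpected = model.student.load_state_dict(renamed, strict=False)
print("missing:", missing)
print("unexpected:", unexpected)
```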

boqchen commented 1 week ago

Thanks for your prompt reply and for releasing the model! In your training config, I only saw the pretrained model loaded into the student. Since the teacher is just an EMA update of the student, I was wondering if this is sufficient. (Also, I did not see where I could load the pretrained model into the teacher as well.)
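For reference, the teacher in DINOv2-style training is an exponential moving average of the student, so a teacher initialized from the same weights as the student is pulled toward it at every step anyway. A runnable schematic of the update (the momentum value is illustrative; real runs use a schedule approaching 1.0):

```python
import torch
from torch import nn

student = nn.Linear(8, 8)
teacher = nn.Linear(8, 8)
teacher.load_state_dict(student.state_dict())  # init teacher from student

m = 0.996  # illustrative momentum
with torch.no_grad():
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(m).add_(p_s, alpha=1 - m)  # teacher = m*teacher + (1-m)*student
```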

birkhoffkiki commented 6 days ago

I can provide the following checkpoint for you, but it is for 2 nodes (16 GPUs): model_0176249.rank_0.pth, model_0176249.rank_1.pth, ..., model_0176249.rank_15.pth. If you can convert these into a "normal" checkpoint and still need it, let me know and I can share a OneDrive link.
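If these shards are FSDP local state dicts as the DINOv2 training code saves them, one possible consolidation route is to load each rank's shard back into the FSDP-wrapped training model and re-save a full state dict. A sketch, untested here: it must run under the same world size the checkpoint was saved with (16 ranks), and the `"model"` key layout inside each shard is an assumption:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import StateDictType, FullStateDictConfig

rank = dist.get_rank()
shard = torch.load(f"model_0176249.rank_{rank}.pth", map_location="cpu")

# Load this rank's shard back into the FSDP-wrapped training model.
with FSDP.state_dict_type(model, StateDictType.LOCAL_STATE_DICT):
    model.load_state_dict(shard["model"])  # key layout is an assumption

# Gather a consolidated ("normal") state dict on rank 0.
cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
    full_state = model.state_dict()
if rank == 0:
    torch.save(full_state, "model_0176249_consolidated.pth")
```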

boqchen commented 2 days ago

That would be great! Thanks.