Open Advaid-Deepak opened 6 months ago
Hi @adityakusupati, this is Prateek Chanda from GRI. @Advaid-Deepak and I have been experimenting with MatFormer-OLMo to try out a few ideas externally, and we ran into some issues when fine-tuning from a MatFormer checkpoint, as shown above.
We would really appreciate it if you could point out any steps we may have missed.
Thanks 😄
Hi Prateek and Advaid,
Thanks for your interest. I am also unsure what is happening here. The MatFormer-OLMo models are also not competitive enough to run experiments on (barring scaling laws) and get meaningful results.
The only good publicly released MatFormer models are the MatViT models in scenic, which are actually SOTA as regular ViT models and work as a drop-in replacement.
Right now I am unable to look at this closely and can only do so after the second week of May. The script and README are what I used to restart my training runs from a checkpoint when something failed, which implies that fine-tuning should work similarly.
Sorry for not being of much help here.
Aditya
We were trying to fine-tune a MatFormer checkpoint (MatFormer-OLMo-180M Link).
We used the following command to call the training script
where the folder passed as load_path was downloaded from the link given in the README for MatFormer-OLMo-180M.
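To make the report easier to reproduce, the contents of the downloaded folder can be listed before launching training. This is only a minimal sketch using the Python standard library; `summarize_checkpoint_dir` is a hypothetical helper name and is not part of the MatFormer-OLMo repo, and it makes no assumption about which files the checkpoint is supposed to contain.

```python
from __future__ import annotations

from pathlib import Path


def summarize_checkpoint_dir(load_path: str) -> list[tuple[str, int]]:
    """Return (relative path, size in bytes) for every file under load_path.

    Hypothetical helper for debugging: it only lists what is on disk and
    does not validate the checkpoint format.
    """
    root = Path(load_path)
    if not root.is_dir():
        raise FileNotFoundError(f"load_path is not a directory: {root}")
    # Walk the folder recursively and keep regular files only.
    return sorted(
        (p.relative_to(root).as_posix(), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file()
    )
```

Calling `summarize_checkpoint_dir` on the folder passed as load_path and pasting the result into the issue shows whether the download is complete and which files the training script is being pointed at.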
However, running this gives us the following error
We are unable to resolve this issue.
We tried adding the following line to torch/distributed/fsdp/_init_utils.py
But this change gives another error, as follows
We also made changes to pile-tiny.yaml, scripts/train.py, and scripts/util.py to make them compatible with training. I am attaching a zip of those files here: changes.zip
Apart from this, we faced another issue
However, we worked around this issue by commenting out the raise statement (in torch/distributed/fsdp/_init_utils.py) as follows
I have attached the entire file in changes.zip, just in case.
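Since the workaround edits a file inside the installed PyTorch package, the exact file location and PyTorch version matter for anyone trying to reproduce it. A minimal sketch, using only the standard library, of how that information could be collected; `describe_module` is a hypothetical helper name and is not part of PyTorch or the MatFormer-OLMo repo.

```python
import importlib.util
from importlib import metadata


def describe_module(module_name: str, dist_name: str) -> dict:
    """Report which file a module would be loaded from and the installed package version."""
    try:
        # find_spec imports parent packages, so it can fail if they are missing.
        spec = importlib.util.find_spec(module_name)
    except ImportError:
        spec = None
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None  # the distribution is not installed in this environment
    return {
        "module": module_name,
        "file": spec.origin if spec is not None else None,
        "version": version,
    }


# The module path below is the one edited in this issue; "torch" is the pip package name.
print(describe_module("torch.distributed.fsdp._init_utils", "torch"))
```

Including this output alongside changes.zip makes it clear which PyTorch release the edited `_init_utils.py` came from.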