weiweiy closed this 9 months ago
Same issue for A100. Let me know if you want me to use checkpoint_450 or keep 400 for final eval
Right, the model was trained for 450 steps. During the submission window, I didn't have the resources to run the HELM evaluation locally. I rented a GPU on runpod.io, and I usually set max_steps to 500 (more detail at https://github.com/QuyAnh2005/neurips-llm-challenge/tree/main/notebooks/finetune). However, when training reached step 400, I uploaded that checkpoint to the Hugging Face repo, evaluated a few examples manually, and submitted it. So, the main reasons for your question:
Sorry about the confusion @weiweiy
A100 - checkpoint 400
4090 - checkpoint 450
Is it okay? Because
Question: why are you uploading the model at checkpoint-400 https://github.com/QuyAnh2005/neurips-llm-challenge/blob/main/finetune-code/4090/train.py#L130 even though your max_steps is set to 450 https://github.com/QuyAnh2005/neurips-llm-challenge/blob/main/finetune-code/4090/train.py#L108
and did the model train for 450 steps?