why checkpoint-400?

weiweiy commented 9 months ago

Question: why are you uploading the model at checkpoint-400 (https://github.com/QuyAnh2005/neurips-llm-challenge/blob/main/finetune-code/4090/train.py#L130) even though max_steps is set to 450 (https://github.com/QuyAnh2005/neurips-llm-challenge/blob/main/finetune-code/4090/train.py#L108)?

And did the model actually train for 450 steps?

weiweiy commented 9 months ago

Same issue for A100. Let me know if you want me to use checkpoint_450 or keep 400 for final eval

QuyAnh2005 commented 9 months ago

Right, the model is trained for 450 steps. While submissions were open, I did not have the resources to run the HELM evaluation locally. I rented GPUs on runpod.io and usually set max_steps to 500 (more details at https://github.com/QuyAnh2005/neurips-llm-challenge/tree/main/notebooks/finetune). However, when training reached step 400, I uploaded that checkpoint to the Hugging Face repo, evaluated a few examples manually, and submitted it. So the main reasons behind your question are:

For the last submission, I don't remember clearly whether I submitted checkpoint-400 or checkpoint-500 (more likely checkpoint-400). You can also pick any checkpoint between 400 and 450, because I think the difference is not significant.
I want to reproduce the process by which I trained the model as accurately as possible.
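
Since any checkpoint between 400 and 450 is said to be acceptable, here is a minimal sketch of how one might pick a checkpoint in that range. It assumes the training output directory contains folders named `checkpoint-<step>` (as the `checkpoint-400` name suggests); the function names and the `output_dir` argument are hypothetical and are not taken from train.py.

```python
import re
from pathlib import Path
from typing import List, Optional

# Assumption: checkpoints are saved as sub-folders named "checkpoint-<step>".
CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def list_checkpoint_steps(output_dir: str) -> List[int]:
    """Return the sorted step numbers of all checkpoint folders in output_dir."""
    steps = []
    for child in Path(output_dir).iterdir():
        match = CHECKPOINT_RE.match(child.name)
        if child.is_dir() and match:
            steps.append(int(match.group(1)))
    return sorted(steps)


def pick_checkpoint(output_dir: str, low: int = 400, high: int = 450) -> Optional[Path]:
    """Return the latest checkpoint folder with low <= step <= high, or None."""
    candidates = [s for s in list_checkpoint_steps(output_dir) if low <= s <= high]
    if not candidates:
        return None
    return Path(output_dir) / f"checkpoint-{candidates[-1]}"
```

Calling `pick_checkpoint("outputs")` on a hypothetical `outputs` folder returns the latest saved checkpoint in the 400 to 450 range, or `None` if no checkpoint was saved in that range.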

Sorry about the confusion @weiweiy

QuyAnh2005 commented 9 months ago

> Same issue for A100. Let me know if you want me to use checkpoint_450 or keep 400 for final eval

A100: checkpoint 400
4090: checkpoint 450

Is that okay? The reasons are:

400 steps is enough for 1 epoch on the A100
450 steps is enough for 1 epoch on the 4090

This is because the dataset used for the A100 is smaller than the one used for the 4090.
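
To illustrate why a smaller dataset needs fewer steps per epoch, here is a rough sketch of the arithmetic. The dataset sizes, batch size, and gradient accumulation values below are made-up assumptions chosen only so the results come out to 400 and 450; they are not the actual settings from this repo.

```python
import math


def steps_per_epoch(num_examples: int, per_device_batch_size: int,
                    grad_accum_steps: int = 1, num_devices: int = 1) -> int:
    """Optimizer steps needed to see every training example once."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return math.ceil(num_examples / effective_batch)


# Hypothetical numbers for illustration only (not from the repo).
a100_steps = steps_per_epoch(num_examples=6400, per_device_batch_size=4, grad_accum_steps=4)
rtx4090_steps = steps_per_epoch(num_examples=7200, per_device_batch_size=4, grad_accum_steps=4)

print(a100_steps, rtx4090_steps)
```

With the same effective batch size, the run with fewer examples finishes one epoch in fewer steps, which matches the 400 versus 450 split described above.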

QuyAnh2005 / neurips-llm-challenge

why checkpoint-400? #2