bigcode-project / octopack

🐙 OctoPack: Instruction Tuning Code Large Language Models
https://arxiv.org/abs/2308.07124
MIT License
420 stars · 27 forks

Megatron-LM fine-tuning: No such file or directory model_optim_rng.pt #14

Open zxyscz opened 1 year ago

zxyscz commented 1 year ago

I want to fine-tune with Megatron-LM, but when I run the process I get an error: No such file or directory: `model_optim_rng.pt`.

Muennighoff commented 1 year ago

Hmm, do you have `--no_load_optim` and `--no_load_rng` in your script?

`model_optim_rng.pt` files are not needed and not in the checkpoint, I think.
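For context, those flags tell Megatron-LM to skip restoring optimizer and RNG state from the checkpoint, which is why the missing `model_optim_rng.pt` stops mattering. A minimal sketch of where they go in a launch command (the checkpoint path is hypothetical, and note that upstream Megatron-LM spells the flags with hyphens, `--no-load-optim` / `--no-load-rng`; check your fork's argument parser):

```shell
# Hedged sketch of a Megatron-LM fine-tuning launch (excerpt only).
# /path/to/starcoder-megatron-checkpoint is a placeholder, not a real path.
torchrun pretrain_gpt.py \
    --load /path/to/starcoder-megatron-checkpoint \
    --finetune \
    --no_load_optim \
    --no_load_rng \
    ...  # plus your usual model/data/training arguments
```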

zxyscz commented 1 year ago

> Hmm, do you have `--no_load_optim` and `--no_load_rng` in your script?
>
> `model_optim_rng.pt` files are not needed and not in the checkpoint, I think.

It runs now, but I don't know how to merge the many checkpoint partitions and convert them to Hugging Face format. Can you help me?

Muennighoff commented 1 year ago

This is the script for merging & converting: https://github.com/bigcode-project/octopack/blob/4f0e261d11d41ca62ce09204716338889b7800f4/training/convert_large.sh#L4. Let me know if it does not work for you.
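For orientation, such scripts typically do two things: first merge the tensor/pipeline-parallel partitions into a single Megatron checkpoint, then map the merged weights to Hugging Face layout. A heavily hedged sketch of that shape — the tool name, script name, and paths below are assumptions based on common Megatron-LM tooling, and the linked `convert_large.sh` is the authoritative version:

```shell
# Hypothetical sketch only; consult training/convert_large.sh for the
# real invocation used by this repo.
# Step 1: merge model-parallel partitions (tool name is an assumption;
# Megatron-LM forks ship it as e.g. tools/checkpoint_util.py).
python tools/checkpoint_util.py \
    --model-type GPT \
    --load-dir checkpoints/iter_0025000 \
    --save-dir merged/ \
    --target-tensor-parallel-size 1 \
    --target-pipeline-parallel-size 1

# Step 2: convert the merged Megatron weights to Hugging Face format
# (script name is a placeholder for the fork's converter).
python convert_megatron_to_hf.py --input merged/ --output hf_model/
```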

zxyscz commented 1 year ago

Yes, but which branch should I `git clone`? Is it https://github.com/bigcode-project/Megatron-LM/pull/40? When I use the mtf branch, it does not run.

Muennighoff commented 1 year ago

Yeah, that one. It is already merged into main, so you can probably also use the main branch. It was merged slightly after the mtf branch was created, hence the code is not in the mtf branch, but you could also merge main into the mtf branch if you want to.

zxyscz commented 1 year ago

> Yeah, that one. It is already merged into main, so you can probably also use the main branch. It was merged slightly after the mtf branch was created, hence the code is not in the mtf branch, but you could also merge main into the mtf branch if you want to.

Thanks!

zxyscz commented 1 year ago

> Yeah, that one. It is already merged into main, so you can probably also use the main branch. It was merged slightly after the mtf branch was created, hence the code is not in the mtf branch, but you could also merge main into the mtf branch if you want to.

I have a question: I find that the HumanEval Python pass@1 score drops a lot after fine-tuning.

Muennighoff commented 1 year ago

> I have a question: I find that the HumanEval Python pass@1 score drops a lot after fine-tuning.

Yeah, that's why we only fine-tune for a few steps; e.g. OctoCoder is fine-tuned on only 2M tokens.
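As a rough sanity check on what a 2M-token budget means in optimizer steps — the batch size and sequence length below are hypothetical illustrations, not the actual OctoCoder settings:

```python
# Back-of-the-envelope: steps implied by a 2M-token fine-tuning budget.
tokens_budget = 2_000_000
seq_len = 2048          # hypothetical sequence length
global_batch_size = 32  # hypothetical sequences per optimizer step

tokens_per_step = global_batch_size * seq_len  # 65,536 tokens/step
steps = tokens_budget // tokens_per_step
print(steps)  # -> 30
```

With these (assumed) settings the whole fine-tuning run is only a few dozen steps, which is why catastrophic drops on HumanEval stay limited.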

zxyscz commented 1 year ago

> Yeah, that's why we only fine-tune for a few steps; e.g. OctoCoder is fine-tuned on only 2M tokens.

In addition, I evaluated the starcoderbase 25000-step checkpoint; its HumanEval pass@1 was 25%, which is lower than 30%. Is it because the open-source starcoderbase-megatron checkpoint is not the final one?

Muennighoff commented 1 year ago

What script are you using to evaluate it? That may explain the small difference. It should be the final checkpoint.

zxyscz commented 1 year ago

> What script are you using to evaluate it? That may explain the small difference. It should be the final checkpoint.

First I convert the checkpoint to HF format, then evaluate with greedy decoding.

zxyscz commented 1 year ago

I converted it to HF format; is this right? [screenshot]

Muennighoff commented 1 year ago

Yeah, that looks correct. I think for pass@1 on HumanEval, StarCoder is evaluated using `temperature=0.2`. Also, I would set `n_samples=20`.
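Those settings can be passed straight to the bigcode-evaluation-harness CLI. A hedged sketch — the model path is a placeholder, and flag names may shift between harness versions, so check that repo's README:

```shell
# Hedged sketch: HumanEval pass@1 evaluation with bigcode-evaluation-harness.
# /path/to/hf_model is a placeholder for your converted checkpoint.
accelerate launch main.py \
    --model /path/to/hf_model \
    --tasks humaneval \
    --temperature 0.2 \
    --n_samples 20 \
    --allow_code_execution
```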

zxyscz commented 1 year ago

> Yeah, that looks correct. I think for pass@1 on HumanEval, StarCoder is evaluated using `temperature=0.2`. Also, I would set `n_samples=20`.

I converted the Megatron model to HF as shown below, but loading the model is slow. [screenshot] How can I convert the model into many partitions like this? [screenshot]

Muennighoff commented 1 year ago

You have to shard it into multiple files when saving it
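In `transformers` this is built into `save_pretrained` via the `max_shard_size` argument. A sketch on a deliberately tiny GPT-2 config so it runs anywhere — for a StarCoder-sized model you would keep the default or use something like `"10GB"` rather than the tiny cap used here:

```python
# Sketch: transformers shards weights automatically when max_shard_size
# is passed to save_pretrained. Tiny model + tiny cap just to demonstrate.
import os
from transformers import GPT2Config, GPT2LMHeadModel

model = GPT2LMHeadModel(GPT2Config(
    n_layer=2, n_embd=64, n_head=2, n_positions=64, vocab_size=128))

out_dir = "sharded_model"
model.save_pretrained(out_dir, max_shard_size="100KB")  # tiny cap -> many shards

files = sorted(os.listdir(out_dir))
# Expect several weight shards plus an index file mapping tensors to shards.
print(any(f.endswith("index.json") for f in files))  # -> True
```

Loading with `from_pretrained` then streams the shards one by one instead of one huge file, which is also what makes the multi-partition layout in your second screenshot load faster.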

zxyscz commented 1 year ago

> You have to shard it into multiple files when saving it

How do I shard it into multiple files? Is there any code to refer to?