Closed gramesh-amd closed 2 months ago
cc: @aireenmei @rwitten
It would be great if you could share the maxtext converted ckpt. Would save a lot of time/resources
I think that's our internal bucket for testing. @ZhiyuLi-goog, do you know if we have public maxtext or paxml ckpt for gpt3?
@gramesh-amd @aireenmei
I think this bucket, gs://mlperf-llm-public2,
is a public one, as you can see from the released MLPerf reference implementation.
I was able to read this bucket without any additional access granted.
cc Yuechao @sgpyc, the owner of the bucket, just for double confirmation.
Thanks
@ZhiyuLi-goog, I got the paxml ckpts after asking here
Are there any plans to also share the maxtext ckpt? (The conversion script says it's very resource demanding, so it would be great if you guys could share it.)
@ZhiyuLi-goog Checking one last time whether you guys could share the maxtext gpt3 ckpt.
I've been working on converting it but am running into OOM issues.
@gobbleturk Do you have some info?
We should have a converted ckpt. Let me double-check with the internal team about how to open-source it.
Thanks that would be great
I also managed to convert the ckpt using the convert_gpt3_ckpt_from_paxml.py script, but I have trouble loading it to start training. I'll open a separate issue for this.
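For anyone else hitting OOM during the conversion: one generic workaround (not something the MaxText script does, just a sketch) is to stream large weight arrays through a disk-backed buffer instead of materializing a full copy in RAM, e.g. with numpy's open_memmap. The array shapes and file names below are hypothetical stand-ins for real checkpoint tensors.

```python
import numpy as np
import tempfile, os

def copy_in_chunks(src, out_path, chunk_rows=1024):
    """Copy a large 2-D array to a disk-backed .npy file chunk by chunk,
    so peak extra RAM stays at roughly one chunk's worth."""
    dst = np.lib.format.open_memmap(
        out_path, mode="w+", dtype=src.dtype, shape=src.shape)
    for start in range(0, src.shape[0], chunk_rows):
        stop = min(start + chunk_rows, src.shape[0])
        dst[start:stop] = src[start:stop]  # only this slice is resident
    dst.flush()
    return out_path

# Hypothetical usage with a tiny stand-in for a giant weight matrix.
tmp = tempfile.mkdtemp()
weights = np.arange(12.0).reshape(4, 3)
path = copy_in_chunks(weights, os.path.join(tmp, "w.npy"), chunk_rows=2)
roundtrip = np.load(path, mmap_mode="r")
assert np.array_equal(roundtrip, weights)
```

The same pattern (process one shard or one tensor at a time, write it out, drop the reference) is usually enough to keep host memory bounded during large ckpt conversions.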
Quick Update after checking with internal team.
The checkpoint included in the model artifacts is currently not easy to share outside of Google. Please let me know if you need help with converting or loading the checkpoint.
We cannot load the checkpoint converted using either the main branch or the code that was checked into the MLPerf repo with your MLPerf training submission. The main branch has a problem due to a dimension mismatch - I think this is because the conversion script does not support the pipeline parallelism dimension.
When we try to use the code from the MLPerf training submission (in the maxtext_fork directory) to convert and load the checkpoint using either load_full_state_path or base_output_directory, we get a dictionary mismatch error like this:
1: I0911 08:32:54.909472 139922234083136 checkpointer.py:227] Restoring checkpoint from /mnt/m2m_nobackup/users/user/gpt3-conversion-forked/checkpoints/4000.
1: Traceback (most recent call last):
1: File "/maxtext_fork/MaxText/train.py", line 558, in
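In case it helps debug the dictionary mismatch: a quick way to see exactly which entries differ is to flatten both the expected train-state tree and the restored checkpoint tree into dotted key paths and diff the sets. This is a generic stdlib sketch; the dicts below are hypothetical stand-ins for the real pytrees, not MaxText internals.

```python
def flatten_keys(tree, prefix=""):
    """Flatten a nested dict into dotted key paths, e.g. 'params.decoder.w'."""
    keys = set()
    for k, v in tree.items():
        path = f"{prefix}.{k}" if prefix else k
        if isinstance(v, dict):
            keys |= flatten_keys(v, path)
        else:
            keys.add(path)
    return keys

def diff_trees(expected, restored):
    """Return (keys the restore expects but the ckpt lacks,
               keys the ckpt has that the restore doesn't)."""
    exp, got = flatten_keys(expected), flatten_keys(restored)
    return sorted(exp - got), sorted(got - exp)

# Hypothetical stand-ins for the expected state and the restored ckpt.
expected = {"params": {"decoder": {"w": 0, "b": 0}}, "step": 0}
restored = {"params": {"decoder": {"w": 0}, "pipeline": {"w": 0}}, "step": 0}
missing, extra = diff_trees(expected, restored)
print("missing:", missing)  # → ['params.decoder.b']
print("extra:", extra)      # → ['params.pipeline.w']
```

Running something like this on the two structures named in the restore error usually points straight at the dimension/axis (here, a pipeline-parallelism entry) that the conversion script did or didn't emit.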
Hello,
I am trying to use the paxml-to-maxtext ckpt conversion script but don't seem to have permissions to download the gpt3 ckpt.
So I just wanted to check if there are updated instructions for this, or if I could get view/download access.