facebookresearch / metaseq

Repo for external large-scale work

Confirm md5sums after running reshard_fsdp.py on OPT-175B #702 #711

Closed ayeeyecorp closed 1 year ago

ayeeyecorp commented 1 year ago

@tangbinh have you personally confirmed the new checksums updated here, outside of the work @mawilson1234 has done in #702? Following the setup instructions and then the reshard_fsdp instructions, I got the following md5sum checksums for the reshards:

b0ea734369ec223ebdaea79929ea9ff8  reshard-model_part-0.pt
0782475925ef19a125614aeaf55cbfcb  reshard-model_part-1.pt
c29569635ec1d6ef7348b9331f7eeafb  reshard-model_part-2.pt
e2a3fa645c52ad8efc9b91a7d8912431  reshard-model_part-3.pt
04b2d47e42ef253c84e5c6748ccf5789  reshard-model_part-4.pt
40e3bb060ad75c915658b483a6021850  reshard-model_part-5.pt
84d0a78282e3d9fc21fff71b05b21661  reshard-model_part-6.pt
4cd9a318c75edea1584b610cc99dd0c8  reshard-model_part-7.pt

I have also confirmed that the md5sums for the 992 original files are all correct. My environment setup is identical to @mawilson1234's above, apart from Python 3.9 running in conda on EC2; I am also running CUDA 11.3.1 with the NVIDIA driver installed, since the Megatron and metaseq packages won't compile without nvcc and at least one GPU. If the checksums in the README are indeed correct, I am not sure what could be wrong, since the only other things I changed were the input and output directories. Thoughts?
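For reference, I computed and compared the sums roughly like this (the paths and the name of the expected-checksums file are just placeholders, not anything from the repo):

cd /path/to/resharded/checkpoints
# print the sums for the 8 resharded files
md5sum reshard-model_part-*.pt
# or compare against a saved list of expected sums ("<md5>  <filename>" per line)
md5sum -c expected_md5sums.txt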

Additionally, will the updates in #698 cause reshard_fsdp not to work correctly if I have copies of the 8 resharded checkpoints with the original checksums?

tangbinh commented 1 year ago

@tangbinh have you personally confirmed the new checksums updated here, outside of the work @mawilson1234 has done in #702?

Yes. The checksums mentioned in the README are the ones we have from our end.

Have you made sure to include --skip-optimizer-state True and --unflatten-weights when running reshard_fsdp? Also, have you tried to load the resharded checkpoint and check the generated outputs?
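For reference, the resharding command should look roughly like the sketch below. The --skip-optimizer-state and --unflatten-weights flags are the important part; the module path, the --input/--output/--num-output-shards argument names, and the paths are written from memory and should be double-checked against the reshard_fsdp instructions:

# illustrative only; verify argument names against the reshard_fsdp docs
for j in {0..7}; do
    python -m metaseq.scripts.reshard_fsdp \
        --input "/path/to/raw/checkpoints/checkpoint_last-model_part-$j-shard*.pt" \
        --output "/path/to/resharded/checkpoints/reshard-model_part-$j.pt" \
        --num-output-shards 1 \
        --skip-optimizer-state True \
        --unflatten-weights True
done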

ayeeyecorp commented 1 year ago

@tangbinh the parameters were set correctly. However, I now see that the setup instructions were just updated for CUDA 11.6.
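Presumably that update swaps the PyTorch pins for cu116 wheels along these lines (the version numbers below are my assumption, not taken from the repo; follow the current setup doc for the exact pins):

# PyTorch built against CUDA 11.6; versions shown are illustrative
pip3 install torch==1.12.1+cu116 torchvision==0.13.1+cu116 torchaudio==0.12.1+cu116 \
    --extra-index-url https://download.pytorch.org/whl/cu116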

Original instructions:

pip3 install torch==1.10.1+cu113 torchvision==0.11.2+cu113 torchaudio==0.10.1+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html

# install apex
git clone https://github.com/NVIDIA/apex.git
cd ~/apex
git checkout 265b451de8ba9bfcb67edc7360f3d8772d0a8bea
pip3 install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--deprecated_fused_adam" --global-option="--xentropy" --global-option="--fast_multihead_attn" ./

# install megatron (requires at least 1 GPU, otherwise the install errors out)
git clone --branch fairseq_v3 https://github.com/ngoyal2707/Megatron-LM.git
cd ~/Megatron-LM
pip3 install six regex
pip3 install -e .

# Install fairscale
git clone https://github.com/facebookresearch/fairscale.git
cd ~/fairscale
git checkout fixing_memory_issues_with_keeping_overlap_may24
pip3 install -e .

# Install metaseq
git clone https://github.com/facebookresearch/metaseq.git
cd ~/metaseq
pip3 install -e .
# turn on pre-commit hooks
pre-commit install

The original instructions (above) used PyTorch built for CUDA 11.3 together with apex and Megatron. Could that have been the issue? I'll try resharding with the updated setup now and post the results.
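As a quick sanity check on the rebuilt environment before resharding, something like this should confirm the CUDA build of PyTorch and that the source-installed packages import (just an illustrative check, not from the setup doc):

# print the torch version, the CUDA version it was built against, and GPU availability
python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
# make sure the compiled packages import cleanly
python3 -c "import apex, megatron, fairscale, metaseq; print('imports ok')"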

ayeeyecorp commented 1 year ago

That was it - I was using the outdated setup instructions. Thanks @tangbinh