Closed by ayeeyecorp 1 year ago
@tangbinh have you personally confirmed the new checksums updated here, outside of the work @mawilson1234 has done in #702?
Yes. The checksums mentioned in the README are the ones we have from our end.
Have you made sure to include --skip-optimizer-state True and --unflatten-weights when running reshard_fsdp? Also, have you tried to load the resharded checkpoint and check the generated outputs?
@tangbinh the parameters were set correctly. However, I now see that the setup instructions were just updated for CUDA 11.6.
Original instructions:
pip3 install torch==1.10.1+cu113 torchvision==0.11.2+cu113 torchaudio==0.10.1+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
# install apex
git clone https://github.com/NVIDIA/apex.git
cd ~/apex
git checkout 265b451de8ba9bfcb67edc7360f3d8772d0a8bea
pip3 install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--deprecated_fused_adam" --global-option="--xentropy" --global-option="--fast_multihead_attn" ./
# install megatron (requires min. 1 GPU or get error with install)
git clone --branch fairseq_v3 https://github.com/ngoyal2707/Megatron-LM.git
cd ~/Megatron-LM
pip3 install six regex
pip3 install -e .
# Install fairscale
git clone https://github.com/facebookresearch/fairscale.git
cd ~/fairscale
git checkout fixing_memory_issues_with_keeping_overlap_may24
pip3 install -e .
# Install metaseq
git clone https://github.com/facebookresearch/metaseq.git
cd ~/metaseq
pip3 install -e .
# turn on pre-commit hooks
pre-commit install
The original instructions were different: they used PyTorch built for CUDA 11.3, plus apex and Megatron. Could that have been the issue? I will try resharding now and post the results.
That was it: I was using outdated setup instructions. Thanks, @tangbinh.
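For anyone hitting the same mismatch, a quick way to see whether an environment was built from the older instructions is to print the installed package versions. The snippet below is only a sketch: the helper name installed_versions is hypothetical, and the distribution names apex, fairscale and metaseq are assumptions about how those packages register themselves with pip (torch, torchvision and torchaudio come from the pip3 line above).

```python
from importlib import metadata


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed under this name in the current environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Names after torchaudio are assumed distribution names, not confirmed ones.
    names = ["torch", "torchvision", "torchaudio", "apex", "fairscale", "metaseq"]
    for name, version in installed_versions(names).items():
        print(f"{name}: {version if version is not None else 'not installed'}")
```

A torch version ending in +cu113 would suggest the environment still follows the older CUDA 11.3 instructions.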
@tangbinh have you personally confirmed the new checksums updated here, outside of the work @mawilson1234 has done in #702? Using the setup instructions followed by the reshard_fsdp instructions, I received the following new md5sum checksums for the reshards:
I have also confirmed that the md5sums for the 992 original files are all correct. My environment setup is identical to @mawilson1234's above, aside from Python 3.9 running in conda on EC2. I am also running CUDA 11.3.1 and have an NVIDIA driver installed; otherwise the Megatron and metaseq packages won't compile, since they need nvcc and at least one GPU. If the checksums are indeed correct, I am not sure what could be wrong, since all I changed were the input and output directories. Thoughts?
Additionally, will the updates in #698 cause reshard_fsdp not to work correctly if I have copies of the 8 resharded checkpoints with the original checksums?