Closed satheeshkatipomu closed 2 months ago
@satheeshkatipomu Which tool did you use for "Converted Llama2 70B base model checkpoint from huggingface to nemo format"?
I used the convert_llama_hf_to_nemo.py script to convert the Llama2 70B model from Hugging Face format to NeMo format. Here is the exact command:
python3 -u /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py --input_name_or_path=/workspace/llama2_models --output_path=/workspace/llama2_models/llama2-70b-base.nemo
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.
Is there any solution to this issue?
Unable to fine-tune Llama2 70B with FSDP
I am trying to fine-tune the Llama2 70B model on a dataset. With TP=4 and PP=8 it works fine, but with FSDP on 6 nodes it fails with the error below.
Steps/Code to reproduce bug
Expected behavior
Llama2 70B SFT with FSDP should work fine.
Environment details
Image: nvcr.io/nvidia/nemo:24.03.01.framework
Using a Slurm cluster.
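For anyone comparing the two layouts mentioned in this report, here is a rough sizing sketch. It does not diagnose the failure. It only shows how many GPUs one TP=4, PP=8 model replica occupies and roughly how much sharded training state each GPU would hold under FSDP. The GPUs-per-node count and the bytes-per-parameter figure are assumptions and are not stated in the report; the function names are made up for illustration.

```python
# Rough sizing sketch. GPUS_PER_NODE and BYTES_PER_PARAM are assumptions,
# not values taken from the report above.

def model_parallel_gpus(tp: int, pp: int) -> int:
    """GPUs occupied by one model replica with tensor and pipeline parallelism."""
    return tp * pp


def fsdp_state_gib_per_gpu(n_params: float, bytes_per_param: int, world_size: int) -> float:
    """Sharded weight/gradient/optimizer state per GPU in GiB, ignoring activations."""
    return n_params * bytes_per_param / world_size / 2**30


NODES = 6             # from the report
GPUS_PER_NODE = 8     # assumption: 8 GPUs per node
N_PARAMS = 70e9       # Llama2 70B
# Assumption: mixed-precision Adam, i.e. 2-byte weights and gradients plus
# 4-byte master weights and two 4-byte Adam moments (2 + 2 + 4 + 4 + 4 = 16).
BYTES_PER_PARAM = 16

world_size = NODES * GPUS_PER_NODE
print("GPUs per TP=4, PP=8 replica:", model_parallel_gpus(4, 8))
print("FSDP world size:", world_size)
print("Approx. sharded state per GPU (GiB):",
      round(fsdp_state_gib_per_gpu(N_PARAMS, BYTES_PER_PARAM, world_size), 1))
```

Activations, communication buffers, and any temporarily gathered full parameters come on top of this estimate, so actual per-GPU memory use will be higher.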