aws / amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.
https://sagemaker-examples.readthedocs.io
Apache License 2.0

cannot import name 'checkpoint' from 'torch.sagemaker.distributed.fsdp' #4772

Open RoshaanNbs opened 4 weeks ago


Link to the notebook Add the link to the notebook.

Describe the bug I am trying to run Llama 3.1 8B Instruct on AWS SageMaker using SMP V2. In the release of 17 October 2024, the SMP library team published SMP V2.6, which fixes the RoPE error encountered in previous releases. The Docker image provided for that release, "658645717510.dkr.ecr.us-east-1.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121", raises a new error whenever I pass it as the estimator's URI parameter: "cannot import name 'checkpoint' from 'torch.sagemaker.distributed.fsdp'". If I use any earlier Docker image (SMP V2 below 2.6), training fails with "rope scaling not supported yet". The SMP V2.6 release notes specifically say this is resolved, but the new Docker image does not seem to be compatible with the rest of the training files.
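The image tag above carries the number 2.4.1 while the release is called SMP V2.6, so it can help to split the URI into its parts before comparing it against the release notes. The following is a minimal sketch, assuming the tag follows a `<version>-<device>-py<digits>-cu<digits>` pattern; `parse_image_uri` and `ImageTag` are hypothetical helper names, not part of the SageMaker SDK, and what the leading version number refers to should be checked against the release notes.

```python
import re
from typing import NamedTuple


class ImageTag(NamedTuple):
    """Parts of a container image URI (hypothetical helper type)."""
    repository: str
    version: str
    device: str
    python: str
    cuda: str


# Assumed tag layout: <version>-<device>-py<digits>-cu<digits>
_TAG_RE = re.compile(
    r"^(?P<version>\d+(?:\.\d+)*)-(?P<device>[a-z]+)-py(?P<py>\d+)-cu(?P<cuda>\d+)$"
)


def parse_image_uri(uri: str) -> ImageTag:
    """Split '<registry>/<repository>:<tag>' and decode the tag fields."""
    path, _, tag = uri.rpartition(":")
    if not path:
        raise ValueError(f"no tag found in image URI: {uri!r}")
    repository = path.rsplit("/", 1)[-1]
    match = _TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"tag does not match the assumed pattern: {tag!r}")
    return ImageTag(
        repository=repository,
        version=match["version"],
        device=match["device"],
        python=match["py"],
        cuda=match["cuda"],
    )


if __name__ == "__main__":
    uri = (
        "658645717510.dkr.ecr.us-east-1.amazonaws.com/"
        "smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
    )
    print(parse_image_uri(uri))
```

Printing the parsed fields next to the versions pinned in the training scripts makes a mismatch between the container and the code easier to spot.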

Here is the link to the latest release: sagemaker release notes

To reproduce A clear, step-by-step set of instructions to reproduce the bug.

  1. Clone my repo on AWS SageMaker. The main file to run is llama31.ipynb.
  2. You need a Hugging Face token for Llama 3.1.
  3. Set up the S3 bucket paths.
  4. In llama31.ipynb, the estimator code uses the latest Docker image described above.
  5. Run llama31.ipynb.

Logs If applicable, add logs to help explain your problem.

Traceback (most recent call last):
  File "/home/ec2-user/SageMaker/b-evaluation/Far/fsdp-without-LlamaFactory/1_Default_LLAMA31.py", line 320, in <module>
    smp_estimator.fit(inputs=data_channels)
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/sagemaker/workflow/pipeline_context.py", line 346, in wrapper
    return run_func(*args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/sagemaker/estimator.py", line 1370, in fit
    self.latest_training_job.wait(logs=logs)
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/sagemaker/estimator.py", line 2742, in wait
    self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/sagemaker/session.py", line 5945, in logs_for_job
    _logs_for_job(self, job_name, wait, poll, log_type, timeout)
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/sagemaker/session.py", line 8547, in _logs_for_job
    _check_job_status(job_name, description, "TrainingJobStatus")
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/sagemaker/session.py", line 8611, in _check_job_status
    raise exceptions.UnexpectedStatusException(
sagemaker.exceptions.UnexpectedStatusException: Error for Training job baykar-fsdp-without-llamafactory-model--2024-10-25-11-30-46-651: Failed.
Reason: AlgorithmError: ExecuteUserScriptError: ExitCode 1 ErrorMessage "ImportError: cannot import name 'checkpoint' from 'torch.sagemaker.distributed.fsdp' (/opt/conda/lib/python3.11/site-packages/torch/sagemaker/distributed/fsdp/__init__.py)
Traceback (most recent call last)
  File "/opt/ml/code/train.py", line 7, in <module>
    import train_lib
  File "/opt/ml/code/train_lib.py", line 21, in <module>
    from checkpoints import (
  File "/opt/ml/code/checkpoints.py", line 25, in <module>
    from torch.sagemaker.distributed.fsdp import checkpoint as tsm_fsdp_checkpoint
ImportError: cannot import name 'checkpoint' from 'torch.sagemaker.distributed.fsdp' (/opt/conda/lib/python3.11/site-packages/torch/sagemaker/distributed/fsdp/__init__.py)
Check troubleshooting guide for common errors: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-python-sdk-troubleshooting.html
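To narrow down an ImportError like the one in the log, it may help to list what the package actually exposes inside the container before the failing import runs. The sketch below uses only the standard library; `probe_import` is a hypothetical helper name, and its result is an approximation of whether `from <package> import <name>` would succeed (it checks for an attribute or a submodule with that name).

```python
import importlib
import pkgutil


def probe_import(package_name: str, attr: str) -> dict:
    """Report whether `attr` is available from `package_name` and what else is."""
    report = {
        "package": package_name,
        "attr": attr,
        "package_found": False,
        "attr_found": False,
        "location": None,
        "public_names": [],
        "submodules": [],
    }
    try:
        pkg = importlib.import_module(package_name)
    except ImportError as exc:
        # The package itself is missing or fails to import.
        report["error"] = repr(exc)
        return report

    report["package_found"] = True
    report["location"] = getattr(pkg, "__file__", None)
    report["public_names"] = sorted(n for n in dir(pkg) if not n.startswith("_"))
    if hasattr(pkg, "__path__"):
        # Only packages have __path__; list their direct submodules.
        report["submodules"] = sorted(
            m.name for m in pkgutil.iter_modules(pkg.__path__)
        )
    report["attr_found"] = hasattr(pkg, attr) or attr in report["submodules"]
    return report


if __name__ == "__main__":
    # Demonstration with a standard-library package.
    print(probe_import("json", "dumps")["attr_found"])
```

Inside the training container, calling `probe_import("torch.sagemaker.distributed.fsdp", "checkpoint")` at the top of train.py and printing the result would show the installed location and the names the package exposes, which can then be compared with what checkpoints.py expects.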