Describe the bug
I am trying to run Llama 3.1 8B Instruct on AWS SageMaker using SMP v2. On 17 October 2024, the SMP library team released SMP v2.6, which is supposed to fix the RoPE scaling error encountered in previous releases. The Docker image provided for that release, "658645717510.dkr.ecr.us-east-1.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121", produces a new error whenever it is passed to the estimator through the image URI parameter. The error says "cannot import name 'checkpoint' from 'torch.sagemaker.distributed.fsdp'". If I use any Docker image older than SMP v2.6, the job fails with "rope scaling not supported yet" instead. The SMP v2.6 release notes specifically say this is resolved, but the new Docker image does not seem to be compatible with the other files in the training code.
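To make the image reference easier to compare against the release notes, the URI can be split into its parts. This is a minimal sketch using only the standard library; `parse_ecr_image` and `EcrImage` are hypothetical helper names, and the reading that the leading `2.4.1` in the tag is the bundled PyTorch version rather than the SMP library version is an assumption that should be checked against the SMP release notes.

```python
import re
from typing import NamedTuple


class EcrImage(NamedTuple):
    """Parts of an ECR image URI (hypothetical helper type)."""
    account: str
    region: str
    repository: str
    tag: str


# <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
_ECR_PATTERN = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:]+):(?P<tag>.+)$"
)


def parse_ecr_image(uri: str) -> EcrImage:
    """Split an ECR image URI into account, region, repository and tag."""
    match = _ECR_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"not an ECR image URI: {uri!r}")
    return EcrImage(**match.groupdict())


image = parse_ecr_image(
    "658645717510.dkr.ecr.us-east-1.amazonaws.com/"
    "smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
)
# Assumption: the first tag component is the framework (PyTorch) version,
# followed by device type, Python version and CUDA version.
tag_parts = image.tag.split("-")
print(image.repository, tag_parts)
```

If that assumption holds, the `2.4.1` in the tag does not by itself mean the image ships an older SMP release.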
To reproduce
1. Clone my repo on AWS SageMaker. The main file is llama31.ipynb, which is the notebook you will run.
2. Make sure you have a Hugging Face token with access to Llama 3.1.
3. Set up the S3 bucket paths.
4. In llama31.ipynb, the estimator code uses the latest Docker image described above.
5. Run llama31.ipynb.
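For readers without access to the notebook, the estimator setup has roughly the following shape. This is a sketch under assumptions, not the exact code from llama31.ipynb: the role ARN, instance type, `source_dir`, and `hybrid_shard_degree` value are placeholders, and `build_estimator_kwargs` is a hypothetical helper. Only the image URI, the `train.py` entry point, and the `smp_estimator.fit(inputs=data_channels)` call come from the report and the traceback.

```python
from typing import Any, Dict

# Image URI quoted in the bug description.
SMP_IMAGE_URI = (
    "658645717510.dkr.ecr.us-east-1.amazonaws.com/"
    "smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
)


def build_estimator_kwargs(
    role: str,
    instance_type: str = "ml.p4d.24xlarge",  # placeholder instance type
    instance_count: int = 1,
    hybrid_shard_degree: int = 8,  # placeholder SMP v2 sharding degree
) -> Dict[str, Any]:
    """Collect keyword arguments for a SageMaker PyTorch estimator."""
    return {
        "image_uri": SMP_IMAGE_URI,
        "role": role,
        "instance_type": instance_type,
        "instance_count": instance_count,
        "entry_point": "train.py",
        "source_dir": "scripts",  # hypothetical directory name
        "distribution": {
            "torch_distributed": {"enabled": True},
            "smdistributed": {
                "modelparallel": {
                    "enabled": True,
                    "parameters": {"hybrid_shard_degree": hybrid_shard_degree},
                }
            },
        },
    }


# Placeholder role ARN; replace with a real execution role.
kwargs = build_estimator_kwargs(
    role="arn:aws:iam::111122223333:role/ExampleSageMakerRole"
)
print(kwargs["image_uri"])

# With the SageMaker Python SDK installed and AWS credentials configured,
# the job would then be launched along these lines:
#   from sagemaker.pytorch import PyTorch
#   smp_estimator = PyTorch(**kwargs)
#   smp_estimator.fit(inputs=data_channels)
```

Keeping the arguments in a plain dictionary makes it easy to swap the image URI between SMP releases and compare the resulting failures.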
Logs
Traceback (most recent call last):
File "/home/ec2-user/SageMaker/b-evaluation/Far/fsdp-without-LlamaFactory/1_Default_LLAMA31.py", line 320, in <module>
smp_estimator.fit(inputs=data_channels)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/sagemaker/workflow/pipeline_context.py", line 346, in wrapper
return run_func(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/sagemaker/estimator.py", line 1370, in fit
self.latest_training_job.wait(logs=logs)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/sagemaker/estimator.py", line 2742, in wait
self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/sagemaker/session.py", line 5945, in logs_for_job
_logs_for_job(self, job_name, wait, poll, log_type, timeout)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/sagemaker/session.py", line 8547, in _logs_for_job
_check_job_status(job_name, description, "TrainingJobStatus")
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/sagemaker/session.py", line 8611, in _check_job_status
raise exceptions.UnexpectedStatusException(
sagemaker.exceptions.UnexpectedStatusException: Error for Training job baykar-fsdp-without-llamafactory-model--2024-10-25-11-30-46-651: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "ImportError: cannot import name 'checkpoint' from 'torch.sagemaker.distributed.fsdp' (/opt/conda/lib/python3.11/site-packages/torch/sagemaker/distributed/fsdp/__init__.py)
Traceback (most recent call last)
File "/opt/ml/code/train.py", line 7, in <module>
import train_lib
File "/opt/ml/code/train_lib.py", line 21, in <module>
from checkpoints import (
File "/opt/ml/code/checkpoints.py", line 25, in <module>
from torch.sagemaker.distributed.fsdp import checkpoint as tsm_fsdp_checkpoint
ImportError: cannot import name 'checkpoint' from 'torch.sagemaker.distributed.fsdp' (/opt/conda/lib/python3.11/site-packages/torch/sagemaker/distributed/fsdp/__init__.py). Check troubleshooting guide for common errors: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-python-sdk-troubleshooting.html
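One way to narrow this down is to check, inside the container, whether the package exposes the name that `checkpoints.py` imports. The snippet below is a diagnostic sketch only; `probe_import` is a hypothetical helper built on the standard `importlib` module, and it makes no claim about what the SMP v2.6 image actually contains.

```python
import importlib


def probe_import(module_name: str, attr: str) -> str:
    """Report whether `from module_name import attr` would succeed."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        return f"module {module_name!r} not importable: {exc}"
    if hasattr(module, attr):
        return f"{module_name}.{attr} is available"
    # `from pkg import name` also succeeds when `name` is a submodule
    # that has not been imported yet, so try that before giving up.
    try:
        importlib.import_module(f"{module_name}.{attr}")
        return f"{module_name}.{attr} is available as a submodule"
    except ImportError:
        public = sorted(n for n in dir(module) if not n.startswith("_"))
        return f"{module_name} has no {attr!r}; public names: {public}"


# Inside the training container this shows whether the import from
# checkpoints.py line 25 can work and, if not, what the package exposes.
print(probe_import("torch.sagemaker.distributed.fsdp", "checkpoint"))
```

Running this at the top of `train.py` in each image version would show whether `checkpoint` was moved or renamed between SMP releases.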
Here is the link to the latest release: SageMaker release notes