aws / amazon-sagemaker-feedback

Amazon SageMaker Public Feedback Dashboard
Creative Commons Attribution Share Alike 4.0 International

Error: DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` or with passing a `device_map` #24

Open jasel-lewis opened 11 months ago

jasel-lewis commented 11 months ago

Product Version

Issue Description

I was using SageMaker Studio to domain-train a model (base model: huggingface-llm-mistral-7b) on an ml.g5.24xlarge instance. I left all values at their defaults, except that I pointed the job to specific buckets for the training data and for the trained model output, and adjusted the hyperparameters with:

At just over an hour (3,909 seconds) into the training run, I received the error:

AlgorithmError: ExecuteUserScriptError: ExitCode 1 ErrorMessage "raise ValueError(
ValueError DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` or with passing a `device_map`.
ERROR:root:Subprocess script failed with return code: 1
Traceback (most recent call last)
File "/opt/conda/lib/python3.10/site-packages/sagemaker_jumpstart_script_utilities/subprocess.py", line 9, in run_with_error_handling
subprocess.run(command, shell=shell, check=True)
File "/opt/conda/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['deepspeed', '--num_gpus=4', '/opt/conda/lib/python3.10/site-packages/sagemaker_jumpstart_huggingface_script_utilities/fine_tuning/run_clm.py', '--deepspeed', 'ds_config.json', '--model_name_or_path', '/tmp', '--train_file', '/opt/ml/input/data/training', '--do_train', '--output_dir', '/opt/ml/model', '--num_train_epochs', '3', '--gradient_accumulation_steps', '8', '--per_device_train_batch_siz

I came across a post about this specific error, but I don't believe the settings involved are values I can adjust via SageMaker Studio.

Any thoughts on this?
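To make the conflict named in the error concrete, here is a minimal sketch of the kind of check that would produce it. This is an illustration only, not the actual code in the JumpStart script or in the Hugging Face library. The helper name `zero3_conflicts` is hypothetical, and the contents of `ds_config.json` in the JumpStart container are unknown to me; the sketch only assumes the standard DeepSpeed layout where the ZeRO stage lives under `zero_optimization.stage`.

```python
def zero3_conflicts(ds_config, load_kwargs):
    """Hypothetical helper: list model-loading options that clash with ZeRO stage 3.

    ds_config   -- dict parsed from a DeepSpeed JSON config (layout assumed)
    load_kwargs -- dict of keyword arguments intended for model loading
    """
    # A config without a zero_optimization section is treated as stage 0.
    stage = ds_config.get("zero_optimization", {}).get("stage", 0)
    if stage != 3:
        return []

    conflicts = []
    # The error message names exactly these two options.
    if load_kwargs.get("low_cpu_mem_usage"):
        conflicts.append("low_cpu_mem_usage")
    if load_kwargs.get("device_map") is not None:
        conflicts.append("device_map")
    return conflicts


if __name__ == "__main__":
    # Assumed example values, not taken from the failing job.
    config = {"zero_optimization": {"stage": 3}}
    kwargs = {"low_cpu_mem_usage": True, "device_map": "auto"}
    print(zero3_conflicts(config, kwargs))
```

If a check like this is what fires, the fix would be to either drop both options from model loading or use a ZeRO stage other than 3, which is why it matters whether Studio exposes those settings.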

Expected Behavior

Expected the model to be domain-trained successfully.

Observed Behavior

Observed the error identified in the Issue Description section.

Product Category

JumpStart

Feedback Category

Reliability and Stability

Other Details

No response

poojak13 commented 11 months ago

Hi @jasel-lewis, thanks for raising this. I will pull in someone who can answer this.

jasel-lewis commented 11 months ago

@poojak13 Wonderful! Any help is greatly appreciated, thank you...

FYSA @shieldsjared

jasel-lewis commented 10 months ago

Update to reference a similar re:Post thread.

jasel-lewis commented 10 months ago

Update: Converted to AWS support ticket for faster resolution.

ashwaniyadav09 commented 10 months ago

Facing the same issue.