aws / sagemaker-pytorch-training-toolkit

Toolkit for running PyTorch training scripts on SageMaker. Dockerfiles used for building SageMaker Pytorch Containers are at https://github.com/aws/deep-learning-containers.
Apache License 2.0
197 stars 87 forks source link

[FATAL tini (7)] exec train failed: No such file or directory #254

Open celsofranssa opened 11 months ago

celsofranssa commented 11 months ago

BUG Description I'm trying to automate and scale a large collection of experiments using AWS SageMamker via Python SDK. However, I am facing an error that does not give any direction to resolve it.

To reproduce

role = "arn:..."

    estimator = PyTorch(
        image_uri="1...ecr...amazonaws.com/...:prototype",
        entry_point="main.py",
        role=role,
        region="us-...",
        instance_type="ml...xlarge",
        instance_count=1,
        volume_size=225,
        hyperparameters=hparams
    )
    estimator.fit()

Expected behavior The model is expected to start to train and log metrics and losses.

Screenshots or logs

[2023-09-10 23:08:59,329][sagemaker][INFO] - Creating training-job with name: xmtc-2023-09-11-02-08-56-094
2023-09-11 02:09:00 Starting - Starting the training job...
2023-09-11 02:09:18 Starting - Preparing the instances for training......
2023-09-11 02:10:27 Downloading - Downloading input data
2023-09-11 02:10:27 Training - Downloading the training image..................
2023-09-11 02:13:33 Training - Training image download completed. Training in progress..[FATAL tini (7)] exec train failed: No such file or directory

2023-09-11 02:14:15 Uploading - Uploading generated training model
2023-09-11 02:14:15 Failed - Training job failed
Error executing job with overrides: []
Traceback (most recent call last):
  File "/home/celso/projects/LightningPrototype/run_on_sagemaker.py", line 32, in run_on_sagemaker
    estimator.fit()
  File "/home/celso/projects/LightningPrototype/venv/lib/python3.10/site-packages/sagemaker/workflow/pipeline_context.py", line 311, in wrapper
    return run_func(*args, **kwargs)
  File "/home/celso/projects/LightningPrototype/venv/lib/python3.10/site-packages/sagemaker/estimator.py", line 1292, in fit
    self.latest_training_job.wait(logs=logs)
  File "/home/celso/projects/LightningPrototype/venv/lib/python3.10/site-packages/sagemaker/estimator.py", line 2474, in wait
    self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
  File "/home/celso/projects/LightningPrototype/venv/lib/python3.10/site-packages/sagemaker/session.py", line 4849, in logs_for_job
    _logs_for_job(self.boto_session, job_name, wait, poll, log_type, timeout)
  File "/home/celso/projects/LightningPrototype/venv/lib/python3.10/site-packages/sagemaker/session.py", line 6760, in _logs_for_job
    _check_job_status(job_name, description, "TrainingJobStatus")
  File "/home/celso/projects/LightningPrototype/venv/lib/python3.10/site-packages/sagemaker/session.py", line 6813, in _check_job_status
    raise exceptions.UnexpectedStatusException(
sagemaker.exceptions.UnexpectedStatusException: Error for Training job xmtc-2023-09-11-02-08-56-094: Failed. Reason: AlgorithmError: , exit code: 127

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Process finished with exit code 1

System information A description of your system. Please provide: