aws / amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.
https://sagemaker-examples.readthedocs.io
Apache License 2.0

Error for training job failed. reason: algorithmerror: exit code: 127 #1589

Open katreparitosh opened 4 years ago

katreparitosh commented 4 years ago

Hello,

Same as #969.

I was training a DistilBERT model on a SageMaker instance using fast-bert, with an ml.p2.xlarge instance for GPU processing.

When the function downloads the training image from ECR during fit(), I receive "/usr/bin/env: ‘python\r’: No such file or directory".

At the end of the stack trace I then get: error for training job failed. reason: algorithmerror: exit code: 127

Tech stack:

fast-bert Docker image
SageMaker notebook instance: ml.t2.medium
GPU compute: ml.p2.xlarge

What could be the reason for this error? My IAM role has all the required permissions.

Kindly help.

mlcl-peter-holberton commented 1 year ago

I had the same issue on Windows. It seems to be caused by running Docker builds with files that have Windows line endings (the \r after python is the giveaway). I solved it by forcing git to keep the Unix line endings from the repository instead of converting them automatically: git clone https://github.com/aws/amazon-sagemaker-examples.git --config core.autocrlf=false
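If re-cloning is not an option, the same effect can probably be had by converting the files in an existing checkout before building the image. The following is only a rough sketch and not something taken from this thread or from fast-bert: the names has_crlf, to_lf, normalize_tree and TEXT_SUFFIXES are made up for illustration, and the assumption that the affected files are .py, .sh or extensionless scripts (such as a container entry point) may not match your project.

```python
from pathlib import Path

# Assumption: only these suffixes are treated as scripts; "" matches
# extensionless files such as a container entry point or a Dockerfile.
TEXT_SUFFIXES = {".py", ".sh", ""}


def has_crlf(data: bytes) -> bool:
    """Return True if the bytes contain Windows (CRLF) line endings."""
    return b"\r\n" in data


def to_lf(data: bytes) -> bytes:
    """Replace every CRLF pair with a single LF."""
    return data.replace(b"\r\n", b"\n")


def normalize_tree(root: str) -> list:
    """Rewrite CRLF files under root with LF endings; return changed paths."""
    changed = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in TEXT_SUFFIXES:
            continue
        if ".git" in path.parts:
            continue  # leave git's own metadata alone
        data = path.read_bytes()
        if b"\0" in data:
            continue  # a NUL byte suggests a binary file; do not touch it
        if has_crlf(data):
            path.write_bytes(to_lf(data))
            changed.append(str(path))
    return changed
```

Calling normalize_tree on the directory that holds the Dockerfile and then rebuilding the image should remove the stray \r from the shebang line. It rewrites files in place, so try it on a copy or a clean git working tree first.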

celsofranssa commented 11 months ago

Do you have any directions here?