aws / sagemaker-python-sdk

A library for training and deploying machine learning models on Amazon SageMaker
https://sagemaker.readthedocs.io/
Apache License 2.0
2.11k stars 1.14k forks

Local Mode: No such File or Directory #656

Closed (cSchubes closed 4 years ago)

cSchubes commented 5 years ago

System Information

Describe the problem

We are attempting to follow the process outlined in this example to get started with SageMaker local mode. We have everything running on a t2.micro EC2 instance, and have put everything from the notebook linked above into a script. However, the script fails with a "No such file or directory" error. The relevant code is below:

inputs = sagemaker_session.upload_data(path='data', key_prefix='data/mnist')

from sagemaker.tensorflow import TensorFlow

mnist_estimator = TensorFlow(entry_point='mnist.py',
                             role=role,
                             framework_version='1.12.0',
                             training_steps=10, 
                             evaluation_steps=10,
                             train_instance_count=2,
                             train_instance_type='local')

mnist_estimator.fit(inputs)

The call to upload_data is working (verified by checking S3). However, the SageMaker training code is not able to find this information.

Minimal repro / logs

Log:

Traceback (most recent call last):
 File "test.py", line 36, in <module>
   mnist_estimator.fit(inputs)
 File "/usr/local/lib/python2.7/site-packages/sagemaker/tensorflow/estimator.py", line 336, in fit
   fit_super()
 File "/usr/local/lib/python2.7/site-packages/sagemaker/tensorflow/estimator.py", line 315, in fit_super
   super(TensorFlow, self).fit(inputs, wait, logs, job_name)
 File "/usr/local/lib/python2.7/site-packages/sagemaker/estimator.py", line 236, in fit
   self.latest_training_job = _TrainingJob.start_new(self, inputs)
 File "/usr/local/lib/python2.7/site-packages/sagemaker/estimator.py", line 578, in start_new
   estimator.sagemaker_session.train(**train_args)
 File "/usr/local/lib/python2.7/site-packages/sagemaker/session.py", line 320, in train
   self.sagemaker_client.create_training_job(**train_request)
 File "/usr/local/lib/python2.7/site-packages/sagemaker/local/local_session.py", line 74, in create_training_job
   training_job.start(InputDataConfig, OutputDataConfig, hyperparameters, TrainingJobName)
 File "/usr/local/lib/python2.7/site-packages/sagemaker/local/entities.py", line 70, in start
   self.model_artifacts = self.container.train(input_data_config, output_data_config, hyperparameters, job_name)
 File "/usr/local/lib/python2.7/site-packages/sagemaker/local/image.py", line 130, in train
   process = subprocess.Popen(compose_command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
 File "/usr/lib64/python2.7/subprocess.py", line 390, in __init__
   errread, errwrite)
 File "/usr/lib64/python2.7/subprocess.py", line 1025, in _execute_child
   raise child_exception
OSError: [Errno 2] No such file or directory

mvsusp commented 5 years ago

Hi @cSchubes ,

It seems that local mode is failing to execute Docker Compose to start the training containers. Are you able to run docker and docker-compose yourself on the instance?
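
A quick way to answer that question from Python is to check whether both executables resolve on the PATH. This is only a sketch: the helper name `missing_executables` is made up for illustration, the executable name `docker-compose` is an assumption about what local mode invokes, and `shutil.which` needs Python 3.3 or later (the traceback above comes from Python 2.7).

```python
import shutil


def missing_executables(names=("docker", "docker-compose")):
    """Return the names that cannot be found on the current PATH."""
    # shutil.which returns None when no matching executable is found.
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_executables()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("docker and docker-compose are both on PATH")
```

If either name is reported as missing, the `subprocess.Popen` call in the traceback would be expected to fail in the same way.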

Thanks for using SageMaker.

laurenyu commented 5 years ago

hi @cSchubes, is this still an issue for you? another thing to check would be that the training script mnist.py is in the same directory as your test script.
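
That check can be done before constructing the estimator. A minimal sketch, assuming the entry point is given as a path relative to some base directory; the helper `entry_point_exists` is hypothetical and not part of the SageMaker SDK:

```python
import os


def entry_point_exists(entry_point, base_dir=None):
    """Return True if entry_point is a regular file under base_dir.

    base_dir defaults to the current working directory.
    """
    base_dir = base_dir or os.getcwd()
    return os.path.isfile(os.path.join(base_dir, entry_point))


if __name__ == "__main__":
    # 'mnist.py' is the entry point name used in the report above.
    print(entry_point_exists("mnist.py"))
```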

cSchubes commented 5 years ago

we moved away from the API due to time constraints on the project and instead are running our own training code on EC2 instances. However, I am interested in this going forward - I can post an update here when I get the chance to try out your suggestions.

laurenyu commented 5 years ago

@cSchubes thanks for the response. if you do get a chance to revisit trying out SageMaker, you may also be interested in Script Mode (for details, see https://sagemaker.readthedocs.io/en/stable/using_tf.html) - it should allow you to run your training script that you're using on EC2 with minimal modification in SageMaker.

samlovestech commented 5 years ago

I'm also having the same problem, "[Errno 2] No such file or directory: 'docker': 'docker'", when I use LocalSession and train_instance_type='local'. Why is the local mode documentation so poor?

laurenyu commented 4 years ago

sorry for the delayed response here - usually [Errno 2] No such file or directory: 'docker': 'docker' indicates that docker is not installed
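
The error itself comes from `subprocess.Popen`, which raises an `OSError` with errno 2 (`ENOENT`) when the executable it is asked to start cannot be found. The sketch below shows that behaviour and turns it into a clearer message; `run_or_explain` is a hypothetical helper written for illustration, not something the SDK provides.

```python
import errno
import subprocess


def run_or_explain(command):
    """Run command and return its combined output.

    If the executable itself is missing, return a short explanation
    instead of letting the bare OSError propagate.
    """
    try:
        process = subprocess.Popen(
            command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT
        )
    except OSError as exc:
        # ENOENT is errno 2, "No such file or directory".
        if exc.errno == errno.ENOENT:
            return "executable not found: " + command[0]
        raise
    output, _ = process.communicate()
    return output.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(run_or_explain(["docker", "--version"]))
```

On a machine without Docker on the PATH, this prints the "executable not found" message rather than a traceback.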

annaluo676 commented 4 years ago

@laurenyu It seems docker is not installed on SageMaker Studio by default. As a result, I encountered the same error when building a BYOC (bring-your-own-container) image. What best practice would you recommend here? Thanks in advance.

laurenyu commented 4 years ago

@annaluo676 unfortunately, the best I can recommend at this time is to build the image elsewhere, e.g. locally, in a SageMaker Notebook Instance, or on an EC2 instance. There's been some planning around fixing this experience, but I don't yet have a timeline to share.