chankh / donkeycar-sagemaker

Build an autonomous car using Amazon SageMaker

Run Sagemaker notebook instance at 2nd time #2

Open chapmantam opened 6 years ago

chapmantam commented 6 years ago

Hi, when I run the SageMaker notebook instance for the first time, it generates a model file. When I run it a second time, it shows the following errors. Do you know how to fix this? Thank you.

INFO:sagemaker:Creating training-job with name: donkey-2018-11-01-01-52-54-297
2018-11-01 01:52:54 Starting - Starting the training job...
2018-11-01 01:52:58 Starting - Launching requested ML instances......
2018-11-01 01:54:02 Starting - Preparing the instances for training...
2018-11-01 01:54:52 Downloading - Downloading input data...
2018-11-01 01:55:21 Training - Training image download completed. Training in progress.
2018-11-01 01:55:21 Uploading - Uploading generated training model.
/opt/program/env/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
using donkey v2.2.1 ...
loading config file: /opt/program/d2/config.py
config loaded
tub_names /opt/ml/input/data/training/*
TubGroup:tubpaths: ['/opt/ml/input/data/training/tub_2018-10-20_410', '/opt/ml/input/data/training/output']
path_in_tub: /opt/ml/input/data/training/tub_2018-10-20_410
Tub exists: /opt/ml/input/data/training/tub_2018-10-20_410
path_in_tub: /opt/ml/input/data/training/output
Tub exists: /opt/ml/input/data/training/output
Traceback (most recent call last):
  File "/opt/program/d2/manage.py", line 190, in <module>
    train(cfg, tub, model)
  File "/opt/program/d2/manage.py", line 155, in train
    tubgroup = TubGroup(tub_names)
  File "/opt/program/donkeycar/parts/datastore.py", line 654, in __init__
    tubs = [Tub(path) for path in tub_paths]
  File "/opt/program/donkeycar/parts/datastore.py", line 654, in <listcomp>
    tubs = [Tub(path) for path in tub_paths]
  File "/opt/program/donkeycar/parts/datastore.py", line 164, in __init__
    with open(self.meta_path, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/input/data/training/output/meta.json'

2018-11-01 01:55:27 Failed - Training job failed
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-17e063b82a48> in <module>()
      7                        sagemaker_session=sess)
      8 
----> 9 tree.fit(data_location)

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name)
    192         self.latest_training_job = _TrainingJob.start_new(self, inputs)
    193         if wait:
--> 194             self.latest_training_job.wait(logs=logs)
    195 
    196     @classmethod

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
    438     def wait(self, logs=True):
    439         if logs:
--> 440             self.sagemaker_session.logs_for_job(self.job_name, wait=True)
    441         else:
    442             self.sagemaker_session.wait_for_job(self.job_name)

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll)
    932 
    933         if wait:
--> 934             self._check_job_status(job_name, description, 'TrainingJobStatus')
    935             if dot:
    936                 print()

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
    639         if status != 'Completed' and status != 'Stopped':
    640             reason = desc.get('FailureReason', '(No reason provided)')
--> 641             raise ValueError('Error training {}: {} Reason: {}'.format(job, status, reason))
    642 
    643     def wait_for_endpoint(self, endpoint, poll=5):

ValueError: Error training donkey-2018-11-01-01-52-54-297: Failed Reason: AlgorithmError: Exit Code: 1

chankh commented 6 years ago

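The path where the model is stored after the first training run is picked up as part of the training input in the second run, so the `output` folder is treated as a tub and fails because it has no `meta.json`. The input and output paths in the S3 bucket need to be separated; I will create a fix for this.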
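As a rough illustration of the separation described above (not the actual fix that went into the stack), the sketch below builds distinct S3 prefixes for training data and model output and checks that the output does not sit underneath the training prefix. The bucket name, the `donkey` base prefix, and the helper names `build_s3_paths` and `is_nested` are all hypothetical.

```python
def build_s3_paths(bucket, base_prefix="donkey"):
    """Return separate S3 URIs for training input and model output.

    The bucket and prefix layout are assumptions for illustration only.
    """
    base = "s3://{}/{}".format(bucket, base_prefix.strip("/"))
    return {
        "training": base + "/training",  # tub folders only
        "output": base + "/output",      # model artifacts only
    }


def is_nested(parent_uri, child_uri):
    """Return True if child_uri equals parent_uri or lies underneath it."""
    parent = parent_uri.rstrip("/") + "/"
    child = child_uri.rstrip("/") + "/"
    return child.startswith(parent)


paths = build_s3_paths("my-donkey-bucket")  # hypothetical bucket name

# The model output must not land inside the training data prefix,
# otherwise the next training job downloads it as if it were a tub.
assert not is_nested(paths["training"], paths["output"])

print(paths["training"])
print(paths["output"])
```

Assuming the notebook uses the SageMaker Python SDK's `Estimator`, the `output` URI would be passed as its `output_path` argument and the `training` URI to `fit()`, so that model artifacts are never written under the prefix that gets downloaded to `/opt/ml/input/data/training`.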

chankh commented 6 years ago

@chapmantam This issue should be fixed with the latest stack and notebook.