aws-samples / amazon-sagemaker-local-mode

Amazon SageMaker Local Mode Examples
MIT No Attribution

Local Mode Train with parameter `source_dir` is not working #10

Closed orriduck closed 2 years ago

orriduck commented 3 years ago

Hi,

I am attempting to launch a TensorFlow training job with the `entry_point` and `source_dir` attributes set, but I am getting a file-not-found error.

My file structure is something like this:

- folder A:
  - notebook.ipynb (the file I use to call the sagemaker local mode stuff)
- ds_pipeline:
  - src
    - train.py
    - other scripts

The code snippet I am using to call this training job:

# Training Job
salary_estimator = TensorFlow(
    entry_point="train.py",
    source_dir="../ds_pipeline/src",
    role=sagemaker.get_execution_role(),
    image_uri="763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:2.3.1-gpu-py37-cu110-ubuntu18.04",
    instance_count=1,
    instance_type="local",
    output_path="s3://sagemaker-project-p-zfuf9hgaujxu/experiment_packs/poc_exp/model",
    sagemaker_session=sagemaker_session,
    container_log_level=20,  # 10 debug, 20 info, 30 warning, 40 error
    volume_size=80,
    model_dir=False,
    hyperparameters={
        "default": {
            "train_epochs": 3,
            "train_batch_size": 1024,
            "early_stop_tolerance": 2
        },
        "CA": {
            "train_epochs": 5,
            "train_batch_size": 2048,
            "early_stop_tolerance": 2
        }
    }
)

salary_estimator.fit(
    inputs={
        "train": TrainingInput(
            s3_data="s3://sagemaker-project-p-zfuf9hgaujxu/experiment_packs/poc_exp/feature_engineering/encoded_train",
            content_type=None,
        ),
        "validation": TrainingInput(
            s3_data="s3://sagemaker-project-p-zfuf9hgaujxu/experiment_packs/poc_exp/feature_engineering/encoded_validation",
            content_type=None,
        ),
        "encoders": TrainingInput(
            s3_data="s3://sagemaker-project-p-zfuf9hgaujxu/experiment_packs/poc_exp/feature_engineering/encoders",
            content_type=None,
        ),
    }
)

The error I am getting:

Couldn't call 'get_role' to get Role ARN from role name BGTDevSageMakerAdmin to get Role path.
Creating ndnsmnf24z-algo-1-0c5g8 ... 
Creating ndnsmnf24z-algo-1-0c5g8 ... done
Attaching to ndnsmnf24z-algo-1-0c5g8
ndnsmnf24z-algo-1-0c5g8 | 2021-04-20 02:56:21.993193: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
ndnsmnf24z-algo-1-0c5g8 | 2021-04-20 02:56:21.993366: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.
ndnsmnf24z-algo-1-0c5g8 | 2021-04-20 02:56:21.999214: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
ndnsmnf24z-algo-1-0c5g8 | 2021-04-20 02:56:22.031609: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
ndnsmnf24z-algo-1-0c5g8 | 2021-04-20 02:56:23,294 sagemaker-training-toolkit INFO     Imported framework sagemaker_tensorflow_container.training
ndnsmnf24z-algo-1-0c5g8 | 2021-04-20 02:56:23,300 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
ndnsmnf24z-algo-1-0c5g8 | 2021-04-20 02:56:23,319 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
ndnsmnf24z-algo-1-0c5g8 | 2021-04-20 02:56:23,334 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
ndnsmnf24z-algo-1-0c5g8 | 2021-04-20 02:56:23,350 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
ndnsmnf24z-algo-1-0c5g8 | 2021-04-20 02:56:23,361 sagemaker-training-toolkit INFO     Invoking user script
ndnsmnf24z-algo-1-0c5g8 | 
ndnsmnf24z-algo-1-0c5g8 | Training Env:
ndnsmnf24z-algo-1-0c5g8 | 
ndnsmnf24z-algo-1-0c5g8 | {
ndnsmnf24z-algo-1-0c5g8 |     "additional_framework_parameters": {},
ndnsmnf24z-algo-1-0c5g8 |     "channel_input_dirs": {
ndnsmnf24z-algo-1-0c5g8 |         "train": "/opt/ml/input/data/train",
ndnsmnf24z-algo-1-0c5g8 |         "validation": "/opt/ml/input/data/validation",
ndnsmnf24z-algo-1-0c5g8 |         "encoders": "/opt/ml/input/data/encoders"
ndnsmnf24z-algo-1-0c5g8 |     },
ndnsmnf24z-algo-1-0c5g8 |     "current_host": "algo-1-0c5g8",
ndnsmnf24z-algo-1-0c5g8 |     "framework_module": "sagemaker_tensorflow_container.training:main",
ndnsmnf24z-algo-1-0c5g8 |     "hosts": [
ndnsmnf24z-algo-1-0c5g8 |         "algo-1-0c5g8"
ndnsmnf24z-algo-1-0c5g8 |     ],
ndnsmnf24z-algo-1-0c5g8 |     "hyperparameters": {
ndnsmnf24z-algo-1-0c5g8 |         "default": {
ndnsmnf24z-algo-1-0c5g8 |             "train_epochs": 3,
ndnsmnf24z-algo-1-0c5g8 |             "train_batch_size": 1024,
ndnsmnf24z-algo-1-0c5g8 |             "early_stop_tolerance": 2
ndnsmnf24z-algo-1-0c5g8 |         },
ndnsmnf24z-algo-1-0c5g8 |         "CA": {
ndnsmnf24z-algo-1-0c5g8 |             "train_epochs": 5,
ndnsmnf24z-algo-1-0c5g8 |             "train_batch_size": 2048,
ndnsmnf24z-algo-1-0c5g8 |             "early_stop_tolerance": 2
ndnsmnf24z-algo-1-0c5g8 |         }
ndnsmnf24z-algo-1-0c5g8 |     },
ndnsmnf24z-algo-1-0c5g8 |     "input_config_dir": "/opt/ml/input/config",
ndnsmnf24z-algo-1-0c5g8 |     "input_data_config": {
ndnsmnf24z-algo-1-0c5g8 |         "train": {
ndnsmnf24z-algo-1-0c5g8 |             "TrainingInputMode": "File"
ndnsmnf24z-algo-1-0c5g8 |         },
ndnsmnf24z-algo-1-0c5g8 |         "validation": {
ndnsmnf24z-algo-1-0c5g8 |             "TrainingInputMode": "File"
ndnsmnf24z-algo-1-0c5g8 |         },
ndnsmnf24z-algo-1-0c5g8 |         "encoders": {
ndnsmnf24z-algo-1-0c5g8 |             "TrainingInputMode": "File"
ndnsmnf24z-algo-1-0c5g8 |         }
ndnsmnf24z-algo-1-0c5g8 |     },
ndnsmnf24z-algo-1-0c5g8 |     "input_dir": "/opt/ml/input",
ndnsmnf24z-algo-1-0c5g8 |     "is_master": true,
ndnsmnf24z-algo-1-0c5g8 |     "job_name": "tensorflow-training-2021-04-20-02-55-11-498",
ndnsmnf24z-algo-1-0c5g8 |     "log_level": 20,
ndnsmnf24z-algo-1-0c5g8 |     "master_hostname": "algo-1-0c5g8",
ndnsmnf24z-algo-1-0c5g8 |     "model_dir": "/opt/ml/model",
ndnsmnf24z-algo-1-0c5g8 |     "module_dir": "/opt/ml/code",
ndnsmnf24z-algo-1-0c5g8 |     "module_name": "train",
ndnsmnf24z-algo-1-0c5g8 |     "network_interface_name": "eth0",
ndnsmnf24z-algo-1-0c5g8 |     "num_cpus": 8,
ndnsmnf24z-algo-1-0c5g8 |     "num_gpus": 0,
ndnsmnf24z-algo-1-0c5g8 |     "output_data_dir": "/opt/ml/output/data",
ndnsmnf24z-algo-1-0c5g8 |     "output_dir": "/opt/ml/output",
ndnsmnf24z-algo-1-0c5g8 |     "output_intermediate_dir": "/opt/ml/output/intermediate",
ndnsmnf24z-algo-1-0c5g8 |     "resource_config": {
ndnsmnf24z-algo-1-0c5g8 |         "current_host": "algo-1-0c5g8",
ndnsmnf24z-algo-1-0c5g8 |         "hosts": [
ndnsmnf24z-algo-1-0c5g8 |             "algo-1-0c5g8"
ndnsmnf24z-algo-1-0c5g8 |         ]
ndnsmnf24z-algo-1-0c5g8 |     },
ndnsmnf24z-algo-1-0c5g8 |     "user_entry_point": "train.py"
ndnsmnf24z-algo-1-0c5g8 | }
ndnsmnf24z-algo-1-0c5g8 | 
ndnsmnf24z-algo-1-0c5g8 | Environment variables:
ndnsmnf24z-algo-1-0c5g8 | 
ndnsmnf24z-algo-1-0c5g8 | SM_HOSTS=["algo-1-0c5g8"]
ndnsmnf24z-algo-1-0c5g8 | SM_NETWORK_INTERFACE_NAME=eth0
ndnsmnf24z-algo-1-0c5g8 | SM_HPS={"CA":{"early_stop_tolerance":2,"train_batch_size":2048,"train_epochs":5},"default":{"early_stop_tolerance":2,"train_batch_size":1024,"train_epochs":3}}
ndnsmnf24z-algo-1-0c5g8 | SM_USER_ENTRY_POINT=train.py
ndnsmnf24z-algo-1-0c5g8 | SM_FRAMEWORK_PARAMS={}
ndnsmnf24z-algo-1-0c5g8 | SM_RESOURCE_CONFIG={"current_host":"algo-1-0c5g8","hosts":["algo-1-0c5g8"]}
ndnsmnf24z-algo-1-0c5g8 | SM_INPUT_DATA_CONFIG={"encoders":{"TrainingInputMode":"File"},"train":{"TrainingInputMode":"File"},"validation":{"TrainingInputMode":"File"}}
ndnsmnf24z-algo-1-0c5g8 | SM_OUTPUT_DATA_DIR=/opt/ml/output/data
ndnsmnf24z-algo-1-0c5g8 | SM_CHANNELS=["encoders","train","validation"]
ndnsmnf24z-algo-1-0c5g8 | SM_CURRENT_HOST=algo-1-0c5g8
ndnsmnf24z-algo-1-0c5g8 | SM_MODULE_NAME=train
ndnsmnf24z-algo-1-0c5g8 | SM_LOG_LEVEL=20
ndnsmnf24z-algo-1-0c5g8 | SM_FRAMEWORK_MODULE=sagemaker_tensorflow_container.training:main
ndnsmnf24z-algo-1-0c5g8 | SM_INPUT_DIR=/opt/ml/input
ndnsmnf24z-algo-1-0c5g8 | SM_INPUT_CONFIG_DIR=/opt/ml/input/config
ndnsmnf24z-algo-1-0c5g8 | SM_OUTPUT_DIR=/opt/ml/output
ndnsmnf24z-algo-1-0c5g8 | SM_NUM_CPUS=8
ndnsmnf24z-algo-1-0c5g8 | SM_NUM_GPUS=0
ndnsmnf24z-algo-1-0c5g8 | SM_MODEL_DIR=/opt/ml/model
ndnsmnf24z-algo-1-0c5g8 | SM_MODULE_DIR=/opt/ml/code
ndnsmnf24z-algo-1-0c5g8 | SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"encoders":"/opt/ml/input/data/encoders","train":"/opt/ml/input/data/train","validation":"/opt/ml/input/data/validation"},"current_host":"algo-1-0c5g8","framework_module":"sagemaker_tensorflow_container.training:main","hosts":["algo-1-0c5g8"],"hyperparameters":{"CA":{"early_stop_tolerance":2,"train_batch_size":2048,"train_epochs":5},"default":{"early_stop_tolerance":2,"train_batch_size":1024,"train_epochs":3}},"input_config_dir":"/opt/ml/input/config","input_data_config":{"encoders":{"TrainingInputMode":"File"},"train":{"TrainingInputMode":"File"},"validation":{"TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"tensorflow-training-2021-04-20-02-55-11-498","log_level":20,"master_hostname":"algo-1-0c5g8","model_dir":"/opt/ml/model","module_dir":"/opt/ml/code","module_name":"train","network_interface_name":"eth0","num_cpus":8,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1-0c5g8","hosts":["algo-1-0c5g8"]},"user_entry_point":"train.py"}
ndnsmnf24z-algo-1-0c5g8 | SM_USER_ARGS=["--CA","early_stop_tolerance=2,train_batch_size=2048,train_epochs=5","--default","early_stop_tolerance=2,train_batch_size=1024,train_epochs=3"]
ndnsmnf24z-algo-1-0c5g8 | SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
ndnsmnf24z-algo-1-0c5g8 | SM_CHANNEL_TRAIN=/opt/ml/input/data/train
ndnsmnf24z-algo-1-0c5g8 | SM_CHANNEL_VALIDATION=/opt/ml/input/data/validation
ndnsmnf24z-algo-1-0c5g8 | SM_CHANNEL_ENCODERS=/opt/ml/input/data/encoders
ndnsmnf24z-algo-1-0c5g8 | SM_HP_DEFAULT={"early_stop_tolerance":2,"train_batch_size":1024,"train_epochs":3}
ndnsmnf24z-algo-1-0c5g8 | SM_HP_CA={"early_stop_tolerance":2,"train_batch_size":2048,"train_epochs":5}
ndnsmnf24z-algo-1-0c5g8 | PYTHONPATH=/opt/ml/code:/usr/local/bin:/usr/local/lib/python37.zip:/usr/local/lib/python3.7:/usr/local/lib/python3.7/lib-dynload:/usr/local/lib/python3.7/site-packages
ndnsmnf24z-algo-1-0c5g8 | 
ndnsmnf24z-algo-1-0c5g8 | Invoking script with the following command:
ndnsmnf24z-algo-1-0c5g8 | 
ndnsmnf24z-algo-1-0c5g8 | /usr/local/bin/python3.7 train.py --CA early_stop_tolerance=2,train_batch_size=2048,train_epochs=5 --default early_stop_tolerance=2,train_batch_size=1024,train_epochs=3
ndnsmnf24z-algo-1-0c5g8 | 
ndnsmnf24z-algo-1-0c5g8 | 
ndnsmnf24z-algo-1-0c5g8 | /usr/local/bin/python3.7: can't open file 'train.py': [Errno 2] No such file or directory
ndnsmnf24z-algo-1-0c5g8 | 
ndnsmnf24z-algo-1-0c5g8 | 2021-04-20 02:56:23,389 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
ndnsmnf24z-algo-1-0c5g8 | Command "/usr/local/bin/python3.7 train.py --CA early_stop_tolerance=2,train_batch_size=2048,train_epochs=5 --default early_stop_tolerance=2,train_batch_size=1024,train_epochs=3"
ndnsmnf24z-algo-1-0c5g8 | /usr/local/bin/python3.7: can't open file 'train.py': [Errno 2] No such file or directory
ndnsmnf24z-algo-1-0c5g8 exited with code 1
Aborting on container exit...

Any comments would be helpful, thanks.

eitansela commented 3 years ago

Hello @ruyyi0323

It looks like `source_dir="../ds_pipeline/src"` should be `source_dir="./ds_pipeline/src"`.

You can look at this example: https://github.com/aws-samples/amazon-sagemaker-local-mode/blob/main/tensorflow_script_mode_debug_local_training/tensorflow_script_mode_debug_local_training.py
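For context, the check that raises this error in the SDK (`sagemaker.fw_utils.validate_source_dir`, shown in the traceback below) is just an `os.path.isfile` test on the joined path, so you can reproduce it in the notebook before constructing the estimator. A minimal sketch — the relative paths are the ones from this thread, and whether they validate depends entirely on the notebook's current working directory:

```python
import os

def validate_source_dir(script: str, directory: str) -> bool:
    """Mirror of the SDK's check: the entry point script must exist
    inside the source directory, resolved from the current working dir."""
    return os.path.isfile(os.path.join(directory, script))

# See where relative paths actually resolve from, then test both variants:
print(os.getcwd())
print(validate_source_dir("train.py", "../ds_pipeline/src"))
print(validate_source_dir("train.py", "./ds_pipeline/src"))
```

If both print `False`, the notebook's working directory is not where you think it is, and no relative `source_dir` will pass validation.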

orriduck commented 3 years ago

Hi @eitansela ,

It raises an error:

Couldn't call 'get_role' to get Role ARN from role name BGTDevSageMakerAdmin to get Role path.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-17-67b9be0e75b7> in <module>
     39         "encoders": TrainingInput(
     40             s3_data="s3://sagemaker-project-p-zfuf9hgaujxu/experiment_packs/poc_exp/feature_engineering/encoders",
---> 41             content_type=None,
     42         ),
     43     }

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
    652 
    653         """
--> 654         self._prepare_for_training(job_name=job_name)
    655 
    656         self.latest_training_job = _TrainingJob.start_new(self, inputs, experiment_config)

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in _prepare_for_training(self, job_name)
   2164         # source directory. We are intentionally not handling it because this is a critical error.
   2165         if self.source_dir and not self.source_dir.lower().startswith("s3://"):
-> 2166             validate_source_dir(self.entry_point, self.source_dir)
   2167 
   2168         # if we are in local mode with local_code=True. We want the container to just

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/fw_utils.py in validate_source_dir(script, directory)
     77         if not os.path.isfile(os.path.join(directory, script)):
     78             raise ValueError(
---> 79                 'No file named "{}" was found in directory "{}".'.format(script, directory)
     80             )
     81 

ValueError: No file named "train.py" was found in directory "./ds_pipeline/src".

For reference, here is the file tree. I am using the notebook indicated by the blue arrow (in the attached image) to execute the local training job.

Included for your convenience, in case you want to replicate this:

.
├── build_and_exec_params.json
├── build_and_exec.py
├── build_requirements.txt
├── data_helpers
│   ├── data_acquire.py
│   ├── data_cleanup.py
│   ├── data_prep_guidebook.ipynb
│   ├── __pycache__
│   │   ├── data_acquire.cpython-36.pyc
│   │   └── data_cleanup.cpython-36.pyc
│   └── snowflake.zip
├── downstream_preview
│   ├── sagemaker_endpoint_template.yml
│   └── sagemaker_project_shareside_template.yml
├── ds_pipeline
│   ├── data_evaluation.py
│   ├── data_ingestion.py
│   ├── feature_engineering.py
│   ├── __init__.py
│   ├── model_evaluation.py
│   ├── pipeline.py
│   ├── readme.md
│   ├── requirements.txt
│   └── src
│       ├── encoders.py
│       ├── inference.py
│       ├── nn_model.py
│       ├── predictor.py
│       ├── requirements.txt
│       └── train.py
├── modelbuild_buildspec.yml
├── Project_Report.md
├── README.md
└── sagemaker_modelbuild_project_assistant
    ├── endpoint_deployment_test.ipynb
    ├── kill_resources.ipynb
    ├── modelpackage_injector.ipynb
    ├── orchestrial_procedure.ipynb
    ├── resources_killer
    │   ├── local_mode_resource_killer.sh
    │   ├── model_package_killer.py
    │   └── pipeline_killer.py
    ├── standalone_pipeline_run.ipynb
    └── stepscript_templates
        ├── inference_seed_script.py
        ├── processing_seed_script.py
        ├── tensorflow_seed_pipeline.py
        └── training_seed_script.py
orriduck commented 3 years ago

Using an absolute path fixed this issue.
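One way to build that absolute path robustly, regardless of where the notebook kernel was launched, is to resolve it with `pathlib` and fail fast if the entry point is missing. A sketch under the assumptions of this thread (`ds_pipeline/src` layout, `train.py` entry point; `resolve_source_dir` is a hypothetical helper, not part of the SageMaker SDK):

```python
from pathlib import Path

def resolve_source_dir(base: Path, *parts: str) -> str:
    """Build an absolute source_dir under `base` and verify that the
    train.py entry point exists there before handing it to the SDK."""
    source_dir = base.joinpath(*parts).resolve()
    if not (source_dir / "train.py").is_file():
        raise FileNotFoundError(f"train.py not found in {source_dir}")
    return str(source_dir)

# e.g., from the notebook one level below the project root:
# source_dir = resolve_source_dir(Path.cwd().parent, "ds_pipeline", "src")
# salary_estimator = TensorFlow(entry_point="train.py", source_dir=source_dir, ...)
```

Failing early in the notebook gives a clearer error than waiting for the container to report `can't open file 'train.py'`.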