aws / sagemaker-training-toolkit

Train machine learning models within a 🐳 Docker container using 🧠 Amazon SageMaker.
Apache License 2.0

Pytorch Sagemaker Container STDERR output #37

Open tanguycdls opened 5 years ago

tanguycdls commented 5 years ago

Describe the problem

In the PyTorch images, anything the training script writes to stderr is not captured and never shows up in the logs.

Minimal repro / logs

entrypoint.py:

if __name__ == '__main__':
    import sys
    sys.stderr.write('Coucou stderr')
    sys.stdout.write('Coucou stdout')

Launch script:

from sagemaker.pytorch import PyTorch

# `role` is an IAM role ARN defined elsewhere; it is not shown in this report.
estimator = PyTorch(entry_point='entrypoint.py',
                    role=role,
                    framework_version='1.1.0',
                    train_instance_count=1,
                    train_instance_type='local',
                    )
estimator.fit({'config': 's3://sagemaker-eu-*************/config/test_sagemaker_1.json'})

LOGS

Creating tmpqp7i_4w3_algo-1-8gd7b_1 ... done
Attaching to tmpqp7i_4w3_algo-1-8gd7b_1
algo-1-8gd7b_1 | 2019-10-22 09:06:21,345 sagemaker-containers INFO Imported framework sagemaker_pytorch_container.training
algo-1-8gd7b_1 | 2019-10-22 09:06:21,349 sagemaker-containers INFO No GPUs detected (normal if no gpus installed)
algo-1-8gd7b_1 | 2019-10-22 09:06:21,363 sagemaker_pytorch_container.training INFO Block until all host DNS lookups succeed.
algo-1-8gd7b_1 | 2019-10-22 09:06:21,365 sagemaker_pytorch_container.training INFO Invoking user training script.
algo-1-8gd7b_1 | 2019-10-22 09:06:21,489 sagemaker-containers INFO Module entrypoint does not provide a setup.py.
algo-1-8gd7b_1 | Generating setup.py
algo-1-8gd7b_1 | 2019-10-22 09:06:21,489 sagemaker-containers INFO Generating setup.cfg
algo-1-8gd7b_1 | 2019-10-22 09:06:21,489 sagemaker-containers INFO Generating MANIFEST.in
algo-1-8gd7b_1 | 2019-10-22 09:06:21,490 sagemaker-containers INFO Installing module with the following command:
algo-1-8gd7b_1 | /usr/bin/python -m pip install .
algo-1-8gd7b_1 | Processing /opt/ml/code
algo-1-8gd7b_1 | Building wheels for collected packages: entrypoint
algo-1-8gd7b_1 | Running setup.py bdist_wheel for entrypoint ... done
algo-1-8gd7b_1 | Stored in directory: /tmp/pip-ephem-wheel-cache-44kbrxy0/wheels/35/24/16/37574d11bf9bde50616c******356bc7164af8ca3
algo-1-8gd7b_1 | Successfully built entrypoint
algo-1-8gd7b_1 | Installing collected packages: entrypoint
algo-1-8gd7b_1 | Successfully installed entrypoint-1.0.0
algo-1-8gd7b_1 | You are using pip version 18.1, however version 19.3.1 is available.
algo-1-8gd7b_1 | You should consider upgrading via the 'pip install --upgrade pip' command.
algo-1-8gd7b_1 | 2019-10-22 09:06:23,054 sagemaker-containers INFO No GPUs detected (normal if no gpus installed)
algo-1-8gd7b_1 | 2019-10-22 09:06:23,069 sagemaker-containers INFO Invoking user script
algo-1-8gd7b_1 |
algo-1-8gd7b_1 | Training Env:
algo-1-8gd7b_1 |
algo-1-8gd7b_1 | {
algo-1-8gd7b_1 |     "additional_framework_parameters": {},
algo-1-8gd7b_1 |     "channel_input_dirs": {
algo-1-8gd7b_1 |         "config": "/opt/ml/input/data/config"
algo-1-8gd7b_1 |     },
algo-1-8gd7b_1 |     "current_host": "algo-1-8gd7b",
algo-1-8gd7b_1 |     "framework_module": "sagemaker_pytorch_container.training:main",
algo-1-8gd7b_1 |     "hosts": [
algo-1-8gd7b_1 |         "algo-1-8gd7b"
algo-1-8gd7b_1 |     ],
algo-1-8gd7b_1 |     "hyperparameters": {},
algo-1-8gd7b_1 |     "input_config_dir": "/opt/ml/input/config",
algo-1-8gd7b_1 |     "input_data_config": {
algo-1-8gd7b_1 |         "config": {
algo-1-8gd7b_1 |             "TrainingInputMode": "File"
algo-1-8gd7b_1 |         }
algo-1-8gd7b_1 |     },
algo-1-8gd7b_1 |     "input_dir": "/opt/ml/input",
algo-1-8gd7b_1 |     "is_master": true,
algo-1-8gd7b_1 |     "job_name": "sagemaker-pytorch-2019-10-22-09-06-18-353",
algo-1-8gd7b_1 |     "log_level": 20,
algo-1-8gd7b_1 |     "master_hostname": "algo-1-8gd7b",
algo-1-8gd7b_1 |     "model_dir": "/opt/ml/model",
algo-1-8gd7b_1 |     "module_dir": "s3://sagemaker-eu-west-1-*********/sagemaker-pytorch-2019-10-22-09-06-18-353/source/sourcedir.tar.gz",
algo-1-8gd7b_1 |     "module_name": "entrypoint",
algo-1-8gd7b_1 |     "network_interface_name": "eth0",
algo-1-8gd7b_1 |     "num_cpus": 2,
algo-1-8gd7b_1 |     "num_gpus": 0,
algo-1-8gd7b_1 |     "output_data_dir": "/opt/ml/output/data",
algo-1-8gd7b_1 |     "output_dir": "/opt/ml/output",
algo-1-8gd7b_1 |     "output_intermediate_dir": "/opt/ml/output/intermediate",
algo-1-8gd7b_1 |     "resource_config": {
algo-1-8gd7b_1 |         "current_host": "algo-1-8gd7b",
algo-1-8gd7b_1 |         "hosts": [
algo-1-8gd7b_1 |             "algo-1-8gd7b"
algo-1-8gd7b_1 |         ]
algo-1-8gd7b_1 |     },
algo-1-8gd7b_1 |     "user_entry_point": "entrypoint.py"
algo-1-8gd7b_1 | }
algo-1-8gd7b_1 |
algo-1-8gd7b_1 | Environment variables:
algo-1-8gd7b_1 |
algo-1-8gd7b_1 | SM_HOSTS=["algo-1-8gd7b"]
algo-1-8gd7b_1 | SM_NETWORK_INTERFACE_NAME=eth0
algo-1-8gd7b_1 | SM_HPS={}
algo-1-8gd7b_1 | SM_USER_ENTRY_POINT=entrypoint.py
algo-1-8gd7b_1 | SM_FRAMEWORK_PARAMS={}
algo-1-8gd7b_1 | SM_RESOURCE_CONFIG={"current_host":"algo-1-8gd7b","hosts":["algo-1-8gd7b"]}
algo-1-8gd7b_1 | SM_INPUT_DATA_CONFIG={"config":{"TrainingInputMode":"File"}}
algo-1-8gd7b_1 | SM_OUTPUT_DATA_DIR=/opt/ml/output/data
algo-1-8gd7b_1 | SM_CHANNELS=["config"]
algo-1-8gd7b_1 | SM_CURRENT_HOST=algo-1-8gd7b
algo-1-8gd7b_1 | SM_MODULE_NAME=entrypoint
algo-1-8gd7b_1 | SM_LOG_LEVEL=20
algo-1-8gd7b_1 | SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
algo-1-8gd7b_1 | SM_INPUT_DIR=/opt/ml/input
algo-1-8gd7b_1 | SM_INPUT_CONFIG_DIR=/opt/ml/input/config
algo-1-8gd7b_1 | SM_OUTPUT_DIR=/opt/ml/output
algo-1-8gd7b_1 | SM_NUM_CPUS=2
algo-1-8gd7b_1 | SM_NUM_GPUS=0
algo-1-8gd7b_1 | SM_MODEL_DIR=/opt/ml/model
algo-1-8gd7b_1 | SM_MODULE_DIR=s3://sagemaker-eu-west-1-***********/sagemaker-pytorch-2019-10-22-09-06-18-353/source/sourcedir.tar.gz
algo-1-8gd7b_1 | SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"config":"/opt/ml/input/data/config"},"current_host":"algo-1-8gd7b","framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1-8gd7b"],"hyperparameters":{},"input_config_dir":"/opt/ml/input/config","input_data_config":{"config":{"TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"sagemaker-pytorch-2019-10-22-09-06-18-353","log_level":20,"master_hostname":"algo-1-8gd7b","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-eu-west-1-**********/sagemaker-pytorch-2019-10-22-09-06-18-353/source/sourcedir.tar.gz","module_name":"entrypoint","network_interface_name":"eth0","num_cpus":2,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1-8gd7b","hosts":["algo-1-8gd7b"]},"user_entry_point":"entrypoint.py"}
algo-1-8gd7b_1 | SM_USER_ARGS=[]
algo-1-8gd7b_1 | SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
algo-1-8gd7b_1 | SM_CHANNEL_CONFIG=/opt/ml/input/data/config
algo-1-8gd7b_1 | PYTHONPATH=/usr/local/bin:/usr/lib/python36.zip:/usr/lib/python3.6:/usr/lib/python3.6/lib-dynload:/usr/local/lib/python3.6/dist-packages:/usr/lib/python3/dist-packages
algo-1-8gd7b_1 |
algo-1-8gd7b_1 | Invoking script with the following command:
algo-1-8gd7b_1 |
algo-1-8gd7b_1 | /usr/bin/python -m entrypoint
algo-1-8gd7b_1 |
algo-1-8gd7b_1 |
algo-1-8gd7b_1 | Coucou stdout2019-10-22 09:06:23,102 sagemaker-containers INFO Reporting training SUCCESS
tmpqp7i_4w3_algo-1-8gd7b_1 exited with code 0
Aborting on container exit...
===== Job Complete =====

As you can see, 'Coucou stdout' was printed but 'Coucou stderr' was dropped. The result is the same in remote (non-local) mode.
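For anyone hitting the same behaviour, one possible stopgap is to route Python-level stderr writes through stdout inside the entry point. This is only a sketch and it rests on an assumption: that whatever reaches sys.stdout does end up in the logs, as it does in the run above. The names train and run_with_stderr_on_stdout are hypothetical, not part of SageMaker.

```python
import contextlib
import sys


def train():
    # Stand-in for the real training code (hypothetical).
    sys.stderr.write('Coucou stderr\n')
    sys.stdout.write('Coucou stdout\n')


def run_with_stderr_on_stdout(fn):
    # Temporarily rebind sys.stderr to the current sys.stdout, so Python-level
    # writes to sys.stderr go down the stream that does reach the logs.
    # Output written by C extensions or child processes is not affected.
    with contextlib.redirect_stderr(sys.stdout):
        return fn()


if __name__ == '__main__':
    run_with_stderr_on_stdout(train)
```

This only changes what the script itself does; it does not touch how the container launches the script.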

laurenyu commented 5 years ago

thanks for bringing this to our attention! I think for Local Mode, the fix would be modifying the code around https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/local/image.py#L657 - I'll see if I can get to making a PR.

tanguycdls commented 5 years ago

@laurenyu thank you for your answer! I'm not sure it comes from Local Mode: I see the same issue in remote mode, in the CloudWatch logs:

PYTHONPATH=/opt/conda/bin:/opt/conda/lib/python36.zip:/opt/conda/lib/python3.6:/opt/conda/lib/python3.6/lib-dynload:/opt/conda/lib/python3.6/site-packages
Invoking script with the following command:
/opt/conda/bin/python -m entrypoint

2019-10-24 07:16:56,594 sagemaker-containers INFO Reporting training SUCCESS
Coucou stdout

Training Image: 763104351884.dkr.ecr.eu-west-1.amazonaws.com/pytorch-training:1.2.0-cpu-py3.

laurenyu commented 5 years ago

you're right - my bad. the fix needs to happen at https://github.com/aws/sagemaker-containers/blob/master/src/sagemaker_containers/_process.py#L29, where None needs to be replaced with subprocess.STDOUT. I'm working on the fix, and will post updates here as I progress.
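To illustrate the kind of change being described (a standalone sketch, not the sagemaker-containers code itself): with Python's subprocess module, passing stderr=None lets the child inherit the parent's stderr, while stderr=subprocess.STDOUT sends the child's stderr into the same stream as its stdout. The run_child helper and the inline child script are made up for this example.

```python
import subprocess
import sys

# Inline child script that mimics the entrypoint from the report.
CHILD = (
    "import sys; "
    "sys.stderr.write('Coucou stderr\\n'); "
    "sys.stdout.write('Coucou stdout\\n')"
)


def run_child(stderr_target):
    # Launch the child and return everything that arrived on its stdout pipe.
    proc = subprocess.Popen(
        [sys.executable, '-c', CHILD],
        stdout=subprocess.PIPE,
        stderr=stderr_target,
    )
    out, _ = proc.communicate()
    return out.decode()


# stderr=None: the child's stderr is inherited and bypasses the stdout pipe.
inherited = run_child(None)

# stderr=subprocess.STDOUT: the child's stderr is merged into the stdout pipe.
merged = run_child(subprocess.STDOUT)

print('stderr in inherited capture:', 'Coucou stderr' in inherited)
print('stderr in merged capture:', 'Coucou stderr' in merged)
```

Whether merging the streams is the right fix for the containers is a separate question; the sketch only shows what the two settings do.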

saimidu commented 4 years ago

@tanguycdls Thank you for waiting. Images for PyTorch 1.3.1 have been released, and the equivalent CPU Py3 image can be found at 763104351884.dkr.ecr.eu-west-1.amazonaws.com/pytorch-training:1.3.1-cpu-py3

The PyTorch 1.3.1 images include the update in sagemaker-containers to fix this issue.

ajaykarpur commented 4 years ago

This was fixed in https://github.com/aws/sagemaker-containers/pull/233 and reverted in https://github.com/aws/sagemaker-containers/pull/268