aws / deep-learning-containers

AWS Deep Learning Containers are pre-built Docker images that make it easier to run popular deep learning frameworks and tools on AWS.
https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/what-is-dlc.html

[bug] sagemaker_tensorflow fails to import #2611

Open rchurch4 opened 1 year ago

rchurch4 commented 1 year ago

Concise Description: Importing sagemaker_tensorflow results in an undefined symbol error:

tensorflow.python.framework.errors_impl.NotFoundError: //usr/local/lib/python3.9/site-packages/sagemaker_tensorflow/libPipeModeOp.so: undefined symbol: _ZN10tensorflow15TensorShapeBaseINS_11TensorShapeEEC1EN4absl12lts_202103244SpanIKlEE

We have tried sagemaker_tensorflow versions 2.10.0.1.16.0 and 2.11.0.1.17.0 with their respective images (below), and both produce the same error. Our guess, from looking at how the image is built, is that the problem comes from the sagemaker_tensorflow_extensions repository being cloned and installed from source rather than installed with pip. libPipeModeOp.so does not seem to exist in the git repository, so it may need to be generated at install time, and that may not happen with this install method. The CMake file, which references the libPipeMode file, supports this guess. In setup.py, the CMake extension is invoked to build the C++ files, but it appears to reference pipemode_op, whereas the CMake list uses the name pipemodeop. This may be the root of the problem.
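As a possible aid to the diagnosis above, here is a minimal sketch that pulls the length-prefixed identifiers out of the mangled symbol in the error message. It is a deliberately simplified scanner for Itanium-style mangled names, not a full demangler (a tool such as `c++filt` does this properly); the function names are hypothetical. The `lts_20210324` component is an Abseil LTS inline namespace, which suggests the library expects a TensorFlow build compiled against that Abseil release. Whether that explains the failure here is an assumption, not something confirmed in this thread.

```python
# Symbol copied verbatim from the error message in this issue.
SYMBOL = (
    "_ZN10tensorflow15TensorShapeBaseINS_11TensorShapeEEC1E"
    "N4absl12lts_202103244SpanIKlEE"
)


def length_prefixed_names(symbol):
    """Collect <length><name> identifiers from a mangled C++ symbol.

    Simplified on purpose: it skips constructor/destructor markers such as
    C1 or D0 and ignores everything else that is not a length prefix.
    """
    names = []
    i = 0
    while i < len(symbol):
        ch = symbol[i]
        # C1/C2/C3 and D0/D1/D2 mark constructors and destructors, not names.
        if ch in "CD" and i + 1 < len(symbol) and symbol[i + 1].isdigit():
            i += 2
            continue
        if ch.isdigit():
            j = i
            while j < len(symbol) and symbol[j].isdigit():
                j += 1
            length = int(symbol[i:j])
            names.append(symbol[j:j + length])
            i = j + length
            continue
        i += 1
    return names


def abseil_lts_tag(symbol):
    """Return the Abseil LTS inline-namespace tag in the symbol, if any."""
    for name in length_prefixed_names(symbol):
        if name.startswith("lts_"):
            return name
    return None


if __name__ == "__main__":
    print(length_prefixed_names(SYMBOL))
    print(abseil_lts_tag(SYMBOL))
```

Read in order, the identifiers describe a `tensorflow::TensorShapeBase<tensorflow::TensorShape>` constructor that takes an `absl::lts_20210324::Span`, so the missing symbol lives on the TensorFlow side rather than inside libPipeModeOp.so itself.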

As a result, we cannot use the PipeModeDataset class.

DLC image/dockerfile:
763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:2.11.0-cpu-py39-ubuntu20.04-sagemaker
763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:2.10.0-cpu-py39-ubuntu20.04-sagemaker

Current behavior: Importing sagemaker_tensorflow fails for these images.

Expected behavior: Importing sagemaker_tensorflow should not fail for these images.

Additional context: It would be incredibly useful if I could pull the base image to run locally to test this myself and/or debug myself.

timestamp,message
1673990187126,━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.6/42.6 kB 12.3 MB/s eta 0:00:00
1673990187126,"Requirement already satisfied: protobuf<3.20,>=3.9.2 in /usr/local/lib/python3.9/site-packages (from tensorflow==2.10.0->-r requirements.txt (line 20)) (3.19.6)"
1673990187126,Requirement already satisfied: h5py>=2.9.0 in /usr/local/lib/python3.9/site-packages (from tensorflow==2.10.0->-r requirements.txt (line 20)) (3.6.0)
1673990187126,Requirement already satisfied: setuptools in /usr/local/lib/python3.9/site-packages (from tensorflow==2.10.0->-r requirements.txt (line 20)) (65.6.3)
1673990187126,"Collecting tensorboard<2.11,>=2.10"
1673990187126,Downloading tensorboard-2.10.1-py3-none-any.whl (5.9 MB)
1673990187126,━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.9/5.9 MB 114.3 MB/s eta 0:00:00
1673990187126,Requirement already satisfied: libclang>=13.0.0 in /usr/local/lib/python3.9/site-packages (from tensorflow==2.10.0->-r requirements.txt (line 20)) (15.0.6.1)
1673990187126,Requirement already satisfied: flatbuffers>=2.0 in /usr/local/lib/python3.9/site-packages (from tensorflow==2.10.0->-r requirements.txt (line 20)) (23.1.4)
1673990187126,Requirement already satisfied: opt-einsum>=2.3.2 in /usr/local/lib/python3.9/site-packages (from tensorflow==2.10.0->-r requirements.txt (line 20)) (3.3.0)
1673990187126,Requirement already satisfied: google-pasta>=0.1.1 in /usr/local/lib/python3.9/site-packages (from tensorflow==2.10.0->-r requirements.txt (line 20)) (0.2.0)
1673990187126,Requirement already satisfied: termcolor>=1.1.0 in /usr/local/lib/python3.9/site-packages (from tensorflow==2.10.0->-r requirements.txt (line 20)) (2.2.0)
1673990187126,Requirement already satisfied: wrapt>=1.11.0 in /usr/local/lib/python3.9/site-packages (from tensorflow==2.10.0->-r requirements.txt (line 20)) (1.14.1)
1673990187126,"Requirement already satisfied: wheel<1.0,>=0.23.0 in /usr/local/lib/python3.9/site-packages (from astunparse>=1.6.0->tensorflow==2.10.0->-r requirements.txt (line 20)) (0.38.4)"
1673990187126,"Requirement already satisfied: google-auth-oauthlib<0.5,>=0.4.1 in /usr/local/lib/python3.9/site-packages (from tensorboard<2.11,>=2.10->tensorflow==2.10.0->-r requirements.txt (line 20)) (0.4.6)"
1673990187126,"Requirement already satisfied: tensorboard-data-server<0.7.0,>=0.6.0 in /usr/local/lib/python3.9/site-packages (from tensorboard<2.11,>=2.10->tensorflow==2.10.0->-r requirements.txt (line 20)) (0.6.1)"
1673990187126,"Requirement already satisfied: tensorboard-plugin-wit>=1.6.0 in /usr/local/lib/python3.9/site-packages (from tensorboard<2.11,>=2.10->tensorflow==2.10.0->-r requirements.txt (line 20)) (1.8.1)"
1673990187127,"Requirement already satisfied: werkzeug>=1.0.1 in /usr/local/lib/python3.9/site-packages (from tensorboard<2.11,>=2.10->tensorflow==2.10.0->-r requirements.txt (line 20)) (2.2.2)"
1673990187127,"Requirement already satisfied: requests<3,>=2.21.0 in /usr/local/lib/python3.9/site-packages (from tensorboard<2.11,>=2.10->tensorflow==2.10.0->-r requirements.txt (line 20)) (2.27.1)"
1673990187127,"Requirement already satisfied: google-auth<3,>=1.6.3 in /usr/local/lib/python3.9/site-packages (from tensorboard<2.11,>=2.10->tensorflow==2.10.0->-r requirements.txt (line 20)) (2.16.0)"
1673990187127,"Requirement already satisfied: markdown>=2.6.8 in /usr/local/lib/python3.9/site-packages (from tensorboard<2.11,>=2.10->tensorflow==2.10.0->-r requirements.txt (line 20)) (3.4.1)"
1673990187127,"Requirement already satisfied: rsa<5,>=3.1.4 in /usr/local/lib/python3.9/site-packages (from google-auth<3,>=1.6.3->tensorboard<2.11,>=2.10->tensorflow==2.10.0->-r requirements.txt (line 20)) (4.7.2)"
1673990187127,"Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.9/site-packages (from google-auth<3,>=1.6.3->tensorboard<2.11,>=2.10->tensorflow==2.10.0->-r requirements.txt (line 20)) (0.2.8)"
1673990187127,"Requirement already satisfied: cachetools<6.0,>=2.0.0 in /usr/local/lib/python3.9/site-packages (from google-auth<3,>=1.6.3->tensorboard<2.11,>=2.10->tensorflow==2.10.0->-r requirements.txt (line 20)) (5.2.1)"
1673990187127,"Requirement already satisfied: requests-oauthlib>=0.7.0 in /usr/local/lib/python3.9/site-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard<2.11,>=2.10->tensorflow==2.10.0->-r requirements.txt (line 20)) (1.3.1)"
1673990187127,"Requirement already satisfied: importlib-metadata>=4.4 in /usr/local/lib/python3.9/site-packages (from markdown>=2.6.8->tensorboard<2.11,>=2.10->tensorflow==2.10.0->-r requirements.txt (line 20)) (4.13.0)"
1673990187127,"Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.9/site-packages (from requests<3,>=2.21.0->tensorboard<2.11,>=2.10->tensorflow==2.10.0->-r requirements.txt (line 20)) (3.4)"
1673990187127,"Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.9/site-packages (from requests<3,>=2.21.0->tensorboard<2.11,>=2.10->tensorflow==2.10.0->-r requirements.txt (line 20)) (2.0.12)"
1673990187127,"Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.9/site-packages (from requests<3,>=2.21.0->tensorboard<2.11,>=2.10->tensorflow==2.10.0->-r requirements.txt (line 20)) (2022.12.7)"
1673990187127,"Requirement already satisfied: MarkupSafe>=2.1.1 in /usr/local/lib/python3.9/site-packages (from werkzeug>=1.0.1->tensorboard<2.11,>=2.10->tensorflow==2.10.0->-r requirements.txt (line 20)) (2.1.1)"
1673990187127,"Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.9/site-packages (from importlib-metadata>=4.4->markdown>=2.6.8->tensorboard<2.11,>=2.10->tensorflow==2.10.0->-r requirements.txt (line 20)) (3.11.0)"
1673990187127,"Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in /usr/local/lib/python3.9/site-packages (from pyasn1-modules>=0.2.1->google-auth<3,>=1.6.3->tensorboard<2.11,>=2.10->tensorflow==2.10.0->-r requirements.txt (line 20)) (0.4.8)"
1673990187127,"Requirement already satisfied: oauthlib>=3.0.0 in /usr/local/lib/python3.9/site-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard<2.11,>=2.10->tensorflow==2.10.0->-r requirements.txt (line 20)) (3.2.2)"
1673990187127,Building wheels for collected packages: train-model-package
1673990187127,Building wheel for train-model-package (setup.py): started
1673990188128,Building wheel for train-model-package (setup.py): finished with status 'done'
1673990188128,Created wheel for train-model-package: filename=train_model_package-0.0.1-py3-none-any.whl size=5606 sha256=9f92267ef322d1e378aaa7bcd251a2aef5813abe64386b91caab30221bbc94f5
1673990188128,Stored in directory: /tmp/pip-ephem-wheel-cache-lik6wy2t/wheels/40/03/3a/5f39818cea87b3c154b54d046a775b3da4b8ed9b642b8d50e6
1673990188128,Successfully built train-model-package
1673990190128,"Installing collected packages: pytz, keras, urllib3, train-model-package, tensorflow-estimator, sagemaker-tensorflow, python-dotenv, psycopg2-binary, packaging, numpy, jmespath, greenlet, SQLAlchemy, scipy, patsy, pandas, keras-preprocessing, botocore, boto3, tensorboard, tensorflow"
1673990190128,Attempting uninstall: pytz
1673990190129,Found existing installation: pytz 2022.7
1673990190129,Uninstalling pytz-2022.7:
1673990190129,Successfully uninstalled pytz-2022.7
1673990190129,Attempting uninstall: keras
1673990190129,Found existing installation: keras 2.11.0
1673990190129,Uninstalling keras-2.11.0:
1673990191129,Successfully uninstalled keras-2.11.0
1673990192129,Attempting uninstall: urllib3
1673990192130,Found existing installation: urllib3 1.26.13
1673990192130,Uninstalling urllib3-1.26.13:
1673990192130,Successfully uninstalled urllib3-1.26.13
1673990192130,Attempting uninstall: tensorflow-estimator
1673990192130,Found existing installation: tensorflow-estimator 2.11.0
1673990192130,Uninstalling tensorflow-estimator-2.11.0:
1673990192130,Successfully uninstalled tensorflow-estimator-2.11.0
1673990193130,Attempting uninstall: sagemaker-tensorflow
1673990193130,Found existing installation: sagemaker-tensorflow 2.11.0.1.17.0
1673990193130,Uninstalling sagemaker-tensorflow-2.11.0.1.17.0:
1673990193130,Successfully uninstalled sagemaker-tensorflow-2.11.0.1.17.0
1673990193130,Attempting uninstall: packaging
1673990193130,Found existing installation: packaging 23.0
1673990193130,Uninstalling packaging-23.0:
1673990193130,Successfully uninstalled packaging-23.0
1673990193130,Attempting uninstall: numpy
1673990193130,Found existing installation: numpy 1.23.5
1673990193130,Uninstalling numpy-1.23.5:
1673990193131,Successfully uninstalled numpy-1.23.5
1673990195131,Attempting uninstall: jmespath
1673990195132,Found existing installation: jmespath 1.0.1
1673990195132,Uninstalling jmespath-1.0.1:
1673990196132,Successfully uninstalled jmespath-1.0.1
1673990196132,Attempting uninstall: greenlet
1673990196132,Found existing installation: greenlet 2.0.1
1673990196132,Uninstalling greenlet-2.0.1:
1673990196132,Successfully uninstalled greenlet-2.0.1
1673990196132,Attempting uninstall: scipy
1673990196132,Found existing installation: scipy 1.8.0
1673990197132,Uninstalling scipy-1.8.0:
1673990199133,Successfully uninstalled scipy-1.8.0
1673990202134,Attempting uninstall: pandas
1673990202134,Found existing installation: pandas 1.5.2
1673990203134,Uninstalling pandas-1.5.2:
1673990205135,Successfully uninstalled pandas-1.5.2
1673990209136,Attempting uninstall: botocore
1673990209136,Found existing installation: botocore 1.29.46
1673990209136,Uninstalling botocore-1.29.46:
1673990211137,Successfully uninstalled botocore-1.29.46
1673990212137,Attempting uninstall: boto3
1673990212137,Found existing installation: boto3 1.26.46
1673990212137,Uninstalling boto3-1.26.46:
1673990212137,Successfully uninstalled boto3-1.26.46
1673990212137,Attempting uninstall: tensorboard
1673990212137,Found existing installation: tensorboard 2.11.0
1673990212137,Uninstalling tensorboard-2.11.0:
1673990212137,Successfully uninstalled tensorboard-2.11.0
1673990229159,ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
1673990229159,"tf-models-official 2.11.0 requires pyyaml<6.0,>=5.1, but you have pyyaml 6.0 which is incompatible."
1673990229159,"tf-models-official 2.11.0 requires tensorflow~=2.11.0, but you have tensorflow 2.10.0 which is incompatible."
1673990229159,"tensorflow-text 2.11.0 requires tensorflow<2.12,>=2.11.0; platform_machine != ""arm64"" or platform_system != ""Darwin"", but you have tensorflow 2.10.0 which is incompatible."
1673990229159,"tensorflow-cpu 2.11.0 requires keras<2.12,>=2.11.0, but you have keras 2.10.0 which is incompatible."
1673990229159,"tensorflow-cpu 2.11.0 requires tensorboard<2.12,>=2.11, but you have tensorboard 2.10.1 which is incompatible."
1673990229159,"tensorflow-cpu 2.11.0 requires tensorflow-estimator<2.12,>=2.11.0, but you have tensorflow-estimator 2.10.0 which is incompatible."
1673990229159,"sagemaker 2.127.0 requires boto3<2.0,>=1.26.28, but you have boto3 1.24.3 which is incompatible."
1673990229159,"gevent 22.10.2 requires greenlet>=2.0.0; platform_python_implementation == ""CPython"", but you have greenlet 1.1.2 which is incompatible."
1673990229159,"awscli 1.27.46 requires botocore==1.29.46, but you have botocore 1.27.3 which is incompatible."
1673990229159,"awscli 1.27.46 requires PyYAML<5.5,>=3.10, but you have pyyaml 6.0 which is incompatible."
1673990229159,Successfully installed SQLAlchemy-1.4.37 boto3-1.24.3 botocore-1.27.3 greenlet-1.1.2 jmespath-1.0.0 keras-2.10.0 keras-preprocessing-1.1.2 numpy-1.22.4 packaging-21.3 pandas-1.4.2 patsy-0.5.2 psycopg2-binary-2.9.3 python-dotenv-0.20.0 pytz-2022.1 sagemaker-tensorflow-2.10.0.1.16.0 scipy-1.8.1 tensorboard-2.10.1 tensorflow-2.10.0 tensorflow-estimator-2.10.0 train-model-package-0.0.1 urllib3-1.26.9
1673990229159,WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
1673990230160,"2023-01-17 21:17:09,538 sagemaker-training-toolkit INFO     Waiting for the process to finish and give a return code."
1673990230160,"2023-01-17 21:17:09,538 sagemaker-training-toolkit INFO     Done waiting for a return code. Received 0 from exiting process."
1673990230160,"2023-01-17 21:17:09,548 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)"
1673990230160,"2023-01-17 21:17:09,555 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)"
1673990230160,"2023-01-17 21:17:09,580 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)"
1673990230160,"2023-01-17 21:17:09,589 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)"
1673990230160,"2023-01-17 21:17:09,612 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)"
1673990230160,"2023-01-17 21:17:09,618 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)"
1673990230160,"2023-01-17 21:17:09,636 sagemaker-training-toolkit INFO     Invoking user script"
1673990230160,Training Env:
1673990230162,"{
    ""additional_framework_parameters"": {},
    ""channel_input_dirs"": {
        ""test"": ""/opt/ml/input/data/test"",
        ""train"": ""/opt/ml/input/data/train""
    },
    ""current_host"": ""algo-1"",
    ""current_instance_group"": ""homogeneousCluster"",
    ""current_instance_group_hosts"": [
        ""algo-1""
    ],
    ""current_instance_type"": ""ml.m5.large"",
    ""distribution_hosts"": [],
    ""distribution_instance_groups"": [],
    ""framework_module"": ""sagemaker_tensorflow_container.training:main"",
    ""hosts"": [
        ""algo-1""
    ],
    ""hyperparameters"": {
        ""aws-region"": ""us-east-1"",
        ""client-id"": 232,
        ""data-bucket"": ""5out-revenue-data-prod"",
        ""granularity"": ""hourly"",
        ""input-bucket"": ""5out-inputs-prod"",
        ""lookback-days"": 480,
        ""model-bucket"": ""5out-models-prod"",
        ""model_dir"": ""s3://5out-models-prod/232/1/hourly/TrainModel-232-1-ed736467-46dd-4cc9-b3de-373c41254970/model"",
        ""predict-from-date"": ""2023-01-17"",
        ""restaurant-id"": 1,
        ""restaurant-name"": null,
        ""stage"": ""prod"",
        ""workers"": ""1"",
        ""x-out"": 35
    },
    ""input_config_dir"": ""/opt/ml/input/config"",
    ""input_data_config"": {
        ""test"": {
            ""TrainingInputMode"": ""Pipe"",
            ""S3DistributionType"": ""FullyReplicated"",
            ""RecordWrapperType"": ""None""
        },
        ""train"": {
            ""TrainingInputMode"": ""Pipe"",
            ""S3DistributionType"": ""FullyReplicated"",
            ""RecordWrapperType"": ""None""
        }
    },
    ""input_dir"": ""/opt/ml/input"",
    ""instance_groups"": [
        ""homogeneousCluster""
    ],
    ""instance_groups_dict"": {
        ""homogeneousCluster"": {
            ""instance_group_name"": ""homogeneousCluster"",
            ""instance_type"": ""ml.m5.large"",
            ""hosts"": [
                ""algo-1""
            ]
        }
    },
    ""is_hetero"": false,
    ""is_master"": true,
    ""is_modelparallel_enabled"": null,
    ""is_smddpmprun_installed"": false,
    ""job_name"": ""TrainModel-232-1-ed736467-46dd-4cc9-b3de-373c41254970"",
    ""log_level"": 20,
    ""master_hostname"": ""algo-1"",
    ""model_dir"": ""/opt/ml/model"",
    ""module_dir"": ""s3://5out-lms-code/code/TrainModel-232-1-ed736467-46dd-4cc9-b3de-373c41254970/source/sourcedir.tar.gz"",
    ""module_name"": ""main"",
    ""network_interface_name"": ""eth0"",
    ""num_cpus"": 2,
    ""num_gpus"": 0,
    ""num_neurons"": 0,
    ""output_data_dir"": ""/opt/ml/output/data"",
    ""output_dir"": ""/opt/ml/output"",
    ""output_intermediate_dir"": ""/opt/ml/output/intermediate"",
    ""resource_config"": {
        ""current_host"": ""algo-1"",
        ""current_instance_type"": ""ml.m5.large"",
        ""current_group_name"": ""homogeneousCluster"",
        ""hosts"": [
            ""algo-1""
        ],
        ""instance_groups"": [
            {
                ""instance_group_name"": ""homogeneousCluster"",
                ""instance_type"": ""ml.m5.large"",
                ""hosts"": [
                    ""algo-1""
                ]
            }
        ],
        ""network_interface_name"": ""eth0""
    },
    ""user_entry_point"": ""main.py"""
1673990230162,}
1673990230162,Environment variables:
1673990230162,"SM_HOSTS=[""algo-1""]"
1673990230162,SM_NETWORK_INTERFACE_NAME=eth0
1673990230162,"SM_HPS={""aws-region"":""us-east-1"",""client-id"":232,""data-bucket"":""5out-revenue-data-prod"",""granularity"":""hourly"",""input-bucket"":""5out-inputs-prod"",""lookback-days"":480,""model-bucket"":""5out-models-prod"",""model_dir"":""s3://5out-models-prod/232/1/hourly/TrainModel-232-1-ed736467-46dd-4cc9-b3de-373c41254970/model"",""predict-from-date"":""2023-01-17"",""restaurant-id"":1,""restaurant-name"":null,""stage"":""prod"",""workers"":""1"",""x-out"":35}"
1673990230162,SM_USER_ENTRY_POINT=main.py
1673990230162,SM_FRAMEWORK_PARAMS={}
1673990230162,"SM_RESOURCE_CONFIG={""current_group_name"":""homogeneousCluster"",""current_host"":""algo-1"",""current_instance_type"":""ml.m5.large"",""hosts"":[""algo-1""],""instance_groups"":[{""hosts"":[""algo-1""],""instance_group_name"":""homogeneousCluster"",""instance_type"":""ml.m5.large""}],""network_interface_name"":""eth0""}"
1673990230162,"SM_INPUT_DATA_CONFIG={""test"":{""RecordWrapperType"":""None"",""S3DistributionType"":""FullyReplicated"",""TrainingInputMode"":""Pipe""},""train"":{""RecordWrapperType"":""None"",""S3DistributionType"":""FullyReplicated"",""TrainingInputMode"":""Pipe""}}"
1673990230162,SM_OUTPUT_DATA_DIR=/opt/ml/output/data
1673990230162,"SM_CHANNELS=[""test"",""train""]"
1673990230162,SM_CURRENT_HOST=algo-1
1673990230162,SM_CURRENT_INSTANCE_TYPE=ml.m5.large
1673990230162,SM_CURRENT_INSTANCE_GROUP=homogeneousCluster
1673990230162,"SM_CURRENT_INSTANCE_GROUP_HOSTS=[""algo-1""]"
1673990230162,"SM_INSTANCE_GROUPS=[""homogeneousCluster""]"
1673990230162,"SM_INSTANCE_GROUPS_DICT={""homogeneousCluster"":{""hosts"":[""algo-1""],""instance_group_name"":""homogeneousCluster"",""instance_type"":""ml.m5.large""}}"
1673990230162,SM_DISTRIBUTION_INSTANCE_GROUPS=[]
1673990230162,SM_IS_HETERO=false
1673990230162,SM_MODULE_NAME=main
1673990230162,SM_LOG_LEVEL=20
1673990230162,SM_FRAMEWORK_MODULE=sagemaker_tensorflow_container.training:main
1673990230162,SM_INPUT_DIR=/opt/ml/input
1673990230162,SM_INPUT_CONFIG_DIR=/opt/ml/input/config
1673990230162,SM_OUTPUT_DIR=/opt/ml/output
1673990230163,SM_NUM_CPUS=2
1673990230163,SM_NUM_GPUS=0
1673990230163,SM_NUM_NEURONS=0
1673990230163,SM_MODEL_DIR=/opt/ml/model
1673990230163,SM_MODULE_DIR=s3://5out-lms-code/code/TrainModel-232-1-ed736467-46dd-4cc9-b3de-373c41254970/source/sourcedir.tar.gz
1673990230163,"SM_TRAINING_ENV={""additional_framework_parameters"":{},""channel_input_dirs"":{""test"":""/opt/ml/input/data/test"",""train"":""/opt/ml/input/data/train""},""current_host"":""algo-1"",""current_instance_group"":""homogeneousCluster"",""current_instance_group_hosts"":[""algo-1""],""current_instance_type"":""ml.m5.large"",""distribution_hosts"":[],""distribution_instance_groups"":[],""framework_module"":""sagemaker_tensorflow_container.training:main"",""hosts"":[""algo-1""],""hyperparameters"":{""aws-region"":""us-east-1"",""client-id"":232,""data-bucket"":""5out-revenue-data-prod"",""granularity"":""hourly"",""input-bucket"":""5out-inputs-prod"",""lookback-days"":480,""model-bucket"":""5out-models-prod"",""model_dir"":""s3://5out-models-prod/232/1/hourly/TrainModel-232-1-ed736467-46dd-4cc9-b3de-373c41254970/model"",""predict-from-date"":""2023-01-17"",""restaurant-id"":1,""restaurant-name"":null,""stage"":""prod"",""workers"":""1"",""x-out"":35},""input_config_dir"":""/opt/ml/input/config"",""input_data_config"":{""test"":{""RecordWrapperType"":""None"",""S3DistributionType"":""FullyReplicated"",""TrainingInputMode"":""Pipe""},""train"":{""RecordWrapperType"":""None"",""S3DistributionType"":""FullyReplicated"",""TrainingInputMode"":""Pipe""}},""input_dir"":""/opt/ml/input"",""instance_groups"":[""homogeneousCluster""],""instance_groups_dict"":{""homogeneousCluster"":{""hosts"":[""algo-1""],""instance_group_name"":""homogeneousCluster"",""instance_type"":""ml.m5.large""}},""is_hetero"":false,""is_master"":true,""is_modelparallel_enabled"":null,""is_smddpmprun_installed"":false,""job_name"":""TrainModel-232-1-ed736467-46dd-4cc9-b3de-373c41254970"",""log_level"":20,""master_hostname"":""algo-1"",""model_dir"":""/opt/ml/model"",""module_dir"":""s3://5out-lms-code/code/TrainModel-232-1-ed736467-46dd-4cc9-b3de-373c41254970/source/sourcedir.tar.gz"",""module_name"":""main"",""network_interface_name"":""eth0"",""num_cpus"":2,""num_gpus"":0,""num_neurons"":0,"
"output_data_dir"":""/opt/ml/output/data"",""output_dir"":""/opt/ml/output"",""output_intermediate_dir"":""/opt/ml/output/intermediate"",""resource_config"":{""current_group_name"":""homogeneousCluster"",""current_host"":""algo-1"",""current_instance_type"":""ml.m5.large"",""hosts"":[""algo-1""],""instance_groups"":[{""hosts"":[""algo-1""],""instance_group_name"":""homogeneousCluster"",""instance_type"":""ml.m5.large""}],""network_interface_name"":""eth0""},""user_entry_point"":""main.py""}"
1673990230163,"SM_USER_ARGS=[""--aws-region"",""us-east-1"",""--client-id"",""232"",""--data-bucket"",""5out-revenue-data-prod"",""--granularity"",""hourly"",""--input-bucket"",""5out-inputs-prod"",""--lookback-days"",""480"",""--model-bucket"",""5out-models-prod"",""--model_dir"",""s3://5out-models-prod/232/1/hourly/TrainModel-232-1-ed736467-46dd-4cc9-b3de-373c41254970/model"",""--predict-from-date"",""2023-01-17"",""--restaurant-id"",""1"",""--restaurant-name"","""",""--stage"",""prod"",""--workers"",""1"",""--x-out"",""35""]"
1673990230163,SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
1673990230163,SM_CHANNEL_TEST=/opt/ml/input/data/test
1673990230163,SM_CHANNEL_TRAIN=/opt/ml/input/data/train
1673990230163,SM_HP_AWS-REGION=us-east-1
1673990230163,SM_HP_CLIENT-ID=232
1673990230163,SM_HP_DATA-BUCKET=5out-revenue-data-prod
1673990230163,SM_HP_GRANULARITY=hourly
1673990230163,SM_HP_INPUT-BUCKET=5out-inputs-prod
1673990230163,SM_HP_LOOKBACK-DAYS=480
1673990230163,SM_HP_MODEL-BUCKET=5out-models-prod
1673990230163,SM_HP_MODEL_DIR=s3://5out-models-prod/232/1/hourly/TrainModel-232-1-ed736467-46dd-4cc9-b3de-373c41254970/model
1673990230163,SM_HP_PREDICT-FROM-DATE=2023-01-17
1673990230163,SM_HP_RESTAURANT-ID=1
1673990230163,SM_HP_RESTAURANT-NAME=
1673990230163,SM_HP_STAGE=prod
1673990230163,SM_HP_WORKERS=1
1673990230163,SM_HP_X-OUT=35
1673990230164,PYTHONPATH=/opt/ml/code:/usr/local/bin:/usr/local/lib/python39.zip:/usr/local/lib/python3.9:/usr/local/lib/python3.9/lib-dynload:/usr/local/lib/python3.9/site-packages:/usr/local/lib/python3.9/site-packages/smdebug-1.0.25b20230109-py3.9.egg:/usr/local/lib/python3.9/site-packages/pyinstrument-3.4.2-py3.9.egg:/usr/local/lib/python3.9/site-packages/pyinstrument_cext-0.2.4-py3.9-linux-x86_64.egg
1673990230164,Invoking script with the following command:
1673990230164,/usr/local/bin/python3.9 -m main --aws-region us-east-1 --client-id 232 --data-bucket 5out-revenue-data-prod --granularity hourly --input-bucket 5out-inputs-prod --lookback-days 480 --model-bucket 5out-models-prod --model_dir s3://5out-models-prod/232/1/hourly/TrainModel-232-1-ed736467-46dd-4cc9-b3de-373c41254970/model --predict-from-date 2023-01-17 --restaurant-id 1 --restaurant-name  --stage prod --workers 1 --x-out 35
1673990230164,Extension horovod.torch has not been built: /usr/local/lib/python3.9/site-packages/horovod/torch/mpi_lib_v2.cpython-39-x86_64-linux-gnu.so not found
1673990230164,"If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error."
1673990230164,"Warning! MPI libs are missing, but python applications are still available."
1673990231164,2023-01-17 21:17:10.308751: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/openmpi/lib::/usr/local/lib
1673990231164,2023-01-17 21:17:10.337602: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
1673990232164,2023-01-17 21:17:11.658037: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/openmpi/lib::/usr/local/lib
1673990232165,2023-01-17 21:17:11.658229: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/openmpi/lib::/usr/local/lib
1673990232165,"2023-01-17 21:17:11.658244: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly."
1673990233165,"Traceback (most recent call last):
  File ""/usr/local/lib/python3.9/runpy.py"", line 197, in _run_module_as_main"
1673990233165,"return _run_code(code, main_globals, None,
  File ""/usr/local/lib/python3.9/runpy.py"", line 87, in _run_code"
1673990233165,"exec(code, run_globals)
  File ""/opt/ml/code/main.py"", line 7, in <module>"
1673990233165,"from helpers.neural import load_dataset
  File ""/opt/ml/code/helpers/neural.py"", line 2, in <module>"
1673990233165,"from sagemaker_tensorflow import PipeModeDataset
  File ""/usr/local/lib/python3.9/site-packages/sagemaker_tensorflow/__init__.py"", line 15, in <module>"
1673990233165,"from sagemaker_tensorflow.pipemode import PipeModeDataset, PipeModeDatasetException
  File ""/usr/local/lib/python3.9/site-packages/sagemaker_tensorflow/pipemode.py"", line 38, in <module>"
1673990233165,"class PipeModeDataset(dataset_ops.Dataset):
  File ""/usr/local/lib/python3.9/site-packages/sagemaker_tensorflow/pipemode.py"", line 41, in PipeModeDataset"
1673990233165,"_tf_plugin = _load_plugin()
  File ""/usr/local/lib/python3.9/site-packages/sagemaker_tensorflow/pipemode.py"", line 29, in _load_plugin"
1673990233165,"return tf.load_op_library(tf_plugin_path)
  File ""/usr/local/lib/python3.9/site-packages/tensorflow/python/framework/load_library.py"", line 54, in load_op_library"
1673990233165,lib_handle = py_tf.TF_LoadLibrary(library_filename)
1673990233165,tensorflow.python.framework.errors_impl.NotFoundError
1673990233165,: //usr/local/lib/python3.9/site-packages/sagemaker_tensorflow/libPipeModeOp.so: undefined symbol: _ZN10tensorflow15TensorShapeBaseINS_11TensorShapeEEC1EN4absl12lts_202103244SpanIKlEE
1673990234166,"2023-01-17 21:17:13,374 sagemaker-training-toolkit INFO     Waiting for the process to finish and give a return code."
1673990234166,"2023-01-17 21:17:13,375 sagemaker-training-toolkit INFO     Done waiting for a return code. Received 1 from exiting process."
1673990234166,"2023-01-17 21:17:13,376 sagemaker-training-toolkit ERROR    Reporting training FAILURE"
1673990234166,"2023-01-17 21:17:13,376 sagemaker-training-toolkit ERROR    NotFoundError:"
1673990234166,ExitCode 1
1673990234166,"ErrorMessage ""from sagemaker_tensorflow.pipemode import PipeModeDataset, PipeModeDatasetException
 File ""/usr/local/lib/python3.9/site-packages/sagemaker_tensorflow/pipemode.py"", line 38, in <module>
 class PipeModeDataset(dataset_ops.Dataset)
 File ""/usr/local/lib/python3.9/site-packages/sagemaker_tensorflow/pipemode.py"", line 41, in PipeModeDataset
 _tf_plugin = _load_plugin()
 File ""/usr/local/lib/python3.9/site-packages/sagemaker_tensorflow/pipemode.py"", line 29, in _load_plugin
 return tf.load_op_library(tf_plugin_path)
 File ""/usr/local/lib/python3.9/site-packages/tensorflow/python/framework/load_library.py"", line 54, in load_op_library
 lib_handle = py_tf.TF_LoadLibrary(library_filename)
 tensorflow.python.framework.errors_impl.NotFoundError
 //usr/local/lib/python3.9/site-packages/sagemaker_tensorflow/libPipeModeOp.so: undefined symbol: _ZN10tensorflow15TensorShapeBaseINS_11TensorShapeEEC1EN4absl12lts_202103244SpanIKlEE"""
1673990234166,"Command ""/usr/local/bin/python3.9 -m main --aws-region us-east-1 --client-id 232 --data-bucket 5out-revenue-data-prod --granularity hourly --input-bucket 5out-inputs-prod --lookback-days 480 --model-bucket 5out-models-prod --model_dir s3://5out-models-prod/232/1/hourly/TrainModel-232-1-ed736467-46dd-4cc9-b3de-373c41254970/model --predict-from-date 2023-01-17 --restaurant-id 1 --restaurant-name  --stage prod --workers 1 --x-out 35"""
1673990234166,"2023-01-17 21:17:13,377 sagemaker-training-toolkit ERROR    Encountered exit_code 1"

ShiboXing commented 1 year ago

Hi @rchurch4, we will take a look. Thanks for raising the issue.

ShiboXing commented 1 year ago

Hi @rchurch4, did you use a requirements.txt in source_dir to install third-party packages? If so, can you share that file as well? We tried executing import sagemaker_tensorflow in the container and it didn't throw an error. Installing some third-party PyPI packages could disrupt the shared library, though.
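One way to check whether a requirements.txt changed the TensorFlow stack is to compare installed package versions inside the container. The pip log in this issue shows tensorflow 2.10.0 and sagemaker-tensorflow 2.10.0.1.16.0 being installed while tensorflow-cpu 2.11.0 stays in place. The sketch below flags that kind of mix. It assumes that the first three components of a sagemaker-tensorflow version name the TensorFlow release it targets, which is inferred from the versions quoted in this thread; the helper names are hypothetical.

```python
from importlib import metadata

# Packages that all need to agree on a TensorFlow release (assumed list).
PACKAGES = ("tensorflow", "tensorflow-cpu", "sagemaker-tensorflow")


def installed_versions(packages=PACKAGES):
    """Return {package: version or None} for the current environment."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def tf_prefix(version):
    """Keep the first three dotted components, e.g. 2.11.0.1.17.0 -> 2.11.0."""
    return ".".join(version.split(".")[:3])


def distinct_tf_versions(versions):
    """Sorted TensorFlow release prefixes among the installed packages.

    More than one entry means the packages target different releases.
    """
    return sorted({tf_prefix(v) for v in versions.values() if v is not None})


if __name__ == "__main__":
    versions = installed_versions()
    print(versions)
    if len(distinct_tf_versions(versions)) > 1:
        print("Mixed TensorFlow releases:", distinct_tf_versions(versions))
```

Running this once in the unmodified image and once after the requirements.txt install would show whether the install step is what introduces the mismatch.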

We are still trying to replicate this issue. If you wish to debug locally, you can docker pull the image and use local_gpu as the instance type and LocalSession for your SageMaker job.
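For a quick local check that does not involve a training job, the import can be tried directly in the pulled image. This is only a sketch: it assumes Docker is installed, that you have already authenticated to the ECR registry, and that `python3.9` is on the image's PATH (the log above shows /usr/local/bin/python3.9). Overriding the entrypoint is a precaution in case the image defines one. The function name is hypothetical.

```python
import shlex

# Image URI taken from the issue description.
IMAGE = (
    "763104351884.dkr.ecr.us-east-1.amazonaws.com/"
    "tensorflow-training:2.11.0-cpu-py39-ubuntu20.04-sagemaker"
)


def import_check_command(image, module="sagemaker_tensorflow"):
    """Build a docker command that tries to import `module` in `image`."""
    return [
        "docker", "run", "--rm",
        "--entrypoint", "python3.9",  # assumed interpreter name in the image
        image,
        "-c", f"import {module}",
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(import_check_command(IMAGE)))
```

A zero exit status from the printed command means the import works in the stock image, which would point toward packages installed later by the training job.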

tejaschumbalkar commented 1 year ago

@rchurch4 Can you provide the above information if the issue still persists?