aws / deep-learning-containers

AWS Deep Learning Containers (DLCs) are a set of Docker images for training and serving models in TensorFlow, TensorFlow 2, PyTorch, and MXNet.
https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html

Add functional tests for EC2 deep canaries #4022

Closed arjkesh closed 3 months ago

arjkesh commented 3 months ago

GitHub Issue #, if available:

Note:

Description

Add EC2 functional tests to canary suite. SM and EC2 images will run these tests in us-west-2.

Tests run

NOTE: By default, docker builds are disabled. To build your container, update `dlc_developer_config.toml` and specify the framework to build in `build_frameworks`.

NOTE: If you are creating a PR for a new framework version, please ensure success of the standard, rc, and efa sagemaker remote tests by updating the dlc_developer_config.toml file:

- [ ] `sagemaker_remote_tests = true`
- [ ] `sagemaker_efa_tests = true`
- [ ] `sagemaker_rc_tests = true`

**Additionally, please run the sagemaker local tests in at least one revision:**

- [ ] `sagemaker_local_tests = true`

Formatting

DLC image/dockerfile

Builds to Execute

Fill out the template and click the checkbox of the builds you'd like to execute.

*Note: Replace the version placeholder in each build name with the major.minor framework version (e.g., 2.2) you would like to start.*

- [ ] build_pytorch_training__sm
- [ ] build_pytorch_training__ec2
- [ ] build_pytorch_inference__sm
- [ ] build_pytorch_inference__ec2
- [ ] build_pytorch_inference__graviton
- [ ] build_tensorflow_training__sm
- [ ] build_tensorflow_training__ec2
- [ ] build_tensorflow_inference__sm
- [ ] build_tensorflow_inference__ec2
- [ ] build_tensorflow_inference__graviton

Additional context

PR Checklist

- [ ] I've prepended PR tag with frameworks/job this applies to: [mxnet, tensorflow, pytorch] | [ei/neuron/graviton] | [build] | [test] | [benchmark] | [ec2, ecs, eks, sagemaker]
- [ ] If the PR changes affect SM tests, I've modified `dlc_developer_config.toml` in my PR branch by setting `sagemaker_tests = true` and `efa_tests = true`
- [ ] If this PR changes existing code, the change is fully backward compatible with pre-existing code. (Non backward-compatible changes need special approval.)
- [ ] (If applicable) I've documented below the DLC image/dockerfile this relates to
- [ ] (If applicable) I've documented below the tests I've run on the DLC image
- [ ] (If applicable) I've reviewed the licenses of updated and new binaries and their dependencies to make sure all licenses are on the Apache Software Foundation Third Party License Policy Category A or Category B license list. See [https://www.apache.org/legal/resolved.html](https://www.apache.org/legal/resolved.html).
- [ ] (If applicable) I've scanned the updated and new binaries to make sure they do not have vulnerabilities associated with them.

#### NEURON/GRAVITON Testing Checklist

* When creating a PR:
  - [ ] I've modified `dlc_developer_config.toml` in my PR branch by setting `neuron_mode = true` or `graviton_mode = true`

#### Benchmark Testing Checklist

* When creating a PR:
  - [ ] I've modified `dlc_developer_config.toml` in my PR branch by setting `ec2_benchmark_tests = true` or `sagemaker_benchmark_tests = true`

Pytest Marker Checklist

- [ ] (If applicable) I have added the marker `@pytest.mark.model("")` to the new tests which I have added, to specify the Deep Learning model that is used in the test (use `"N/A"` if the test doesn't use a model)
- [ ] (If applicable) I have added the marker `@pytest.mark.integration("")` to the new tests which I have added, to specify the feature that will be tested
- [ ] (If applicable) I have added the marker `@pytest.mark.multinode()` to the new tests which I have added, to specify the number of nodes used on a multi-node test
- [ ] (If applicable) I have added the marker `@pytest.mark.processor(<"cpu"/"gpu"/"eia"/"neuron">)` to the new tests which I have added, if a test is specifically applicable to only one processor type
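
As a rough illustration of the checklist above, the markers might be applied to new tests as follows. This is only a sketch: the test names, the `"canary"` feature name, and the node count `2` are hypothetical values chosen for the example, not taken from this PR.

```python
import pytest


# Hypothetical single-node test; marker arguments are illustrative only.
@pytest.mark.model("N/A")            # "N/A" because this test uses no model
@pytest.mark.integration("canary")   # assumed feature name
@pytest.mark.processor("cpu")        # test applies to one processor type only
def test_example_canary():
    assert True


# Hypothetical multi-node test; 2 is an assumed node count.
@pytest.mark.model("N/A")
@pytest.mark.integration("canary")
@pytest.mark.multinode(2)
def test_example_multinode():
    assert True
```

pytest warns about markers that are not registered in the project's configuration, so the sketch assumes the repository already registers `model`, `integration`, `multinode`, and `processor`.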

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

arjkesh commented 3 months ago

TF 2.14 inference tests are failing with the following error:

>           raise UnexpectedExit(result)
E           invoke.exceptions.UnexpectedExit: Encountered a bad command exit code!
E           
E           Command: 'python /home/ubuntu/serving/tensorflow_serving/example/mnist_saved_model.py /home/ubuntu/serving/models/mnist'
E           
E           Exit code: 1
E           
E           Stdout:
E           
E           
E           
E           Stderr:
E           
E             File "/home/ubuntu/.local/lib/python3.9/site-packages/tensorflow/python/saved_model/saved_model.py", line 20, in <module>
E               from tensorflow.python.saved_model import builder
E             File "/home/ubuntu/.local/lib/python3.9/site-packages/tensorflow/python/saved_model/builder.py", line 23, in <module>
E               from tensorflow.python.saved_model.builder_impl import _SavedModelBuilder
E             File "/home/ubuntu/.local/lib/python3.9/site-packages/tensorflow/python/saved_model/builder_impl.py", line 26, in <module>
E               from tensorflow.python.framework import dtypes
E             File "/home/ubuntu/.local/lib/python3.9/site-packages/tensorflow/python/framework/dtypes.py", line 37, in <module>
E               _np_bfloat16 = pywrap_ml_dtypes.bfloat16()
E           TypeError: Unable to convert function return value to a Python type! The signature was
E               () -> handle
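
The traceback stops inside `tensorflow/python/framework/dtypes.py` at the `pywrap_ml_dtypes.bfloat16()` call, and the thread does not identify a cause. One possible first diagnostic step is to record which versions of the relevant packages are installed in the failing environment. The sketch below uses only the standard library; the package list (`tensorflow`, `numpy`, `ml_dtypes`) is an assumption about what is worth inspecting, not something confirmed in this PR.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed list of packages to inspect; adjust for the environment under test.
    report = installed_versions(["tensorflow", "numpy", "ml_dtypes"])
    for name, version in report.items():
        print(f"{name}: {version or 'not installed'}")
```

Running this on the EC2 test instance with the same `python` used in the failing command would show which versions are installed under `/home/ubuntu/.local/lib/python3.9/site-packages`.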