aws / deep-learning-containers

AWS Deep Learning Containers are pre-built Docker images that make it easier to run popular deep learning frameworks and tools on AWS.
https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/what-is-dlc.html

[bug] failed to install torch-tensorrt #3903

Open geraldstanje opened 6 months ago

geraldstanje commented 6 months ago


Error Message:

09T18:21:42.631Z INFO: pip is looking at multiple versions of torch-tensorrt to determine which version is compatible with other requirements. This could take a while.
2024-05-09T18:21:42.882Z Collecting torch-tensorrt (from -r /opt/ml/model/code/requirements.txt (line 6)) Using cached torch_tensorrt-1.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB) Using cached torch-tensorrt-0.0.0.post1.tar.gz (9.0 kB) Preparing metadata (setup.py): started Preparing metadata (setup.py): finished with status 'error' error: subprocess-exited-with-error × python setup.py egg_info did not run successfully. │ exit code: 1 ╰─> [13 lines of output] Traceback (most recent call last): File "", line 2, in File "", line 34, in File "/home/model-server/tmp/pip-install-_yc29umj/torch-tensorrt_47a74f002be54836bec3589380d28c89/setup.py", line 125, in raise RuntimeError(open("ERROR.txt", "r").read()) RuntimeError: ########################################################################################### The package you are trying to install is only a placeholder project on PyPI.org repository. To install Torch-TensorRT please run the following command: $ pip install torch-tensorrt -f https://github.com/NVIDIA/Torch-TensorRT/releases ########################################################################################### [end of output] note: This error originates from a subprocess, and is likely not a problem with pip.
2024-05-09T18:21:42.882Z error: metadata-generation-failed
2024-05-09T18:21:42.882Z × Encountered error while generating package metadata.

Entire log:

2024-05-09T17:52:56.655Z    Sagemaker TS environment variables have been set and will be used for single model endpoint.
2024-05-09T17:52:56.655Z    Collecting sagemaker-inference==1.10.1 (from -r /opt/ml/model/code/requirements.txt (line 1)) Downloading sagemaker_inference-1.10.1.tar.gz (23 kB) Preparing metadata (setup.py): started Preparing metadata (setup.py): finished with status 'done'
2024-05-09T17:52:56.808Z    Collecting setfit==1.0.1 (from -r /opt/ml/model/code/requirements.txt (line 2)) Downloading setfit-1.0.1-py3-none-any.whl.metadata (11 kB)
2024-05-09T17:52:56.808Z    Collecting transformers==4.37.2 (from -r /opt/ml/model/code/requirements.txt (line 3)) Downloading transformers-4.37.2-py3-none-any.whl.metadata (129 kB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 129.4/129.4 kB 9.3 MB/s eta 0:00:00
2024-05-09T17:52:56.808Z    Requirement already satisfied: torch==2.1.0 in /opt/conda/lib/python3.10/site-packages (from -r /opt/ml/model/code/requirements.txt (line 4)) (2.1.0+cu118)
2024-05-09T17:52:57.059Z    Collecting optimum (from -r /opt/ml/model/code/requirements.txt (line 5)) Downloading optimum-1.19.2-py3-none-any.whl.metadata (19 kB)
2024-05-09T17:52:57.059Z    Collecting torch-tensorrt (from -r /opt/ml/model/code/requirements.txt (line 6)) Downloading torch_tensorrt-1.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
2024-05-09T17:52:57.059Z    Requirement already satisfied: boto3 in /opt/conda/lib/python3.10/site-packages (from sagemaker-inference==1.10.1->-r /opt/ml/model/code/requirements.txt (line 1)) (1.28.60)
2024-05-09T17:52:57.059Z    Requirement already satisfied: numpy in /opt/conda/lib/python3.10/site-packages (from sagemaker-inference==1.10.1->-r /opt/ml/model/code/requirements.txt (line 1)) (1.24.4)
2024-05-09T17:52:57.059Z    Requirement already satisfied: six in /opt/conda/lib/python3.10/site-packages (from sagemaker-inference==1.10.1->-r /opt/ml/model/code/requirements.txt (line 1)) (1.16.0)
2024-05-09T17:52:57.059Z    Requirement already satisfied: psutil in /opt/conda/lib/python3.10/site-packages (from sagemaker-inference==1.10.1->-r /opt/ml/model/code/requirements.txt (line 1)) (5.9.5)
2024-05-09T17:52:57.059Z    Requirement already satisfied: retrying<1.4,>=1.3.3 in /opt/conda/lib/python3.10/site-packages (from sagemaker-inference==1.10.1->-r /opt/ml/model/code/requirements.txt (line 1)) (1.3.4)
2024-05-09T17:52:57.059Z    Requirement already satisfied: scipy in /opt/conda/lib/python3.10/site-packages (from sagemaker-inference==1.10.1->-r /opt/ml/model/code/requirements.txt (line 1)) (1.10.1)
2024-05-09T17:52:57.059Z    Collecting datasets>=2.3.0 (from setfit==1.0.1->-r /opt/ml/model/code/requirements.txt (line 2)) Downloading datasets-2.19.1-py3-none-any.whl.metadata (19 kB)
2024-05-09T17:52:57.059Z    Collecting sentence-transformers>=2.2.1 (from setfit==1.0.1->-r /opt/ml/model/code/requirements.txt (line 2)) Downloading sentence_transformers-2.7.0-py3-none-any.whl.metadata (11 kB)
2024-05-09T17:52:57.310Z    Collecting evaluate>=0.3.0 (from setfit==1.0.1->-r /opt/ml/model/code/requirements.txt (line 2)) Downloading evaluate-0.4.2-py3-none-any.whl.metadata (9.3 kB)
2024-05-09T17:52:57.310Z    Collecting huggingface-hub>=0.13.0 (from setfit==1.0.1->-r /opt/ml/model/code/requirements.txt (line 2)) Downloading huggingface_hub-0.23.0-py3-none-any.whl.metadata (12 kB)
2024-05-09T17:52:57.560Z    Requirement already satisfied: scikit-learn in /opt/conda/lib/python3.10/site-packages (from setfit==1.0.1->-r /opt/ml/model/code/requirements.txt (line 2)) (1.1.3)
2024-05-09T17:52:57.560Z    Requirement already satisfied: filelock in /opt/conda/lib/python3.10/site-packages (from transformers==4.37.2->-r /opt/ml/model/code/requirements.txt (line 3)) (3.13.1)
2024-05-09T17:52:57.560Z    Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.10/site-packages (from transformers==4.37.2->-r /opt/ml/model/code/requirements.txt (line 3)) (23.1)
2024-05-09T17:52:58.061Z    Requirement already satisfied: pyyaml>=5.1 in /opt/conda/lib/python3.10/site-packages (from transformers==4.37.2->-r /opt/ml/model/code/requirements.txt (line 3)) (6.0)
2024-05-09T17:52:58.061Z    Collecting regex!=2019.12.17 (from transformers==4.37.2->-r /opt/ml/model/code/requirements.txt (line 3)) Downloading regex-2024.4.28-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (40 kB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 40.8/40.8 kB 17.1 MB/s eta 0:00:00
2024-05-09T17:52:58.311Z    Requirement already satisfied: requests in /opt/conda/lib/python3.10/site-packages (from transformers==4.37.2->-r /opt/ml/model/code/requirements.txt (line 3)) (2.31.0)
2024-05-09T17:52:58.562Z    Collecting tokenizers<0.19,>=0.14 (from transformers==4.37.2->-r /opt/ml/model/code/requirements.txt (line 3)) Downloading tokenizers-0.15.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
2024-05-09T17:52:58.562Z    Collecting safetensors>=0.4.1 (from transformers==4.37.2->-r /opt/ml/model/code/requirements.txt (line 3)) Downloading safetensors-0.4.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)
2024-05-09T17:52:58.562Z    Requirement already satisfied: tqdm>=4.27 in /opt/conda/lib/python3.10/site-packages (from transformers==4.37.2->-r /opt/ml/model/code/requirements.txt (line 3)) (4.66.4)
2024-05-09T17:52:58.562Z    Requirement already satisfied: typing-extensions in /opt/conda/lib/python3.10/site-packages (from torch==2.1.0->-r /opt/ml/model/code/requirements.txt (line 4)) (4.9.0)
2024-05-09T17:52:58.563Z    Requirement already satisfied: sympy in /opt/conda/lib/python3.10/site-packages (from torch==2.1.0->-r /opt/ml/model/code/requirements.txt (line 4)) (1.12)
2024-05-09T17:52:58.563Z    Requirement already satisfied: networkx in /opt/conda/lib/python3.10/site-packages (from torch==2.1.0->-r /opt/ml/model/code/requirements.txt (line 4)) (3.2.1)
2024-05-09T17:52:58.563Z    Requirement already satisfied: jinja2 in /opt/conda/lib/python3.10/site-packages (from torch==2.1.0->-r /opt/ml/model/code/requirements.txt (line 4)) (3.1.4)
2024-05-09T17:52:58.563Z    Requirement already satisfied: fsspec in /opt/conda/lib/python3.10/site-packages (from torch==2.1.0->-r /opt/ml/model/code/requirements.txt (line 4)) (2023.12.2)
2024-05-09T17:52:58.814Z    Collecting coloredlogs (from optimum->-r /opt/ml/model/code/requirements.txt (line 5)) Downloading coloredlogs-15.0.1-py2.py3-none-any.whl.metadata (12 kB)
2024-05-09T17:52:58.814Z    INFO: pip is looking at multiple versions of torch-tensorrt to determine which version is compatible with other requirements. This could take a while.
2024-05-09T17:52:59.065Z    Collecting torch-tensorrt (from -r /opt/ml/model/code/requirements.txt (line 6)) Downloading torch_tensorrt-1.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB) Downloading torch-tensorrt-0.0.0.post1.tar.gz (9.0 kB) Preparing metadata (setup.py): started Preparing metadata (setup.py): finished with status 'error' error: subprocess-exited-with-error × python setup.py egg_info did not run successfully. │ exit code: 1 ╰─> [13 lines of output] Traceback (most recent call last): File "<string>", line 2, in <module> File "<pip-setuptools-caller>", line 34, in <module> File "/home/model-server/tmp/pip-install-ndpb_izf/torch-tensorrt_1eaee9fc2794472ca9b57c4ba02da88f/setup.py", line 125, in <module> raise RuntimeError(open("ERROR.txt", "r").read()) RuntimeError: ########################################################################################### The package you are trying to install is only a placeholder project on PyPI.org repository. To install Torch-TensorRT please run the following command: $ pip install torch-tensorrt -f https://github.com/NVIDIA/Torch-TensorRT/releases ########################################################################################### [end of output] note: This error originates from a subprocess, and is likely not a problem with pip.
2024-05-09T17:52:59.065Z    error: metadata-generation-failed
2024-05-09T17:52:59.065Z    × Encountered error while generating package metadata.
2024-05-09T17:52:59.065Z    ╰─> See above for output.
2024-05-09T17:52:59.065Z    note: This is an issue with the package mentioned above, not pip.
2024-05-09T17:52:59.316Z    hint: See above for details.
2024-05-09T17:52:59.316Z    2024-05-09 17:52:59,107 - sagemaker-inference - ERROR - failed to install required packages, exiting
2024-05-09T17:52:59.316Z    Traceback (most recent call last): File "/opt/conda/lib/python3.10/site-packages/sagemaker_inference/model_server.py", line 41, in _install_requirements subprocess.check_call(pip_install_cmd) File "/opt/conda/lib/python3.10/subprocess.py", line 369, in check_call raise CalledProcessError(retcode, cmd)
2024-05-09T17:52:59.316Z    subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-m', 'pip', 'install', '-r', '/opt/ml/model/code/requirements.txt']' returned non-zero exit status 1.
2024-05-09T17:52:59.316Z    During handling of the above exception, another exception occurred:
2024-05-09T17:52:59.316Z    Traceback (most recent call last): File "/usr/local/bin/dockerd-entrypoint.py", line 23, in <module> serving.main() File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/serving.py", line 38, in main _start_torchserve() File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 56, in wrapped_f return Retrying(*dargs, **dkw).call(f, *args, **kw) File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 257, in call return attempt.get(self._wrap_exception) File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 301, in get six.reraise(self.value[0], self.value[1], self.value[2]) File "/opt/conda/lib/python3.10/site-packages/six.py", line 719, in reraise raise value File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 251, in call attempt = Attempt(fn(*args, **kwargs), attempt_number, False) File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/serving.py", line 34, in _start_torchserve torchserve.start_torchserve(handler_service=HANDLER_SERVICE) File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/torchserve.py", line 79, in start_torchserve model_server._install_requirements() File "/opt/conda/lib/python3.10/site-packages/sagemaker_inference/model_server.py", line 44, in _install_requirements raise ValueError("failed to install required packages")
2024-05-09T17:53:01.977Z    ValueError: failed to install required packages
2024-05-09T17:53:02.072Z    Sagemaker TS environment variables have been set and will be used for single model endpoint.
2024-05-09T17:53:02.573Z    Collecting sagemaker-inference==1.10.1 (from -r /opt/ml/model/code/requirements.txt (line 1)) Using cached sagemaker_inference-1.10.1.tar.gz (23 kB) Preparing metadata (setup.py): started Preparing metadata (setup.py): finished with status 'done'
2024-05-09T17:53:02.573Z    Collecting setfit==1.0.1 (from -r /opt/ml/model/code/requirements.txt (line 2)) Using cached setfit-1.0.1-py3-none-any.whl.metadata (11 kB)
2024-05-09T17:53:02.573Z    Collecting transformers==4.37.2 (from -r /opt/ml/model/code/requirements.txt (line 3)) Using cached transformers-4.37.2-py3-none-any.whl.metadata (129 kB)
2024-05-09T17:53:02.573Z    Requirement already satisfied: torch==2.1.0 in /opt/conda/lib/python3.10/site-packages (from -r /opt/ml/model/code/requirements.txt (line 4)) (2.1.0+cu118)
2024-05-09T17:53:02.573Z    Collecting optimum (from -r /opt/ml/model/code/requirements.txt (line 5)) Using cached optimum-1.19.2-py3-none-any.whl.metadata (19 kB)
2024-05-09T17:53:02.573Z    Collecting torch-tensorrt (from -r /opt/ml/model/code/requirements.txt (line 6)) Using cached torch_tensorrt-1.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
2024-05-09T17:53:02.573Z    Requirement already satisfied: boto3 in /opt/conda/lib/python3.10/site-packages (from sagemaker-inference==1.10.1->-r /opt/ml/model/code/requirements.txt (line 1)) (1.28.60)
2024-05-09T17:53:02.573Z    Requirement already satisfied: numpy in /opt/conda/lib/python3.10/site-packages (from sagemaker-inference==1.10.1->-r /opt/ml/model/code/requirements.txt (line 1)) (1.24.4)
2024-05-09T17:53:02.573Z    Requirement already satisfied: six in /opt/conda/lib/python3.10/site-packages (from sagemaker-inference==1.10.1->-r /opt/ml/model/code/requirements.txt (line 1)) (1.16.0)
2024-05-09T17:53:02.573Z    Requirement already satisfied: psutil in /opt/conda/lib/python3.10/site-packages (from sagemaker-inference==1.10.1->-r /opt/ml/model/code/requirements.txt (line 1)) (5.9.5)
2024-05-09T17:53:02.573Z    Requirement already satisfied: retrying<1.4,>=1.3.3 in /opt/conda/lib/python3.10/site-packages (from sagemaker-inference==1.10.1->-r /opt/ml/model/code/requirements.txt (line 1)) (1.3.4)
2024-05-09T17:53:02.824Z    Requirement already satisfied: scipy in /opt/conda/lib/python3.10/site-packages (from sagemaker-inference==1.10.1->-r /opt/ml/model/code/requirements.txt (line 1)) (1.10.1)
2024-05-09T17:53:02.824Z    Collecting datasets>=2.3.0 (from setfit==1.0.1->-r /opt/ml/model/code/requirements.txt (line 2)) Using cached datasets-2.19.1-py3-none-any.whl.metadata (19 kB)
2024-05-09T17:53:02.824Z    Collecting sentence-transformers>=2.2.1 (from setfit==1.0.1->-r /opt/ml/model/code/requirements.txt (line 2)) Using cached sentence_transformers-2.7.0-py3-none-any.whl.metadata (11 kB)
2024-05-09T17:53:02.824Z    Collecting evaluate>=0.3.0 (from setfit==1.0.1->-r /opt/ml/model/code/requirements.txt (line 2)) Using cached evaluate-0.4.2-py3-none-any.whl.metadata (9.3 kB)
2024-05-09T17:53:02.824Z    Collecting huggingface-hub>=0.13.0 (from setfit==1.0.1->-r /opt/ml/model/code/requirements.txt (line 2)) Using cached huggingface_hub-0.23.0-py3-none-any.whl.metadata (12 kB)
2024-05-09T17:53:03.326Z    Requirement already satisfied: scikit-learn in /opt/conda/lib/python3.10/site-packages (from setfit==1.0.1->-r /opt/ml/model/code/requirements.txt (line 2)) (1.1.3)
2024-05-09T17:53:03.326Z    Requirement already satisfied: filelock in /opt/conda/lib/python3.10/site-packages (from transformers==4.37.2->-r /opt/ml/model/code/requirements.txt (line 3)) (3.13.1)
2024-05-09T17:53:03.326Z    Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.10/site-packages (from transformers==4.37.2->-r /opt/ml/model/code/requirements.txt (line 3)) (23.1)
2024-05-09T17:53:03.576Z    Requirement already satisfied: pyyaml>=5.1 in /opt/conda/lib/python3.10/site-packages (from transformers==4.37.2->-r /opt/ml/model/code/requirements.txt (line 3)) (6.0)
2024-05-09T17:53:03.576Z    Collecting regex!=2019.12.17 (from transformers==4.37.2->-r /opt/ml/model/code/requirements.txt (line 3)) Using cached regex-2024.4.28-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (40 kB)
2024-05-09T17:53:03.826Z    Requirement already satisfied: requests in /opt/conda/lib/python3.10/site-packages (from transformers==4.37.2->-r /opt/ml/model/code/requirements.txt (line 3)) (2.31.0)
2024-05-09T17:53:04.077Z    Collecting tokenizers<0.19,>=0.14 (from transformers==4.37.2->-r /opt/ml/model/code/requirements.txt (line 3)) Using cached tokenizers-0.15.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
2024-05-09T17:53:04.077Z    Collecting safetensors>=0.4.1 (from transformers==4.37.2->-r /opt/ml/model/code/requirements.txt (line 3)) Using cached safetensors-0.4.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)
2024-05-09T17:53:04.077Z    Requirement already satisfied: tqdm>=4.27 in /opt/conda/lib/python3.10/site-packages (from transformers==4.37.2->-r /opt/ml/model/code/requirements.txt (line 3)) (4.66.4)
2024-05-09T17:53:04.077Z    Requirement already satisfied: typing-extensions in /opt/conda/lib/python3.10/site-packages (from torch==2.1.0->-r /opt/ml/model/code/requirements.txt (line 4)) (4.9.0)
2024-05-09T17:53:04.077Z    Requirement already satisfied: sympy in /opt/conda/lib/python3.10/site-packages (from torch==2.1.0->-r /opt/ml/model/code/requirements.txt (line 4)) (1.12)
2024-05-09T17:53:04.077Z    Requirement already satisfied: networkx in /opt/conda/lib/python3.10/site-packages (from torch==2.1.0->-r /opt/ml/model/code/requirements.txt (line 4)) (3.2.1)
2024-05-09T17:53:04.077Z    Requirement already satisfied: jinja2 in /opt/conda/lib/python3.10/site-packages (from torch==2.1.0->-r /opt/ml/model/code/requirements.txt (line 4)) (3.1.4)
2024-05-09T17:53:04.077Z    Requirement already satisfied: fsspec in /opt/conda/lib/python3.10/site-packages (from torch==2.1.0->-r /opt/ml/model/code/requirements.txt (line 4)) (2023.12.2)
2024-05-09T17:53:04.328Z    Collecting coloredlogs (from optimum->-r /opt/ml/model/code/requirements.txt (line 5)) Using cached coloredlogs-15.0.1-py2.py3-none-any.whl.metadata (12 kB)
2024-05-09T17:53:04.328Z    INFO: pip is looking at multiple versions of torch-tensorrt to determine which version is compatible with other requirements. This could take a while.
2024-05-09T17:53:04.578Z    Collecting torch-tensorrt (from -r /opt/ml/model/code/requirements.txt (line 6)) Using cached torch_tensorrt-1.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB) Using cached torch-tensorrt-0.0.0.post1.tar.gz (9.0 kB) Preparing metadata (setup.py): started Preparing metadata (setup.py): finished with status 'error' error: subprocess-exited-with-error × python setup.py egg_info did not run successfully. │ exit code: 1 ╰─> [13 lines of output] Traceback (most recent call last): File "<string>", line 2, in <module> File "<pip-setuptools-caller>", line 34, in <module> File "/home/model-server/tmp/pip-install-ou8dudye/torch-tensorrt_f613c1ea02ee46eba6289ad76ccd02c4/setup.py", line 125, in <module> raise RuntimeError(open("ERROR.txt", "r").read()) RuntimeError: ########################################################################################### The package you are trying to install is only a placeholder project on PyPI.org repository. To install Torch-TensorRT please run the following command: $ pip install torch-tensorrt -f https://github.com/NVIDIA/Torch-TensorRT/releases ########################################################################################### [end of output] note: This error originates from a subprocess, and is likely not a problem with pip.
2024-05-09T17:53:04.578Z    error: metadata-generation-failed
2024-05-09T17:53:04.578Z    × Encountered error while generating package metadata.
2024-05-09T17:53:04.578Z    ╰─> See above for output.
2024-05-09T17:53:04.578Z    note: This is an issue with the package mentioned above, not pip.
2024-05-09T17:53:04.578Z    hint: See above for details.
2024-05-09T17:53:04.578Z    2024-05-09 17:53:04,566 - sagemaker-inference - ERROR - failed to install required packages, exiting
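The failure mode in the log above is simple to summarize: the serving container shells out to pip at startup, and any non-zero exit aborts serving. A minimal sketch of that step (the function name is mine; per the traceback, the real logic lives in `sagemaker_inference/model_server.py`):

```python
import subprocess
import sys

def install_requirements(path: str) -> None:
    """Install a requirements file the way the serving container does at
    startup: shell out to pip and turn any non-zero exit into ValueError."""
    cmd = [sys.executable, "-m", "pip", "install", "-r", path]
    try:
        subprocess.check_call(cmd)
    except subprocess.CalledProcessError:
        # Matches the "failed to install required packages" error in the log.
        raise ValueError("failed to install required packages")
```

So any package in requirements.txt that fails to resolve, like the torch-tensorrt placeholder here, takes the whole endpoint down at startup.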

code/requirements.txt:

sagemaker-inference==1.10.1
setfit==1.0.1
transformers==4.37.2
torch==2.1.0
optimum
torch-tensorrt
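Given the placeholder error in the log, one possible workaround (a sketch only; the `-f` index URL is taken from the error message itself, and pip honors `-f` lines inside requirements files) is to point pip at NVIDIA's release index directly in the requirements file:

```text
sagemaker-inference==1.10.1
setfit==1.0.1
transformers==4.37.2
torch==2.1.0
optimum
# Resolve torch-tensorrt against NVIDIA's release page instead of the
# PyPI placeholder sdist:
-f https://github.com/NVIDIA/Torch-TensorRT/releases
torch-tensorrt
```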

DLC image/dockerfile: 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.1-gpu-py310

Current behavior: an error occurs while installing torch-tensorrt

Expected behavior: no error

Additional context:

Can I extend the deep learning image for SageMaker as follows, push the image to AWS ECR, and use it to deploy my SageMaker inference endpoint? And how does the model artifact (code/inference.py, code/requirements.txt, the model, etc.) get copied into the Docker container?

FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.1-gpu-py310

RUN pip install torch-tensorrt -f https://github.com/NVIDIA/Torch-TensorRT/releases

I see there are two images. Can I use both for SageMaker, or only the second one?

FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.1-gpu-py310

vs.

FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker

Also, the torch-tensorrt 2.2.0 wheel is available here: https://pypi.org/project/torch-tensorrt/2.2.0/ - why can't pip find it?

cc @tejaschumbalkar @joaopcm1996

Also, TorchServe is already at version 0.10 - how can I use that version with 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.1-gpu-py310 or 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker? cc @sirutBuasai

sirutBuasai commented 6 months ago

Hi @geraldstanje, we have recently updated the TorchServe version to 0.11.0. Please pull the latest images to use it.

For TensorRT, we'll need reproduction steps to investigate. In the meantime, we suggest taking a look at the DJL TensorRT containers if you are interested.

For extending DLCs, you can do so as you outlined. Model artifacts are copied into the container at runtime by the Python SDK (which I am assuming is what you're using) through a docker run.

For the image tags: the two images you outlined are the same image even though the tags differ. Note, however, that 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.1-gpu-py310 is in us-west-2, while 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker is in us-east-1. If you want to see all available tags, you can find them in the GitHub release tags and in available_images.md.
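To make the region/tag distinction concrete, here is a tiny sketch of how these URIs decompose (the helper name is mine; the account ID, regions, and tags are the ones from this thread):

```python
def dlc_image_uri(account: str, region: str, repo: str, tag: str) -> str:
    """Assemble an ECR image URI: <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repo}:{tag}"

# The two URIs from the discussion differ only in region and tag spelling;
# both point at the same underlying image build:
uri_west = dlc_image_uri("763104351884", "us-west-2", "pytorch-inference",
                         "2.1-gpu-py310")
uri_east = dlc_image_uri("763104351884", "us-east-1", "pytorch-inference",
                         "2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker")
```

When deploying, pick the URI in the same region as your endpoint.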

geraldstanje commented 6 months ago

we have recently updated torchServe version to 0.11.0. Please pull the latest images to use them.

What's the name of that PyTorch image? e.g. 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.2.0-gpu-py310-cu118-ubuntu20.04-sagemaker

Is this what you are referring to? https://github.com/aws/deep-learning-containers/tree/master/pytorch/inference/docker/2.2/py3

For tensor-rt, we'll require a repro steps to do so. However, we suggest taking a look at DJL TensorRT containers if you would be interested in that.

Why switch to a different image? torch-tensorrt and tensorrt can be used with TorchServe...

sirutBuasai commented 6 months ago

Any supported PyTorch (PT 1.13, 2.1, 2.2) inference image would work; they all have TorchServe 0.11.0. Generally, you can pull images with the following tags:

2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker
2.1-gpu-py310
2.1.0-gpu-py310

These tags pull our latest release; they are moved to the newest image every time we release a patch.

However, you may see some tags such as

2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker-v1.8
2.1-gpu-py310-cu118-ubuntu20.04-sagemaker-v1
2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker-v1.8-2024-05-22-19-30-53

These tags represent specific patch releases, so using one of them pulls the exact image that was released on a particular date.
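As a rough heuristic only (this is my reading of the tag examples above, not an official naming rule), the pinned patch-release tags can be told apart from the floating ones by their trailing -vN suffix:

```python
import re

def is_pinned_dlc_tag(tag: str) -> bool:
    """True if the tag names a specific patch release: it ends in
    -v<major>[.<minor>], optionally followed by a release timestamp."""
    return re.search(r"-v\d+(\.\d+)?(-\d{4}(-\d{2}){5})?$", tag) is not None

# Floating tags (move on every patch release):
assert not is_pinned_dlc_tag("2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker")
# Pinned tags (fixed to one release):
assert is_pinned_dlc_tag("2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker-v1.8")
```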

Can you also confirm that the CUDA version matches for PyTorch 2.2 and is >= 11.8, which is also required by pytorch/TensorRT: https://github.com/pytorch/TensorRT/releases

Yes, our GPU inference image uses CUDA 11.8.

Can I extend the image and install torch-tensorrt 2.2 with this new image?

We don't expect any installation errors with TensorRT, but you're welcome to outline reproduction steps if you encounter issues, and we'll be happy to reproduce them and assist.

why switch to a different image? torch-tensorrt and tensorrt can be used with torchServe...

DJL containers offer TensorRT out of the box, while our regular DLCs do not. DJL containers can also be extended into your own custom containers in the same way. For more information, see the DJL containers documentation.

geraldstanje commented 6 months ago

@sirutBuasai are you also going to release a new pytorch-inference image with CUDA 12.x?

sirutBuasai commented 6 months ago

Not for PyTorch 2.1 and 2.2 inference.

However, we are working on PyTorch 2.3 inference with CUDA 12.1. Feel free to track this PR for when it will be released.

geraldstanje commented 6 months ago

@sirutBuasai any timeline for when PyTorch 2.3 inference with CUDA 12.1 will be available?

Are you also going to update the Triton inference image to CUDA 12.x soon?

sirutBuasai commented 6 months ago

We are aiming for 6/7 for PyTorch 2.3 Inference with CUDA 12.1.

Which triton image are you referring to?

geraldstanje commented 6 months ago

@sirutBuasai I mean the NVIDIA Triton Inference Server: https://github.com/aws/deep-learning-containers/blob/master/available_images.md#nvidia-triton-inference-containers-sm-support-only - can someone build Triton Inference Server release 24.05?

I don't see the nvidia-triton-inference-containers image in this GitHub repo... can you send me the link?

cc @nskool

sirutBuasai commented 6 months ago

@nskool Could you assist with triton image questions?

geraldstanje commented 5 months ago

@sirutBuasai - if you go to the following link, it says:

Dependencies
These are the following dependencies used to verify the testcases.
Torch-TensorRT can work with other versions, but the tests are not guaranteed to pass.

Bazel 5.2.0
Libtorch 2.4.0.dev (latest nightly) (built with CUDA 12.1)
CUDA 12.1
TensorRT 10.0.1.6

https://github.com/pytorch/TensorRT

I use torch-tensorrt 2.2.0 with the DLC 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.2.0-gpu-py310-cu118-ubuntu20.04-sagemaker-v1.10 and get this error:

predict_fn error: backend='torch_tensorrt' raised: TypeError: pybind11::init(): factory function returned nullptr

But when I run it on EC2 with CUDA, it works fine. It seems I cannot use CUDA 11 and need CUDA 12.x for torch-tensorrt 2.2.0...
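The suspected CUDA mismatch can be sanity-checked up front. The minimum versions below are assumptions pieced together from this thread and the pytorch/TensorRT release notes, so verify them before relying on this sketch:

```python
# Assumed minimum CUDA toolkit per torch-tensorrt release line; these values
# come from the discussion above and should be verified against the
# pytorch/TensorRT release notes.
REQUIRED_CUDA = {"1.4": "11.8", "2.2": "12.1"}

def cuda_compatible(tt_version: str, image_cuda: str) -> bool:
    """Check whether an image's CUDA toolkit meets the assumed minimum for
    a given torch-tensorrt release line; unknown lines fail closed."""
    def as_tuple(v: str):
        return tuple(int(x) for x in v.split("."))
    line = ".".join(tt_version.split(".")[:2])
    required = REQUIRED_CUDA.get(line)
    if required is None:
        return False  # unknown release line: flag for manual review
    return as_tuple(image_cuda) >= as_tuple(required)

# torch-tensorrt 2.2.0 on the cu118 image reproduces the reported failure:
print(cuda_compatible("2.2.0", "11.8"))  # False
```

Under these assumptions, the cu118 DLC cannot host torch-tensorrt 2.2.0, which matches the pybind11 nullptr error above.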

geraldstanje commented 5 months ago

Regarding the NVIDIA Triton Inference Server:

cc @nskool @sirutBuasai

sirutBuasai commented 5 months ago

For the TensorRT installation error, could you provide the following:

  1. The DLC used, or any Dockerfile artifact you've built on top of our DLC, if applicable.
  2. Steps to reproduce the error, including any installation commands or scripts used.