aws / sagemaker-pytorch-training-toolkit

Toolkit for running PyTorch training scripts on SageMaker. Dockerfiles used for building SageMaker PyTorch Containers are at https://github.com/aws/deep-learning-containers.
Apache License 2.0

Error importing torchaudio #226

Open bbalaji-ucsd opened 3 years ago

bbalaji-ucsd commented 3 years ago

I'm trying to install torchaudio inside the PyTorch container and am running into the error below. Online forums indicate that multiple installed torch versions or CUDA issues can lead to this error. I tried installing a torchaudio version that is compatible with the existing torch version (1.6.0) in the container, but it failed with the same error.

Traceback (most recent call last):
  File "train_model.py", line 32, in <module>
    import torchaudio
  File "/opt/conda/lib/python3.6/site-packages/torchaudio/__init__.py", line 1, in <module>
    from . import extension
  File "/opt/conda/lib/python3.6/site-packages/torchaudio/extension/__init__.py", line 5, in <module>
    _init_extension()
  File "/opt/conda/lib/python3.6/site-packages/torchaudio/extension/extension.py", line 12, in _init_extension
    _init_script_module(ext)
  File "/opt/conda/lib/python3.6/site-packages/torchaudio/extension/extension.py", line 19, in _init_script_module
    torch.classes.load_library(path)
  File "/opt/conda/lib/python3.6/site-packages/torch/_classes.py", line 46, in load_library
    torch.ops.load_library(path)
  File "/opt/conda/lib/python3.6/site-packages/torch/_ops.py", line 105, in load_library
    ctypes.CDLL(path)
  File "/opt/conda/lib/python3.6/ctypes/__init__.py", line 348, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /opt/conda/lib/python3.6/site-packages/torchaudio/_torchaudio.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKSs
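
One detail in the missing symbol may help narrow this down. In the Itanium C++ name mangling scheme, the trailing `Ss` is the abbreviation for the pre-C++11-ABI `std::string`, so the symbol reads as `torch::jit::parseSchemaOrName(std::string const&)`. A library built with the C++11 ABI would instead reference a name containing `St7__cxx11`. The sketch below is a rough heuristic for telling the two apart from a mangled name; `string_abi` is a hypothetical helper, not part of torch or torchaudio, and it is a hint rather than a confirmed diagnosis. Its result can be compared against `torch.compiled_with_cxx11_abi()` inside the container.

```python
import re


def string_abi(mangled: str) -> str:
    """Guess which std::string ABI a mangled C++ symbol refers to.

    Heuristic only: it looks for the two spellings of std::string that
    the Itanium mangling scheme produces, and nothing else.
    """
    # The C++11 ABI places std::string inside the std::__cxx11 namespace.
    if "St7__cxx11" in mangled:
        return "cxx11"
    # The old ABI uses the two-letter substitution "Ss" for std::string.
    # Require a qualifier or delimiter before it to avoid stray matches.
    if re.search(r"(?:^|[ERKP_])Ss", mangled):
        return "pre-cxx11"
    return "unknown"


if __name__ == "__main__":
    symbol = "_ZN5torch3jit17parseSchemaOrNameERKSs"
    print(symbol, "->", string_abi(symbol))
```

If the torchaudio binary expects the pre-C++11 symbol while the container's torch reports `True` from `torch.compiled_with_cxx11_abi()`, the two libraries were built with different ABI settings and cannot be linked together.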

Roshrini commented 3 years ago

@bbalaji-ucsd Which PyTorch container (PT version) did you use for this? What is the torchvision version?

bbalaji-ucsd commented 3 years ago

@Roshrini I tried both the 1.7 and 1.8 containers and neither worked. I'm able to make my custom Docker image work with 1.6, though, so I'm using that container for my current experiments.

FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:1.6.0-gpu-py36-cu101-ubuntu16.04
# FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:1.7.1-gpu-py36-cu110-ubuntu18.04
# FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:1.8.1-gpu-py36-cu111-ubuntu18.04-v1.5

RUN pip install -U pip
RUN pip install -U torch
RUN pip install -U torchvision
# torchaudio doesn't work with the 1.7.1 or 1.8.1 containers
RUN pip install torchaudio
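
The Dockerfile above installs torchaudio without a version pin, while each torchaudio release is built against one specific torch release. As a minimal sketch, the helper below maps an installed torch version to the torchaudio release paired with it. `matching_torchaudio` and its table are illustrative and cover only the versions mentioned in this thread and their immediate neighbours. Pinning alone may not resolve the error if the container's torch build differs from the PyPI wheel that torchaudio was compiled against.

```python
from typing import Optional

# torch release -> torchaudio release built against it
# (only the versions relevant to this thread; not an exhaustive table).
TORCHAUDIO_FOR_TORCH = {
    "1.6.0": "0.6.0",
    "1.7.0": "0.7.0",
    "1.7.1": "0.7.2",
    "1.8.0": "0.8.0",
    "1.8.1": "0.8.1",
}


def matching_torchaudio(torch_version: str) -> Optional[str]:
    """Return a pip requirement for torchaudio, or None if unknown."""
    # Drop a local build suffix such as "+cu110" before the lookup.
    base = torch_version.split("+")[0]
    pin = TORCHAUDIO_FOR_TORCH.get(base)
    return f"torchaudio=={pin}" if pin else None


if __name__ == "__main__":
    for version in ("1.6.0", "1.7.1+cu110", "1.8.1"):
        print(version, "->", matching_torchaudio(version))
```

Inside a container, the argument would come from `torch.__version__`, and the returned string could be passed to `pip install` in place of the bare `torchaudio`.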