Open austinmw opened 2 years ago
I'm getting the same error. @austinmw, did you ever resolve this issue?
I had the same issue until I found this: https://github.com/aws-samples/amazon-sagemaker-pytorch-detectron2/issues/8
It seems you need to extend the DLC with the official torch and torchvision packages.
@salmenhsairi That is a known workaround, but really you shouldn't need to uninstall and reinstall torch.
@austinmw Unless there's another way to upgrade the DLC's existing (optimized) torch build, the reinstall is needed, since detectron2 requires the complete one.
@salmenhsairi There is not currently. My point is that it would be ideal for the SageMaker version of Torch to not be modified in a way that breaks compatibility with other libraries.
I'm using the huggingface container: 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:1.10-transformers4.17-gpu-py38-cu113-ubuntu20.04
I extended the Huggingface container with the following commands:
RUN pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
RUN python -m pip install detectron2 -f \
https://dl.fbaipublicfiles.com/detectron2/wheels/cu113/torch1.10/index.html
The interesting thing is that, for the Huggingface container, the install reports that the Torch and Torchvision packages are already up to date.
I got an error similar to Austin's.
RuntimeError:
undefined value has_torch_function_variadic:
File "/opt/conda/lib/python3.8/site-packages/torch/utils/smdebug.py", line 2962
>>> loss.backward()
"""
if has_torch_function_variadic(input, target, weight, pos_weight):
~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
return handle_torch_function(
binary_cross_entropy_with_logits,
'binary_cross_entropy_with_logits' is being compiled since it was called from 'sigmoid_focal_loss'
File "/opt/conda/lib/python3.8/site-packages/fvcore/nn/focal_loss.py", line 36
targets = targets.float()
p = torch.sigmoid(inputs)
ce_loss = F.binary_cross_entropy_with_logits(inputs, targets, reduction="none")
Guess I'll try the PyTorch container instead...
@d-v-dlee I don't think it would show as out of date; the version still matches. The problem is that the build has been modified. You could uninstall and reinstall torch, though in your case there are already prebuilt Hugging Face containers that you can use.
@d-v-dlee I am also using a Hugging Face container, and this image worked fine for me on an AWS ml.g4dn.xlarge instance. Try downloading torch from this link instead: https://download.pytorch.org/whl/torch_stable.html
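Since the version string alone does not reveal a modified build, one rough check is to look for the extra `smdebug.py` file that the traceback above points to under `torch/utils/`. The following is a minimal sketch, not an official diagnostic: the helper name `find_smdebug_patch` is hypothetical, and it assumes the presence of that file is a reasonable sign of the SageMaker-modified build.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def find_smdebug_patch(package: str = "torch") -> Optional[Path]:
    """Return the path of <package>/utils/smdebug.py if it exists, else None.

    Hypothetical helper: the file location is taken from the traceback in
    this thread; treating it as a marker of a modified build is an assumption.
    """
    # find_spec locates a top-level package without importing it.
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None
    for root in spec.submodule_search_locations:
        candidate = Path(root) / "utils" / "smdebug.py"
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    hit = find_smdebug_patch()
    if hit is None:
        print("No torch/utils/smdebug.py found (or torch is not installed).")
    else:
        print(f"Found {hit}; this torch build appears to include smdebug hooks.")
```

Running this inside the container before and after reinstalling torch should show whether the reinstall actually replaced the modified package.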
FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:1.9.1-transformers4.12.3-gpu-py38-cu111-ubuntu20.04
RUN pip uninstall torch -y
RUN pip uninstall torchvision -y
############# Detectron2 pre-built binaries Pytorch default install ############
RUN pip install --no-cache-dir --upgrade torch==1.9.1+cu111 torchvision==0.10.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
############# Detectron2 section ##############
RUN pip install \
--no-cache-dir pycocotools~=2.0.0 \
--no-cache-dir https://dl.fbaipublicfiles.com/detectron2/wheels/cu111/torch1.9/detectron2-0.6%2Bcu111-cp38-cp38-linux_x86_64.whl
ENV FORCE_CUDA="1"
# Build D2 only for Volta architecture - V100 chips (ml.p3 AWS instances)
# ENV TORCH_CUDA_ARCH_LIST="Volta"
# Set a fixed model cache directory. Detectron2 requirement
ENV FVCORE_CACHE="/tmp"
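The detectron2 wheel has to match both the torch version and the CUDA build, which is easy to get wrong when switching containers. As a rough sketch, the index URL can be derived from a version string such as `1.9.1+cu111`. The helper name `detectron2_wheel_index` is hypothetical, and the URL layout is inferred only from the two detectron2 links quoted in this thread; a wheel may not exist for every torch/CUDA combination.

```python
def detectron2_wheel_index(torch_version: str) -> str:
    """Build the detectron2 wheel index URL for a torch version string.

    Hypothetical helper. Expects the "<major>.<minor>.<patch>+cu<NNN>" form
    that torch.__version__ reports for CUDA builds installed from
    download.pytorch.org.
    """
    base, _, local = torch_version.partition("+")
    if not local.startswith("cu"):
        raise ValueError(f"no CUDA tag in torch version: {torch_version!r}")
    parts = base.split(".")
    if len(parts) < 2:
        raise ValueError(f"unexpected torch version: {torch_version!r}")
    major, minor = parts[0], parts[1]
    # URL layout assumed from the links posted earlier in this thread.
    return (
        "https://dl.fbaipublicfiles.com/detectron2/wheels/"
        f"{local}/torch{major}.{minor}/index.html"
    )


if __name__ == "__main__":
    print(detectron2_wheel_index("1.9.1+cu111"))
```

Inside the container, passing `torch.__version__` to the helper gives the value to use after `-f` in the `pip install detectron2` command.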
Instead of uninstalling and reinstalling torch and torchvision, setting debugger_hook_config to False resolved the smdebug error for me.
This is with the latest Huggingface container (PyTorch 1.10 and CUDA 11.3):
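from sagemaker.huggingface import HuggingFace

# base_image_uri, role, and hyperparameters are defined elsewhere in the notebook.
huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    instance_type='ml.p3.2xlarge',
    image_uri=base_image_uri,
    instance_count=1,
    role=role,
    transformers_version='4.17',
    pytorch_version='1.10',
    py_version='py38',
    debugger_hook_config=False,  # disables the smdebug hook that triggers the error
    volume_size=50,
    hyperparameters=hyperparameters,
)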
Checklist
Concise Description:
Detectron2 errors when installed on top of the pytorch-training container. It appears to be related to smdebug.
How to reproduce:
DLC image/dockerfile:
763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:1.10.2-gpu-py38-cu113-ubuntu20.04-sagemaker
Current behavior:
Traceback:
Expected behavior:
No error on import