aws / deep-learning-containers

AWS Deep Learning Containers are pre-built Docker images that make it easier to run popular deep learning frameworks and tools on AWS.
https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/what-is-dlc.html

[bug] Detectron2 errors when installing on PyTorch DLC #1782

Open austinmw opened 2 years ago

austinmw commented 2 years ago


Concise Description:

Detectron2 errors when installed on top of the pytorch-training container; the failure appears to be related to smdebug.

How to reproduce:

> nvidia-docker run -it 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:1.10.2-gpu-py38-cu113-ubuntu20.04-sagemaker
> pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu113/torch1.10/index.html
> python -c "from detectron2 import model_zoo"

DLC image/dockerfile:

763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:1.10.2-gpu-py38-cu113-ubuntu20.04-sagemaker

Current behavior:

Traceback:

root@fe0954d71a8e:/# python -c "from detectron2 import model_zoo"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/lib/python3.8/site-packages/detectron2/model_zoo/__init__.py", line 8, in <module>
    from .model_zoo import get, get_config_file, get_checkpoint_url, get_config
  File "/opt/conda/lib/python3.8/site-packages/detectron2/model_zoo/model_zoo.py", line 9, in <module>
    from detectron2.modeling import build_model
  File "/opt/conda/lib/python3.8/site-packages/detectron2/modeling/__init__.py", line 2, in <module>
    from detectron2.layers import ShapeSpec
  File "/opt/conda/lib/python3.8/site-packages/detectron2/layers/__init__.py", line 2, in <module>
    from .batch_norm import FrozenBatchNorm2d, get_norm, NaiveSyncBatchNorm, CycleBatchNormList
  File "/opt/conda/lib/python3.8/site-packages/detectron2/layers/batch_norm.py", line 4, in <module>
    from fvcore.nn.distributed import differentiable_all_reduce
  File "/opt/conda/lib/python3.8/site-packages/fvcore/nn/__init__.py", line 4, in <module>
    from .focal_loss import (
  File "/opt/conda/lib/python3.8/site-packages/fvcore/nn/focal_loss.py", line 52, in <module>
    sigmoid_focal_loss_jit: "torch.jit.ScriptModule" = torch.jit.script(sigmoid_focal_loss)
  File "/opt/conda/lib/python3.8/site-packages/torch/jit/_script.py", line 1310, in script
    fn = torch._C._jit_script_compile(
  File "/opt/conda/lib/python3.8/site-packages/torch/jit/_recursive.py", line 838, in try_compile_fn
    return torch.jit.script(fn, _rcb=rcb)
  File "/opt/conda/lib/python3.8/site-packages/torch/jit/_script.py", line 1310, in script
    fn = torch._C._jit_script_compile(
RuntimeError: 
undefined value has_torch_function_variadic:
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/smdebug.py", line 2962
         >>> loss.backward()
    """
    if has_torch_function_variadic(input, target, weight, pos_weight):
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        return handle_torch_function(
            binary_cross_entropy_with_logits,
'binary_cross_entropy_with_logits' is being compiled since it was called from 'sigmoid_focal_loss'
  File "/opt/conda/lib/python3.8/site-packages/fvcore/nn/focal_loss.py", line 36
    targets = targets.float()
    p = torch.sigmoid(inputs)
    ce_loss = F.binary_cross_entropy_with_logits(inputs, targets, reduction="none")
              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    p_t = p * targets + (1 - p) * (1 - targets)
    loss = ce_loss * ((1 - p_t) ** gamma)
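
Based on the traceback, the failure doesn't look specific to detectron2: scripting any function that calls F.binary_cross_entropy_with_logits should hit the same undefined has_torch_function_variadic inside the patched torch/utils/smdebug.py. A minimal sketch to confirm that (an assumption drawn from the traceback, run inside the same image):

import torch
import torch.nn.functional as F

def bce(inputs: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    return F.binary_cross_entropy_with_logits(inputs, targets, reduction="none")

# On the affected image this is expected to raise the same
# "undefined value has_torch_function_variadic" RuntimeError, because
# TorchScript recompiles the smdebug-patched function from its source.
torch.jit.script(bce)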

Expected behavior:

No error on import

d-v-dlee commented 2 years ago

I'm getting the same error. @austinmw did you ever resolve this issue?

salmenhsairi commented 2 years ago

I had the same issue until I found https://github.com/aws-samples/amazon-sagemaker-pytorch-detectron2/issues/8. It seems you need to extend the DLC with the official torch and torchvision packages.

austinmw commented 2 years ago

@salmenhsairi That is a known workaround, but really you shouldn't need to uninstall and reinstall torch.

salmenhsairi commented 2 years ago

@austinmw Unless there's another way to upgrade the DLC's existing, optimized torch build, there isn't much choice, since detectron2 requires the complete official package.

austinmw commented 2 years ago

@salmenhsairi There is not currently. My point is that it would be ideal for the SageMaker version of Torch to not be modified in a way that breaks compatibility with other libraries.

d-v-dlee commented 2 years ago

I'm using the huggingface container: 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:1.10-transformers4.17-gpu-py38-cu113-ubuntu20.04

I extended the Huggingface container with the following commands:

RUN pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

RUN python -m pip install detectron2 -f \
  https://dl.fbaipublicfiles.com/detectron2/wheels/cu113/torch1.10/index.html

The interesting thing is that in the Hugging Face container, pip reports that the torch and torchvision packages are already up to date.

[screenshot: pip output showing the torch and torchvision requirements already satisfied]

I got a similar error to Austin's:

RuntimeError: 
undefined value has_torch_function_variadic:
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/smdebug.py", line 2962
         >>> loss.backward()
    """
    if has_torch_function_variadic(input, target, weight, pos_weight):
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        return handle_torch_function(
            binary_cross_entropy_with_logits,
'binary_cross_entropy_with_logits' is being compiled since it was called from 'sigmoid_focal_loss'
  File "/opt/conda/lib/python3.8/site-packages/fvcore/nn/focal_loss.py", line 36
    targets = targets.float()
    p = torch.sigmoid(inputs)
    ce_loss = F.binary_cross_entropy_with_logits(inputs, targets, reduction="none")

Guess I'll try the PyTorch container instead...

austinmw commented 2 years ago

@d-v-dlee I don't think it would show as out of date; the version still matches. The problem is that it's been modified. You could uninstall and reinstall torch, though in your case there are already prebuilt Hugging Face containers you can use.
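
For what it's worth, one way to tell the two builds apart, since the version string alone won't show it (a sketch based on the smdebug.py path in the traceback above):

import os
import torch

# The stock PyTorch wheel does not ship torch/utils/smdebug.py; the SageMaker
# build does (see the traceback), so its presence flags the modified install
# even when torch.__version__ matches the official release.
patch_path = os.path.join(os.path.dirname(torch.__file__), "utils", "smdebug.py")
print(torch.__version__, "smdebug-patched:", os.path.exists(patch_path))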

salmenhsairi commented 2 years ago

@d-v-dlee I'm also using a Hugging Face container, and this image worked fine for me on an AWS ml.g4dn.xlarge instance. Try downloading torch from this link instead: https://download.pytorch.org/whl/torch_stable.html

FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:1.9.1-transformers4.12.3-gpu-py38-cu111-ubuntu20.04

RUN pip uninstall torch -y
RUN pip uninstall torchvision -y

############# Detectron2 pre-built binaries Pytorch default install ############
RUN pip install --no-cache-dir --upgrade torch==1.9.1+cu111 torchvision==0.10.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html

############# Detectron2 section ##############
RUN pip install \
   --no-cache-dir pycocotools~=2.0.0 \
   --no-cache-dir https://dl.fbaipublicfiles.com/detectron2/wheels/cu111/torch1.9/detectron2-0.6%2Bcu111-cp38-cp38-linux_x86_64.whl

ENV FORCE_CUDA="1"
# Build D2 only for Volta architecture - V100 chips (ml.p3 AWS instances)
# ENV TORCH_CUDA_ARCH_LIST="Volta"

# Set a fixed model cache directory. Detectron2 requirement
ENV FVCORE_CACHE="/tmp"
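
Once that extended image is built and pushed to ECR, it can be passed to a SageMaker estimator as the image_uri. A rough sketch (the repository name and tag are placeholders; role is an existing SageMaker execution role):

from sagemaker.pytorch import PyTorch

# Hypothetical ECR repository/tag for the image built from the Dockerfile above
custom_image = '<account-id>.dkr.ecr.us-west-2.amazonaws.com/detectron2-dlc:latest'

estimator = PyTorch(entry_point='train.py',
                    source_dir='./scripts',
                    image_uri=custom_image,  # framework_version/py_version aren't needed when image_uri is set
                    instance_type='ml.g4dn.xlarge',
                    instance_count=1,
                    role=role)
estimator.fit()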

d-v-dlee commented 2 years ago

Instead of uninstalling and reinstalling torch and torchvision, setting debugger_hook_config to False resolved the smdebug error for me.

This is with the latest Hugging Face container (PyTorch 1.10 and CUDA 11.3):

from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(entry_point='train.py',
                                    source_dir='./scripts',
                                    instance_type='ml.p3.2xlarge',
                                    image_uri=base_image_uri,
                                    instance_count=1,
                                    role=role,
                                    transformers_version='4.17',
                                    pytorch_version='1.10',
                                    py_version='py38',
                                    debugger_hook_config=False,
                                    volume_size=50,
                                    hyperparameters=hyperparameters)