RuntimeError: cannot register a hook on a tensor that doesn't require gradient #6

Closed ChengYuChuan closed 3 months ago

ChengYuChuan commented 3 months ago

Hello @LetiP,

It's me again :P Thank you for your patience and time.

My GPU setup: 4x Nvidia GTX 1080 Ti (Pascal, 11 GB memory) in a server with 24 cores / 48 threads and 256 GB of memory.

Here are my settings at the beginning of mm-shap_albef_dataset.py:

num_samples = "all"  # "all" or number
if num_samples != "all":
    num_samples = int(num_samples)
checkp = "mscoco"  # refcoco, mscoco, vqa, flickr30k
write_res = "yes"  # "yes" or "no"
task = "image_sentence_alignment"  # image_sentence_alignment, vqa, gqa
other_tasks_than_valse = ['mscoco', 'vqa', 'gqa', 'gqa_balanced', 'nlvr2']
use_cuda = True

DATA = {
    "existence": ["/home/students/cheng/MM-SHAP/visual7w/images",

I googled for solutions to this issue, and it is usually related to:

However, neither of these two causes sounds like my case. Have you encountered a similar problem?

Here is the error output:

Argument interpolation should be of type InterpolationMode instead of int. Please, use InterpolationMode enum.

  0%|          | 0/534 [00:00<?, ?it/s]
  0%|          | 0/534 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "mm-shap_albef_dataset.py", line 306, in <module>
    shap_values = explainer(X)
  File "/home/students/cheng/MM-SHAP/shap/explainers/_permutation.py", line 62, in __call__
    batch_size=batch_size, outputs=outputs, silent=silent
  File "/home/students/cheng/MM-SHAP/shap/explainers/_permutation.py", line 76, in __call__
    outputs=outputs, silent=silent
  File "/home/students/cheng/MM-SHAP/shap/explainers/_explainer.py", line 260, in __call__
    batch_size=batch_size, outputs=outputs, silent=silent, **kwargs
  File "/home/students/cheng/MM-SHAP/shap/explainers/_permutation.py", line 134, in explain_row
    outputs = fm(masks, zero_index=0, batch_size=batch_size)
  File "/home/students/cheng/MM-SHAP/shap/utils/_masked_model.py", line 65, in __call__
    return self._full_masking_call(full_masks, zero_index=zero_index, batch_size=batch_size)
  File "/home/students/cheng/MM-SHAP/shap/utils/_masked_model.py", line 141, in _full_masking_call
    outputs = self.model(*joined_masked_inputs)
  File "/home/students/cheng/MM-SHAP/shap/models/_model.py", line 21, in __call__
    return np.array(self.inner_model(*args))
  File "mm-shap_albef_dataset.py", line 184, in get_model_prediction
  File "/home/students/cheng/anaconda3/envs/shap/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "mm-shap_albef_dataset.py", line 92, in forward
  File "/home/students/cheng/anaconda3/envs/shap/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/students/cheng/MM-SHAP/ALBEF/models/xbert.py", line 1067, in forward
  File "/home/students/cheng/anaconda3/envs/shap/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/students/cheng/MM-SHAP/ALBEF/models/xbert.py", line 601, in forward
  File "/home/students/cheng/anaconda3/envs/shap/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/students/cheng/MM-SHAP/ALBEF/models/xbert.py", line 504, in forward
  File "/home/students/cheng/anaconda3/envs/shap/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/students/cheng/MM-SHAP/ALBEF/models/xbert.py", line 407, in forward
  File "/home/students/cheng/anaconda3/envs/shap/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/students/cheng/MM-SHAP/ALBEF/models/xbert.py", line 329, in forward
  File "/home/students/cheng/anaconda3/envs/shap/lib/python3.6/site-packages/torch/_tensor.py", line 289, in register_hook
    raise RuntimeError("cannot register a hook on a tensor that "
RuntimeError: cannot register a hook on a tensor that doesn't require gradient
srun: error: gpu08: task 0: Exited with exit code 1

LetiP commented 3 months ago

Hi @ChengYuChuan , I am sorry you are running into hardware problems again! I did not encounter this issue, but by looking at your hardware specs (GTX 1080 Ti) and the date of the ALBEF model publication, I am wondering whether you have the latest NVIDIA drivers. What driver version does it say when you run nvidia-smi?

I am a bit confused about the issue, because your script seems to pass line 275, which is great, meaning you can now run a model inference! 🥳
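For context, the error message itself can be reproduced without any GPU. The sketch below shows that `Tensor.register_hook` raises this `RuntimeError` when the tensor was computed with gradient tracking disabled (for example inside `torch.no_grad()`). Whether that is what happens at line 329 of xbert.py is an assumption, since the traceback does not show that source line. The helper names `try_register_hook` and `register_hook_if_tracked` are made up for illustration.

```python
import torch


def try_register_hook(tensor):
    """Return True if a gradient hook could be registered, False otherwise."""
    try:
        tensor.register_hook(lambda grad: grad)
        return True
    except RuntimeError:
        # Raised when tensor.requires_grad is False.
        return False


def register_hook_if_tracked(tensor, hook):
    """Hypothetical guard: only register the hook when autograd tracks the tensor."""
    if tensor.requires_grad:
        return tensor.register_hook(hook)
    return None


x = torch.ones(3, requires_grad=True)
tracked = x * 2              # part of the autograd graph
with torch.no_grad():
    untracked = x * 2        # computed with gradient tracking disabled

hook_on_tracked = try_register_hook(tracked)
hook_on_untracked = try_register_hook(untracked)
guarded = register_hook_if_tracked(untracked, lambda grad: grad)

print("tracked:", hook_on_tracked)
print("untracked:", hook_on_untracked)
```

If the inference call runs without gradient tracking, a guard of this kind around the hook registration would be one possible workaround. It would also mean that no attention gradients get saved for that call.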

ChengYuChuan commented 3 months ago

Hello @LetiP ,

Thank you for reviewing the issue.

Here is the output of the nvidia-smi command:


| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce GTX 1080 Ti     Off |   00000000:5E:00.0 Off |                  N/A |
| 29%   19C    P8              8W /  250W |       4MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
|   1  NVIDIA GeForce GTX 1080 Ti     Off |   00000000:86:00.0 Off |                  N/A |
| 48%   63C    P2            201W /  250W |    7650MiB /  11264MiB |     98%      Default |
|                                         |                        |                  N/A |
|   2  NVIDIA GeForce GTX 1080 Ti     Off |   00000000:AF:00.0 Off |                  N/A |
| 29%   21C    P8              7W /  250W |       2MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |

| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|    1   N/A  N/A    541503      C   python                                       7646MiB |


| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce GTX 1080 Ti     Off |   00000000:3B:00.0 Off |                  N/A |
| 25%   28C    P8             11W /  250W |       2MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
|   1  NVIDIA GeForce GTX 1080 Ti     Off |   00000000:5E:00.0 Off |                  N/A |
| 25%   22C    P8             11W /  250W |       2MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
|   2  NVIDIA GeForce GTX 1080 Ti     Off |   00000000:86:00.0 Off |                  N/A |
| 25%   21C    P8             12W /  250W |       2MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
|   3  NVIDIA GeForce GTX 1080 Ti     Off |   00000000:AF:00.0 Off |                  N/A |
| 25%   21C    P8             11W /  250W |       2MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |

| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|  No running processes found                                                             |

LetiP commented 3 months ago

Hi, this looks good. Then the next thing is to ensure that the installed pytorch version matches the cuda version. https://pytorch.org/get-started/locally/

conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
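One way to see which CUDA toolkit the installed PyTorch build targets is to ask PyTorch directly. This is a minimal sketch using `torch.__version__`, `torch.version.cuda`, and `torch.cuda.is_available()`; the helper name `describe_torch_build` is hypothetical.

```python
import torch


def describe_torch_build():
    """Collect the version facts relevant to a PyTorch/CUDA mismatch."""
    return {
        "torch": torch.__version__,
        # CUDA toolkit the binary was built against; None for CPU-only builds.
        "built_for_cuda": torch.version.cuda,
        # True only if a usable GPU and driver are visible to this process.
        "cuda_available": torch.cuda.is_available(),
    }


info = describe_torch_build()
for key, value in info.items():
    print(f"{key}: {value}")
```

Running this inside the same conda environment (and on the same compute node) as the failing script gives values that can be compared with the nvidia-smi header.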

ChengYuChuan commented 3 months ago

hello @LetiP

Initially, I set up the environment exactly as specified in environment.yml, using the command conda env create -f environment.yml. I checked my installed versions of these packages against the ones in environment.yml.

Now I have higher versions than the ones in environment.yml:

torchaudio 0.10.2 py36_cu111 pytorch
torchvision 0.11.3 py36_cu111 pytorch

My conda list output is below:

LetiP commented 3 months ago

It looks like your installation is with CUDA 11 and not 12 (it says py36_cu111). This might be the issue. When I was conducting the project, I was using CUDA 11 because CUDA 12 did not exist back then. Now your cards run with CUDA 12, but your PyTorch installation uses CUDA 11. Try moving away from the CUDA and PyTorch versions I used back then, install PyTorch with CUDA 12, and see if it helps. https://pytorch.org/get-started/locally/

conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
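The comparison described here can be sketched with the two strings that appear in this thread: the conda build string `py36_cu111` and the nvidia-smi header line. The function names are hypothetical, and the parsing assumes the usual `cuXYZ` convention in which the last digit is the minor version. Note that the version nvidia-smi prints is the newest CUDA runtime the driver supports, so a lower build version is not by itself proof of a problem.

```python
import re


def cuda_from_build_string(build):
    """Extract (major, minor) from a conda build string such as 'py36_cu111'."""
    match = re.search(r"cu(\d{2,})", build)
    if match is None:
        return None
    digits = match.group(1)
    # Assumed convention: last digit is the minor version, e.g. '111' -> 11.1.
    return int(digits[:-1]), int(digits[-1])


def cuda_from_smi_header(header):
    """Extract (major, minor) from the 'CUDA Version: X.Y' field of nvidia-smi."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", header)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


build_cuda = cuda_from_build_string("py36_cu111")
driver_cuda = cuda_from_smi_header(
    "| NVIDIA-SMI 550.54.14   Driver Version: 550.54.14   CUDA Version: 12.4 |"
)
same_major = build_cuda[0] == driver_cuda[0]

print("build targets CUDA", build_cuda)
print("driver supports up to CUDA", driver_cuda)
print("same major version:", same_major)
```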

ChengYuChuan commented 3 months ago

Hmm, I tried an environment with Python 3.8, torch 2.2, and torchvision 0.17, but it still shows the same problem...

I would like to try mm-shap_lxmert_dataset.py now and check whether it happens again.

The error output:

RuntimeError: module compiled against API version 0xe but this version of numpy is 0xd

  0%|          | 0/534 [00:00<?, ?it/s]
  0%|          | 0/534 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "mm-shap_albef_dataset.py", line 304, in <module>
    shap_values = explainer(X)
  File "/home/students/cheng/MM-SHAP/shap/explainers/_permutation.py", line 60, in __call__
    return super().__call__(
  File "/home/students/cheng/MM-SHAP/shap/explainers/_permutation.py", line 74, in __call__
    return super().__call__(
  File "/home/students/cheng/MM-SHAP/shap/explainers/_explainer.py", line 258, in __call__
    row_result = self.explain_row(
  File "/home/students/cheng/MM-SHAP/shap/explainers/_permutation.py", line 134, in explain_row
    outputs = fm(masks, zero_index=0, batch_size=batch_size)
  File "/home/students/cheng/MM-SHAP/shap/utils/_masked_model.py", line 65, in __call__
    return self._full_masking_call(full_masks, zero_index=zero_index, batch_size=batch_size)
  File "/home/students/cheng/MM-SHAP/shap/utils/_masked_model.py", line 141, in _full_masking_call
    outputs = self.model(*joined_masked_inputs)
  File "/home/students/cheng/MM-SHAP/shap/models/_model.py", line 21, in __call__
    return np.array(self.inner_model(*args))
  File "mm-shap_albef_dataset.py", line 180, in get_model_prediction
    outputs = model(masked_image.cuda(),
  File "/home/students/cheng/anaconda3/envs/shap38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/students/cheng/anaconda3/envs/shap38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "mm-shap_albef_dataset.py", line 85, in forward
    output = self.text_encoder(text.input_ids,
  File "/home/students/cheng/anaconda3/envs/shap38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/students/cheng/anaconda3/envs/shap38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/students/cheng/MM-SHAP/ALBEF/models/xbert.py", line 1056, in forward
    encoder_outputs = self.encoder(
  File "/home/students/cheng/anaconda3/envs/shap38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/students/cheng/anaconda3/envs/shap38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/students/cheng/MM-SHAP/ALBEF/models/xbert.py", line 594, in forward
    layer_outputs = layer_module(
  File "/home/students/cheng/anaconda3/envs/shap38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/students/cheng/anaconda3/envs/shap38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/students/cheng/MM-SHAP/ALBEF/models/xbert.py", line 498, in forward
    cross_attention_outputs = self.crossattention(
  File "/home/students/cheng/anaconda3/envs/shap38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/students/cheng/anaconda3/envs/shap38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/students/cheng/MM-SHAP/ALBEF/models/xbert.py", line 400, in forward
    self_outputs = self.self(
  File "/home/students/cheng/anaconda3/envs/shap38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/students/cheng/anaconda3/envs/shap38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/students/cheng/MM-SHAP/ALBEF/models/xbert.py", line 329, in forward
  File "/home/students/cheng/anaconda3/envs/shap38/lib/python3.8/site-packages/torch/_tensor.py", line 562, in register_hook
    raise RuntimeError(
RuntimeError: cannot register a hook on a tensor that doesn't require gradient
srun: error: gpu08: task 0: Exited with exit code 1

The conda list:

ChengYuChuan commented 3 months ago

Hi @LetiP

After a thorough investigation, I've found that the models other than ALBEF work as expected. I tested and ran the other models, and they run without any issues.

Given this, I'd like to suggest that we close this ALBEF issue for now. The problem appears to be specific to ALBEF, and since the other models work correctly, I would rather focus on getting other models to work, such as LLaVA.

Since I would like to apply mm-shap to LLaVA, I will open a new issue about that.