Closed ChengYuChuan closed 3 months ago
Hi @ChengYuChuan , I am sorry you are running into hardware problems again!
I did not encounter this issue, but by looking at your hardware specs (GTX 1080 Ti) and the date of the ALBEF model publication, I am wondering whether you have the latest NVIDIA drivers.
What driver version does it report when you run nvidia-smi?
I am a bit confused about the issue, because your script seems to pass line 275, which is great, meaning you can now run a model inference! 🥳
Hello @LetiP ,
Thank you for reviewing the issue.
Here is the output of the command nvidia-smi:
GPU-08:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce GTX 1080 Ti Off | 00000000:5E:00.0 Off | N/A |
| 29% 19C P8 8W / 250W | 4MiB / 11264MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce GTX 1080 Ti Off | 00000000:86:00.0 Off | N/A |
| 48% 63C P2 201W / 250W | 7650MiB / 11264MiB | 98% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA GeForce GTX 1080 Ti Off | 00000000:AF:00.0 Off | N/A |
| 29% 21C P8 7W / 250W | 2MiB / 11264MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 1 N/A N/A 541503 C python 7646MiB |
+-----------------------------------------------------------------------------------------+
GPU-09:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce GTX 1080 Ti Off | 00000000:3B:00.0 Off | N/A |
| 25% 28C P8 11W / 250W | 2MiB / 11264MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce GTX 1080 Ti Off | 00000000:5E:00.0 Off | N/A |
| 25% 22C P8 11W / 250W | 2MiB / 11264MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA GeForce GTX 1080 Ti Off | 00000000:86:00.0 Off | N/A |
| 25% 21C P8 12W / 250W | 2MiB / 11264MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA GeForce GTX 1080 Ti Off | 00000000:AF:00.0 Off | N/A |
| 25% 21C P8 11W / 250W | 2MiB / 11264MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
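When comparing several nodes, the driver and CUDA versions can be pulled out of the nvidia-smi header instead of being read by eye. The following is a minimal sketch, assuming the header line has the same layout as in the output above; the function name parse_smi_header is hypothetical.

```python
import re

# Header line copied from the nvidia-smi output above.
HEADER = "| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |"


def parse_smi_header(line):
    """Return (driver_version, cuda_version) from an nvidia-smi header line,
    or None if the line does not look like a header."""
    match = re.search(
        r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", line
    )
    if match is None:
        return None
    return match.group(1), match.group(2)


print(parse_smi_header(HEADER))
```

Note that the CUDA version shown in this header is the highest CUDA version the driver supports, not the version of any toolkit installed in a conda environment.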
Hi, this looks good.
The next step is to make sure that the installed PyTorch build matches the CUDA version.
https://pytorch.org/get-started/locally/
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
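To see which CUDA version the active environment's PyTorch build was compiled against, PyTorch can be queried directly. This is a minimal sketch, not part of MM-SHAP; the function name torch_cuda_report is hypothetical, and it returns a short report even when PyTorch is not installed.

```python
import importlib.util


def torch_cuda_report():
    """Collect basic facts about the PyTorch/CUDA setup of this environment."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}

    import torch

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version PyTorch was built with (None for CPU-only builds).
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
        report["compute_capability"] = torch.cuda.get_device_capability(0)
    return report


for key, value in torch_cuda_report().items():
    print(f"{key}: {value}")
```

Running this on the compute node rather than the login node matters, because the login node may not have a GPU.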
Hello @LetiP,
In the beginning, I created the environment exactly as specified in environment.yml, using the command conda env create -f environment.yml.
I checked my installed versions of these packages against environment.yml. I now have higher versions than the ones listed there:
torchaudio 0.10.2 py36_cu111 pytorch
torchvision 0.11.3 py36_cu111 pytorch
My conda list output is below:
(shap) cheng@login:~/MM-SHAP$ conda list
# packages in environment at /home/students/cheng/anaconda3/envs/shap:
#
# Name Version Build Channel
_libgcc_mutex 0.1 main anaconda
_openmp_mutex 4.5 1_gnu anaconda
_py-xgboost-mutex 2.0 cpu_0 anaconda
abseil-cpp 20210324.2 h9c3ff4c_0 conda-forge
aiohttp 3.7.4.post0 py36h8f6f2f9_0 conda-forge
argon2-cffi 20.1.0 py36h27cfd23_1 anaconda
arrow-cpp 3.0.0 py36h6b21186_4 anaconda
async-timeout 3.0.1 py_1000 conda-forge
async_generator 1.10 py36h28b3542_0 anaconda
attrs 21.2.0 pyhd8ed1ab_0 conda-forge
autopep8 1.5.7 pyhd3eb1b0_0 anaconda
aws-c-common 0.4.57 he6710b0_1 anaconda
aws-c-event-stream 0.1.6 h2531618_5 anaconda
aws-checksums 0.1.9 he6710b0_0 anaconda
aws-sdk-cpp 1.8.185 hce553d0_0 anaconda
backports 1.0 py_2 anaconda
backports.functools_lru_cache 1.6.4 pyhd8ed1ab_0 conda-forge
blas 1.0 mkl anaconda
bleach 4.0.0 pyhd3eb1b0_0 anaconda
boost-cpp 1.69.0 h11c811c_1000 conda-forge
brotli 1.0.9 h7f98852_5 conda-forge
brotli-bin 1.0.9 h7f98852_5 conda-forge
brotlipy 0.7.0 py36h27cfd23_1003 anaconda
bzip2 1.0.8 h7b6447c_0 anaconda
c-ares 1.17.1 h27cfd23_0 anaconda
ca-certificates 2020.10.14 0 anaconda
certifi 2020.6.20 py36_0 anaconda
cffi 1.14.6 py36h400218f_0 anaconda
chardet 4.0.0 py36h5fab9bb_1 conda-forge
charset-normalizer 2.0.4 pyhd3eb1b0_0 anaconda
click 7.1.2 pyh9f0ad1d_0 conda-forge
cloudpickle 2.0.0 pyhd3eb1b0_0 anaconda
configparser 5.2.0 pyhd8ed1ab_0 conda-forge
cryptography 3.4.7 py36hd23ed53_0 anaconda
cuda-cudart 12.1.105 0 nvidia
cuda-cupti 12.1.105 0 nvidia
cuda-libraries 12.1.0 0 nvidia
cuda-nvrtc 12.1.105 0 nvidia
cuda-nvtx 12.1.105 0 nvidia
cuda-opencl 12.4.99 0 nvidia
cuda-runtime 12.1.0 0 nvidia
cudatoolkit 11.1.74 h6bb024c_0 nvidia
cycler 0.10.0 py36_0 anaconda
cytoolz 0.11.0 py36h7b6447c_0 anaconda
dask-core 2021.3.0 pyhd3eb1b0_0 anaconda
dataclasses 0.8 pyh4f3eec9_6 anaconda
datasets 1.12.1 pyhd8ed1ab_1 conda-forge
dbus 1.13.18 hb2f20db_0 anaconda
decorator 5.1.0 pyhd8ed1ab_0 conda-forge
defusedxml 0.7.1 pyhd3eb1b0_0 anaconda
dill 0.3.4 pyhd8ed1ab_0 conda-forge
docker-pycreds 0.4.0 py_0 anaconda
double-conversion 3.1.5 h9c3ff4c_2 conda-forge
entrypoints 0.3 pyhd8ed1ab_1003 conda-forge
expat 2.4.1 h2531618_2 anaconda
ffmpeg 4.2.2 h20bf706_0 anaconda
filelock 3.0.12 pyhd3eb1b0_1 anaconda
fontconfig 2.13.1 h6c09931_0 anaconda
freetype 2.10.4 h5ab3b9f_0 anaconda
fsspec 2021.10.0 pyhd8ed1ab_0 conda-forge
gflags 2.2.2 he1b5a44_1004 conda-forge
gitdb 4.0.9 pyhd8ed1ab_0 conda-forge
gitpython 3.1.11 py_0 conda-forge
glib 2.69.1 h5202010_0 anaconda
glog 0.5.0 h48cff8f_0 conda-forge
gmp 6.2.1 h2531618_2 anaconda
gnutls 3.6.15 he1e5248_0 anaconda
grpc-cpp 1.39.0 hae934f6_5 anaconda
gst-plugins-base 1.14.0 h8213a91_2 anaconda
gstreamer 1.14.0 h28cd5cc_2 anaconda
hdf5 1.10.2 hba1933b_1 anaconda
huggingface_hub 0.0.17 py_0 huggingface
icu 58.2 he6710b0_3 anaconda
idna 3.2 pyhd3eb1b0_0 anaconda
idna_ssl 1.1.0 py36h9f0ad1d_1001 conda-forge
imagehash 4.2.1 pyhd3eb1b0_0 anaconda
imageio 2.9.0 pyhd3eb1b0_0 anaconda
importlib-metadata 4.8.1 py36h06a4308_0 anaconda
importlib_metadata 4.8.1 hd3eb1b0_0 anaconda
intel-openmp 2021.3.0 h06a4308_3350 anaconda
ipykernel 5.5.5 py36hcb3619a_0 conda-forge
ipython 5.8.0 py36_1 conda-forge
ipython_genutils 0.2.0 py_1 conda-forge
ipywidgets 7.6.5 pyhd3eb1b0_1 anaconda
jinja2 3.0.1 pyhd3eb1b0_0 anaconda
joblib 1.0.1 pyhd3eb1b0_0 anaconda
jpeg 9b h024ee3a_2
jsonschema 3.2.0 pyhd3eb1b0_2 anaconda
jupyter_client 7.0.6 pyhd8ed1ab_0 conda-forge
jupyter_core 4.8.1 py36h5fab9bb_0 conda-forge
jupyterlab_pygments 0.1.2 py_0 anaconda
jupyterlab_widgets 1.0.0 pyhd3eb1b0_1 anaconda
kiwisolver 1.3.1 py36h2531618_0 anaconda
krb5 1.19.2 hcc1bbae_0 conda-forge
lame 3.100 h7b6447c_0 anaconda
lcms2 2.12 h3be6417_0 anaconda
ld_impl_linux-64 2.35.1 h7274673_9 anaconda
libboost 1.73.0 h3ff78a5_11
libbrotlicommon 1.0.9 h7f98852_5 conda-forge
libbrotlidec 1.0.9 h7f98852_5 conda-forge
libbrotlienc 1.0.9 h7f98852_5 conda-forge
libcublas 12.1.0.26 0 nvidia
libcufft 11.0.2.4 0 nvidia
libcufile 1.9.0.20 0 nvidia
libcurand 10.3.5.119 0 nvidia
libcurl 7.78.0 h0b77cf5_0 anaconda
libcusolver 11.4.4.55 0 nvidia
libcusparse 12.0.2.55 0 nvidia
libedit 3.1.20191231 he28a2e2_2 conda-forge
libev 4.33 h516909a_1 conda-forge
libevent 2.1.10 hcdb4288_3 conda-forge
libffi 3.3 he6710b0_2 anaconda
libgcc-ng 9.3.0 h5101ec6_17 anaconda
libgfortran-ng 7.5.0 ha8ba4b0_17 anaconda
libgfortran4 7.5.0 ha8ba4b0_17 anaconda
libgomp 9.3.0 h5101ec6_17 anaconda
libidn2 2.3.2 h7f8727e_0 anaconda
libllvm10 10.0.1 hbcb73fb_5 anaconda
libnghttp2 1.43.0 h812cca2_0 conda-forge
libnpp 12.0.2.50 0 nvidia
libnvjitlink 12.1.105 0 nvidia
libnvjpeg 12.1.1.14 0 nvidia
libopus 1.3.1 h7b6447c_0 anaconda
libpng 1.6.37 hbc83047_0 anaconda
libprotobuf 3.17.2 h4ff587b_1 anaconda
libsodium 1.0.18 h36c2ea0_1 conda-forge
libssh2 1.9.0 h1ba5d50_1 anaconda
libstdcxx-ng 9.3.0 hd4cf53a_17 anaconda
libtasn1 4.16.0 h27cfd23_0 anaconda
libthrift 0.14.2 he6d91bd_1 conda-forge
libtiff 4.2.0 h85742a9_0
libunistring 0.9.10 h27cfd23_0 anaconda
libuuid 1.0.3 h1bed415_2 anaconda
libuv 1.40.0 h7b6447c_0 anaconda
libvpx 1.7.0 h439df22_0 anaconda
libwebp-base 1.2.0 h27cfd23_0 anaconda
libxcb 1.14 h7b6447c_0 anaconda
libxgboost 1.3.3 h2531618_0 anaconda
libxml2 2.9.12 h03d6c58_0 anaconda
llvmlite 0.36.0 py36h612dafd_4 anaconda
lz4-c 1.9.3 h295c915_1 anaconda
markupsafe 2.0.1 py36h27cfd23_0 anaconda
matplotlib 3.3.4 py36h06a4308_0 anaconda
matplotlib-base 3.3.4 py36h62a2d02_0 anaconda
mistune 0.8.4 py36h7b6447c_0 anaconda
mkl 2020.2 256 anaconda
mkl-service 2.3.0 py36he8ac12f_0
mkl_fft 1.3.0 py36h54f3939_0
mkl_random 1.1.1 py36h0573a6f_0 anaconda
multidict 5.1.0 py36h27cfd23_2 anaconda
multiprocess 0.70.12.2 py36h8f6f2f9_0 conda-forge
nbclient 0.5.3 pyhd3eb1b0_0 anaconda
nbconvert 6.0.7 py36_0 anaconda
nbformat 5.1.3 pyhd3eb1b0_0 anaconda
ncurses 6.2 he6710b0_1 anaconda
nest-asyncio 1.5.1 pyhd8ed1ab_0 conda-forge
nettle 3.7.3 hbbd107a_1 anaconda
networkx 2.5 py_0 anaconda
ninja 1.10.2 hff7bd54_1 anaconda
notebook 6.3.0 py36h06a4308_0 anaconda
numba 0.53.1 py36ha9443f7_0 anaconda
numpy 1.19.2 py36h54aff64_0
numpy-base 1.19.2 py36hfa32c7d_0
olefile 0.46 py36_0 anaconda
opencv 3.4.1 py36h6fd60c2_1 anaconda
opencv-python 4.5.3.56 pypi_0 pypi
openh264 2.1.0 hd408876_0 anaconda
openjpeg 2.4.0 h3ad879b_0 anaconda
openssl 1.1.1n h7f8727e_0 anaconda
orc 1.6.9 ha97a36c_3 anaconda
packaging 21.0 pyhd3eb1b0_0 anaconda
pandas 1.1.5 py36ha9443f7_0 anaconda
pandoc 2.12 h06a4308_0 anaconda
pandocfilters 1.4.3 py36h06a4308_1 anaconda
pathtools 0.1.2 py_1 anaconda
pcre 8.45 h295c915_0 anaconda
pexpect 4.8.0 pyh9f0ad1d_2 conda-forge
pickleshare 0.7.5 py_1003 conda-forge
pillow 8.3.1 py36h2c7a002_0 anaconda
pip 21.2.2 py36h06a4308_0 anaconda
prometheus_client 0.11.0 pyhd3eb1b0_0 anaconda
promise 2.3 py36h5fab9bb_4 conda-forge
prompt_toolkit 1.0.15 py_1 conda-forge
protobuf 3.17.2 py36h295c915_0 anaconda
psutil 5.8.0 py36h27cfd23_1 anaconda
ptyprocess 0.7.0 pyhd3deb0d_0 conda-forge
py-xgboost 1.3.3 py36h06a4308_0 anaconda
pyarrow 3.0.0 py36he0739d4_3 anaconda
pycodestyle 2.7.0 pyhd3eb1b0_0 anaconda
pycparser 2.20 py_2 anaconda
pygments 2.10.0 pyhd8ed1ab_0 conda-forge
pyopenssl 20.0.1 pyhd3eb1b0_1 anaconda
pyparsing 2.4.7 pyhd3eb1b0_0 anaconda
pyqt 5.9.2 py36h05f1152_2 anaconda
pyrsistent 0.17.3 py36h7b6447c_0 anaconda
pysocks 1.7.1 py36h06a4308_0 anaconda
python 3.6.13 h12debd9_1 anaconda
python-dateutil 2.8.2 pyhd3eb1b0_0 anaconda
python-wget 3.2 py_0 conda-forge
python-xxhash 2.0.2 py36h8f6f2f9_0 conda-forge
python_abi 3.6 1_cp36m huggingface
pytorch 1.10.2 py3.6_cuda11.1_cudnn8.0.5_0 pytorch
pytorch-cuda 12.1 ha16c6d3_5 pytorch
pytorch-mutex 1.0 cuda pytorch
pytz 2021.1 pyhd3eb1b0_0 anaconda
pywavelets 1.1.1 py36h7b6447c_2 anaconda
pyyaml 5.4.1 py36h27cfd23_1 anaconda
pyzmq 19.0.2 py36h9947dbf_2 conda-forge
qt 5.9.7 h5867ecd_1
re2 2021.08.01 h9c3ff4c_0 conda-forge
readline 8.1 h27cfd23_0 anaconda
regex 2021.8.3 py36h7f8727e_0 anaconda
requests 2.26.0 pyhd3eb1b0_0 anaconda
ruamel_yaml 0.15.87 py36h7b6447c_1 anaconda
sacremoses master py_0 huggingface
scikit-image 0.17.2 py36hdf5156a_0 anaconda
scikit-learn 0.24.2 py36ha9443f7_0 anaconda
scipy 1.5.2 py36h0b6359f_0
send2trash 1.8.0 pyhd3eb1b0_1 anaconda
sentry-sdk 1.5.4 pyhd8ed1ab_0 conda-forge
setuptools 58.0.4 py36h06a4308_0 anaconda
shortuuid 1.0.1 py_0 conda-forge
simplegeneric 0.8.1 py_1 conda-forge
sip 4.19.8 py36hf484d3e_0 anaconda
six 1.16.0 pyhd3eb1b0_0 anaconda
slicer 0.0.7 pyhd8ed1ab_0 conda-forge
smmap 3.0.5 pyh44b312d_0 conda-forge
snappy 1.1.8 he1b5a44_3 conda-forge
sqlite 3.36.0 hc218d9a_0 anaconda
subprocess32 3.5.4 py_1 anaconda
tbb 2020.3 hfd86e86_0 anaconda
termcolor 1.1.0 py_2 conda-forge
terminado 0.9.4 py36h06a4308_0 anaconda
testpath 0.5.0 pyhd3eb1b0_0 anaconda
threadpoolctl 2.2.0 pyh0d69192_0 anaconda
tifffile 2020.10.1 py36hdd07704_2 anaconda
timm 0.5.4 pypi_0 pypi
tk 8.6.11 h1ccaba5_0 anaconda
tokenizers 0.10.3 py36_0 huggingface
toml 0.10.2 pyhd3eb1b0_0 anaconda
toolz 0.11.2 pyhd3eb1b0_0 anaconda
torchaudio 0.10.2 py36_cu111 pytorch
torchvision 0.11.3 py36_cu111 pytorch
tornado 6.1 py36h27cfd23_0 anaconda
tqdm 4.62.2 pyhd3eb1b0_1 anaconda
traitlets 4.3.3 pyhd8ed1ab_2 conda-forge
transformers 4.11.1 py_0 huggingface
typing-extensions 3.10.0.2 hd3eb1b0_0 anaconda
typing_extensions 3.10.0.2 pyh06a4308_0 anaconda
uriparser 0.9.3 he1b5a44_1 conda-forge
urllib3 1.26.6 pyhd3eb1b0_1 anaconda
utf8proc 2.6.1 h27cfd23_0 anaconda
wandb 0.12.10 pyhd8ed1ab_0 conda-forge
wcwidth 0.2.5 pyh9f0ad1d_2 conda-forge
webencodings 0.5.1 py36_1 anaconda
wheel 0.37.0 pyhd3eb1b0_1 anaconda
widgetsnbextension 3.5.1 py36_0 anaconda
x264 1!157.20191217 h7b6447c_0 anaconda
xgboost 1.3.3 py36h06a4308_0 anaconda
xxhash 0.8.0 h7f98852_3 conda-forge
xz 5.2.5 h7b6447c_0 anaconda
yaml 0.2.5 h7b6447c_0 anaconda
yarl 1.6.3 py36h8f6f2f9_2 conda-forge
yaspin 2.1.0 pyhd8ed1ab_0 conda-forge
zeromq 4.3.4 h9c3ff4c_0 conda-forge
zipp 3.5.0 pyhd3eb1b0_0 anaconda
zlib 1.2.11 h7b6447c_3 anaconda
zstd 1.4.9 haebb681_0 anaconda
It looks like your PyTorch installation was built for CUDA 11 and not CUDA 12 (the build string says py36_cu111). This might be the issue.
When I was working on the project, I used CUDA 11 because CUDA 12 did not exist back then. Your cards now run with CUDA 12, but your PyTorch installation uses CUDA 11.
Try moving away from the CUDA and PyTorch versions I used back then, install PyTorch with CUDA 12, and see if it helps.
https://pytorch.org/get-started/locally/
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
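The CUDA version of a package can be read off its conda build string, as done above for py36_cu111. Below is a best-effort sketch of that reading; the helper name cuda_from_build is hypothetical, and the two patterns only cover the build-string styles that appear in this thread.

```python
import re


def cuda_from_build(build):
    """Guess the CUDA version encoded in a conda build string, or return None."""
    # Long form, e.g. "py3.6_cuda11.1_cudnn8.0.5_0" -> "11.1"
    match = re.search(r"cuda(\d+)\.(\d+)", build)
    if match:
        return f"{match.group(1)}.{match.group(2)}"
    # Short form, e.g. "py36_cu111" -> "11.1" (the last digit is the minor version)
    match = re.search(r"cu(\d+)(\d)(?!\d)", build)
    if match:
        return f"{match.group(1)}.{match.group(2)}"
    return None


# Build strings copied from the conda list output in this thread.
for build in ["py3.6_cuda11.1_cudnn8.0.5_0", "py36_cu111", "py38_cu121"]:
    print(build, "->", cuda_from_build(build))
```

Applied to the first conda list, this reports CUDA 11.1 for pytorch, torchaudio, and torchvision, even though pytorch-cuda 12.1 is also installed in that environment.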
Hmm, I tried an environment with Python 3.8, torch 2.2, and torchvision 0.17, but it still shows the same problem.
I would like to try mm-shap_lxmert_dataset.py now and check whether it happens again.
The OOM:
gpu08
RuntimeError: module compiled against API version 0xe but this version of numpy is 0xd
0%| | 0/534 [00:00<?, ?it/s]
0%| | 0/534 [00:00<?, ?it/s]
Traceback (most recent call last):
File "mm-shap_albef_dataset.py", line 304, in <module>
shap_values = explainer(X)
File "/home/students/cheng/MM-SHAP/shap/explainers/_permutation.py", line 60, in __call__
return super().__call__(
File "/home/students/cheng/MM-SHAP/shap/explainers/_permutation.py", line 74, in __call__
return super().__call__(
File "/home/students/cheng/MM-SHAP/shap/explainers/_explainer.py", line 258, in __call__
row_result = self.explain_row(
File "/home/students/cheng/MM-SHAP/shap/explainers/_permutation.py", line 134, in explain_row
outputs = fm(masks, zero_index=0, batch_size=batch_size)
File "/home/students/cheng/MM-SHAP/shap/utils/_masked_model.py", line 65, in __call__
return self._full_masking_call(full_masks, zero_index=zero_index, batch_size=batch_size)
File "/home/students/cheng/MM-SHAP/shap/utils/_masked_model.py", line 141, in _full_masking_call
outputs = self.model(*joined_masked_inputs)
File "/home/students/cheng/MM-SHAP/shap/models/_model.py", line 21, in __call__
return np.array(self.inner_model(*args))
File "mm-shap_albef_dataset.py", line 180, in get_model_prediction
outputs = model(masked_image.cuda(),
File "/home/students/cheng/anaconda3/envs/shap38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/students/cheng/anaconda3/envs/shap38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "mm-shap_albef_dataset.py", line 85, in forward
output = self.text_encoder(text.input_ids,
File "/home/students/cheng/anaconda3/envs/shap38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/students/cheng/anaconda3/envs/shap38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/students/cheng/MM-SHAP/ALBEF/models/xbert.py", line 1056, in forward
encoder_outputs = self.encoder(
File "/home/students/cheng/anaconda3/envs/shap38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/students/cheng/anaconda3/envs/shap38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/students/cheng/MM-SHAP/ALBEF/models/xbert.py", line 594, in forward
layer_outputs = layer_module(
File "/home/students/cheng/anaconda3/envs/shap38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/students/cheng/anaconda3/envs/shap38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/students/cheng/MM-SHAP/ALBEF/models/xbert.py", line 498, in forward
cross_attention_outputs = self.crossattention(
File "/home/students/cheng/anaconda3/envs/shap38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/students/cheng/anaconda3/envs/shap38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/students/cheng/MM-SHAP/ALBEF/models/xbert.py", line 400, in forward
self_outputs = self.self(
File "/home/students/cheng/anaconda3/envs/shap38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/students/cheng/anaconda3/envs/shap38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/students/cheng/MM-SHAP/ALBEF/models/xbert.py", line 329, in forward
attention_probs.register_hook(self.save_attn_gradients)
File "/home/students/cheng/anaconda3/envs/shap38/lib/python3.8/site-packages/torch/_tensor.py", line 562, in register_hook
raise RuntimeError(
RuntimeError: cannot register a hook on a tensor that doesn't require gradient
srun: error: gpu08: task 0: Exited with exit code 1
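The final RuntimeError in this traceback is the one PyTorch raises when register_hook is called on a tensor that is not tracking gradients, for example a tensor computed under torch.no_grad(). Whether that is what happens at xbert.py line 329 is not established in this thread. The sketch below only illustrates a guard around such a call: the helper name register_hook_if_tracking is hypothetical, the _FakeTensor class is a stand-in so the example runs without PyTorch, and whether skipping the hook is acceptable for MM-SHAP is an assumption.

```python
def register_hook_if_tracking(tensor, hook):
    """Attach a backward hook only when the tensor tracks gradients.

    Returns the hook handle, or None when requires_grad is False, which is
    the case where Tensor.register_hook raises the RuntimeError shown above.
    """
    if getattr(tensor, "requires_grad", False):
        return tensor.register_hook(hook)
    return None


class _FakeTensor:
    """Stand-in for torch.Tensor, mimicking only the behaviour needed here."""

    def __init__(self, requires_grad):
        self.requires_grad = requires_grad
        self.hooks = []

    def register_hook(self, hook):
        if not self.requires_grad:
            raise RuntimeError(
                "cannot register a hook on a tensor that doesn't require gradient"
            )
        self.hooks.append(hook)
        return len(self.hooks) - 1


no_grad_tensor = _FakeTensor(requires_grad=False)
tracked_tensor = _FakeTensor(requires_grad=True)

print(register_hook_if_tracking(no_grad_tensor, print))  # hook is skipped
print(register_hook_if_tracking(tracked_tensor, print))  # hook is registered
```

With a real model, the same check would be applied to the tensor passed to register_hook; a guard like this hides the symptom, so it is still worth finding out why the tensor does not require gradients.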
The conda list:
(shap38) cheng@login:~/MM-SHAP$ conda list
# packages in environment at /home/students/cheng/anaconda3/envs/shap38:
#
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_gnu conda-forge
abseil-cpp 20211102.0 h27087fc_1 conda-forge
aiohttp 3.8.1 py38h0a891b7_1 conda-forge
aiosignal 1.3.1 pyhd8ed1ab_0 conda-forge
aom 3.6.0 h6a678d5_0
arrow-cpp 14.0.2 h374c478_1
async-timeout 4.0.3 pyhd8ed1ab_0 conda-forge
attrs 23.2.0 pyh71513ae_0 conda-forge
aws-c-auth 0.6.19 h5eee18b_0
aws-c-cal 0.5.20 hdbd6064_0
aws-c-common 0.8.5 h5eee18b_0
aws-c-compression 0.2.16 h5eee18b_0
aws-c-event-stream 0.2.15 h6a678d5_0
aws-c-http 0.6.25 h5eee18b_0
aws-c-io 0.13.10 h5eee18b_0
aws-c-mqtt 0.7.13 h5eee18b_0
aws-c-s3 0.1.51 hdbd6064_0
aws-c-sdkutils 0.1.6 h5eee18b_0
aws-checksums 0.1.13 h5eee18b_0
aws-crt-cpp 0.18.16 h6a678d5_0
aws-sdk-cpp 1.10.55 h721c034_0
blas 1.0 mkl
blosc 1.21.3 h6a678d5_0
boost-cpp 1.78.0 he72f1d9_0 conda-forge
bottleneck 1.3.4 py38h3ec907f_0 conda-forge
brotli 1.0.9 h5eee18b_7
brotli-bin 1.0.9 h5eee18b_7
brotli-python 1.0.9 py38h6a678d5_7
brunsli 0.1 h2531618_0
bzip2 1.0.8 h5eee18b_5
c-ares 1.19.1 h5eee18b_0
ca-certificates 2024.3.11 h06a4308_0
certifi 2024.2.2 pyhd8ed1ab_0 conda-forge
cfitsio 3.470 h5893167_7
charls 2.2.0 h2531618_0
charset-normalizer 2.0.4 pyhd3eb1b0_0
click 8.1.7 py38h06a4308_0
cloudpickle 2.2.1 py38h06a4308_0
cuda-cudart 12.1.105 0 nvidia
cuda-cupti 12.1.105 0 nvidia
cuda-libraries 12.1.0 0 nvidia
cuda-nvrtc 12.1.105 0 nvidia
cuda-nvtx 12.1.105 0 nvidia
cuda-opencl 12.4.99 0 nvidia
cuda-runtime 12.1.0 0 nvidia
cytoolz 0.12.2 py38h5eee18b_0
dask-core 2023.4.1 py38h06a4308_0
dataclasses 0.8 pyhc8e2a94_3 conda-forge
datasets 2.18.0 pyhd8ed1ab_0 conda-forge
dav1d 1.2.1 h5eee18b_0
dill 0.3.8 pyhd8ed1ab_0 conda-forge
ffmpeg 4.3 hf484d3e_0 pytorch
fftw 3.3.9 h5eee18b_2
filelock 3.13.1 py38h06a4308_0
freetype 2.12.1 h4a9f257_0
frozenlist 1.3.0 py38h0a891b7_1 conda-forge
fsspec 2023.10.0 py38h06a4308_0
gflags 2.2.2 he1b5a44_1004 conda-forge
giflib 5.2.1 h5eee18b_3
glog 0.5.0 h48cff8f_0 conda-forge
gmp 6.2.1 h295c915_3
gmpy2 2.1.2 py38heeb90bb_0
gnutls 3.6.15 he1e5248_0
grpc-cpp 1.48.2 he1ff14a_1 anaconda
huggingface_hub 0.21.4 pyhd8ed1ab_0 conda-forge
icu 70.1 h27087fc_0 conda-forge
idna 3.4 py38h06a4308_0
imagecodecs 2021.8.26 py38hfcb8610_2 anaconda
imageio 2.33.1 py38h06a4308_0
importlib-metadata 7.0.1 py38h06a4308_0
importlib_metadata 7.0.1 hd3eb1b0_0
intel-openmp 2021.4.0 h06a4308_3561
jinja2 3.1.3 py38h06a4308_0
joblib 1.2.0 py38h06a4308_0
jpeg 9e h5eee18b_1
jxrlib 1.1 h7b6447c_2
krb5 1.20.1 h143b758_1
lame 3.100 h7b6447c_0
lazy_loader 0.3 py38h06a4308_0
lcms2 2.12 h3be6417_0
ld_impl_linux-64 2.38 h1181459_1
lerc 3.0 h295c915_0
libaec 1.0.4 he6710b0_1
libavif 0.11.1 h5eee18b_0
libblas 3.9.0 12_linux64_mkl conda-forge
libbrotlicommon 1.0.9 h5eee18b_7
libbrotlidec 1.0.9 h5eee18b_7
libbrotlienc 1.0.9 h5eee18b_7
libcblas 3.9.0 12_linux64_mkl conda-forge
libcublas 12.1.0.26 0 nvidia
libcufft 11.0.2.4 0 nvidia
libcufile 1.9.0.20 0 nvidia
libcurand 10.3.5.119 0 nvidia
libcurl 8.5.0 h251f7ec_0
libcusolver 11.4.4.55 0 nvidia
libcusparse 12.0.2.55 0 nvidia
libdeflate 1.17 h5eee18b_1
libedit 3.1.20230828 h5eee18b_0
libev 4.33 h7f8727e_1
libevent 2.1.12 hdbd6064_1 anaconda
libffi 3.4.4 h6a678d5_0
libgcc-ng 13.2.0 h807b86a_5 conda-forge
libgfortran-ng 11.2.0 h00389a5_1
libgfortran5 11.2.0 h1234567_1
libgomp 13.2.0 h807b86a_5 conda-forge
libiconv 1.16 h7f8727e_2
libidn2 2.3.4 h5eee18b_0
libjpeg-turbo 2.0.0 h9bf148f_0 pytorch
liblapack 3.9.0 12_linux64_mkl conda-forge
libllvm11 11.1.0 hf817b99_3 conda-forge
libllvm14 14.0.6 hdb19cb5_3
libnghttp2 1.57.0 h2d74bed_0
libnpp 12.0.2.50 0 nvidia
libnvjitlink 12.1.105 0 nvidia
libnvjpeg 12.1.1.14 0 nvidia
libpng 1.6.39 h5eee18b_0
libprotobuf 3.20.3 he621ea3_0 anaconda
libssh2 1.10.0 hdbd6064_2
libstdcxx-ng 11.2.0 h1234567_1
libtasn1 4.19.0 h5eee18b_0
libthrift 0.15.0 h1795dd8_2 anaconda
libtiff 4.5.1 h6a678d5_0
libunistring 0.9.10 h27cfd23_0
libwebp-base 1.3.2 h5eee18b_0
libzlib 1.2.13 hd590300_5 conda-forge
libzopfli 1.0.3 he6710b0_0
llvm-openmp 14.0.6 h9e868ea_0
llvmlite 0.38.1 py38h38d86a4_0 conda-forge
locket 1.0.0 py38h06a4308_0
lz4-c 1.9.4 h6a678d5_0
markupsafe 2.1.3 py38h5eee18b_0
mkl 2021.4.0 h06a4308_640
mkl-service 2.4.0 py38h7f8727e_0
mkl_fft 1.3.1 py38hd3c417c_0
mkl_random 1.2.2 py38h51133e4_0
mpc 1.1.0 h10f8cd9_1
mpfr 4.0.2 hb69a4c5_1
mpmath 1.3.0 py38h06a4308_0
multidict 6.0.2 py38h0a891b7_1 conda-forge
multiprocess 0.70.12.2 py38h0a891b7_2 conda-forge
ncurses 6.4 h6a678d5_0
nettle 3.7.3 hbbd107a_1
networkx 3.1 py38h06a4308_0
numba 0.55.1 py38h4bf6c61_0 conda-forge
numexpr 2.8.4 py38he184ba9_0
numpy 1.19.2 py38hf89b668_1 conda-forge
numpy-base 1.24.3 py38h31eccc5_0
openh264 2.1.1 h4ff587b_0
openjpeg 2.4.0 h3ad879b_0
openssl 3.2.1 hd590300_1 conda-forge
orc 1.7.4 hb3bc3d3_1 anaconda
packaging 23.2 py38h06a4308_0
pandas 1.4.1 py38h43a58ef_0 conda-forge
partd 1.4.1 py38h06a4308_0
pillow 10.2.0 py38h5eee18b_0
pip 23.3.1 py38h06a4308_0
platformdirs 3.10.0 py38h06a4308_0
pooch 1.7.0 py38h06a4308_0
pyarrow 14.0.2 py38h1eedbd7_0
pyarrow-hotfix 0.6 pyhd8ed1ab_0 conda-forge
pysocks 1.7.1 py38h06a4308_0
python 3.8.19 h955ad1f_0
python-dateutil 2.8.2 pyhd3eb1b0_0
python-tzdata 2023.3 pyhd3eb1b0_0
python-xxhash 1.4.4 py38h1e0a361_0 conda-forge
python_abi 3.8 2_cp38 conda-forge
pytorch 2.2.1 py3.8_cuda12.1_cudnn8.9.2_0 pytorch
pytorch-cuda 12.1 ha16c6d3_5 pytorch
pytorch-mutex 1.0 cuda pytorch
pytz 2023.3.post1 py38h06a4308_0
pywavelets 1.4.1 py38h5eee18b_0
pyyaml 6.0.1 py38h5eee18b_0
re2 2022.04.01 h27087fc_0 conda-forge
readline 8.2 h5eee18b_0
regex 2022.4.24 py38h0a891b7_0 conda-forge
requests 2.31.0 py38h06a4308_1
s2n 1.3.27 hdbd6064_0
sacremoses 0.0.53 pyhd8ed1ab_0 conda-forge
safetensors 0.4.2 py38h0cc4f7c_0 conda-forge
scikit-image 0.19.2 py38h43a58ef_0 conda-forge
scikit-learn 1.0.2 py38h1561384_0 conda-forge
scipy 1.9.1 py38h14f4228_0
setuptools 68.2.2 py38h06a4308_0
six 1.16.0 pyhd3eb1b0_1
slicer 0.0.7 pyhd3eb1b0_0
snappy 1.1.10 h6a678d5_1
sqlite 3.41.2 h5eee18b_0
sympy 1.12 py38h06a4308_0
tbb 2021.8.0 hdb19cb5_0
threadpoolctl 2.2.0 pyh0d69192_0
tifffile 2021.11.2 pyhd8ed1ab_0 conda-forge
timm 0.9.16 pyhd8ed1ab_0 conda-forge
tk 8.6.12 h1ccaba5_0
tokenizers 0.10.3 py38hb63a372_1 conda-forge
toolz 0.12.0 py38h06a4308_0
torchaudio 2.2.1 py38_cu121 pytorch
torchtriton 2.2.0 py38 pytorch
torchvision 0.17.1 py38_cu121 pytorch
tqdm 4.65.0 py38hb070fc8_0
transformers 4.11.1 pyhd8ed1ab_0 conda-forge
typing-extensions 4.9.0 py38h06a4308_1
typing_extensions 4.9.0 py38h06a4308_1
urllib3 2.1.0 py38h06a4308_1
utf8proc 2.6.1 h27cfd23_0 anaconda
wheel 0.41.2 py38h06a4308_0
xz 5.4.6 h5eee18b_0
yaml 0.2.5 h7b6447c_0
yarl 1.7.2 py38h0a891b7_2 conda-forge
zfp 0.5.5 h9c3ff4c_8 conda-forge
zipp 3.17.0 py38h06a4308_0
zlib 1.2.13 hd590300_5 conda-forge
zstd 1.5.5 hc292b87_0
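One detail in this list is that numpy and numpy-base report different versions (1.19.2 and 1.24.3). This may be related to the "module compiled against API version 0xe but this version of numpy is 0xd" line at the top of the traceback, although the thread does not confirm it. A minimal sketch for spotting such a mismatch in conda list output follows; the function name parse_conda_list is hypothetical.

```python
def parse_conda_list(text):
    """Map package name -> version from `conda list` output."""
    versions = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header comments
        fields = line.split()
        if len(fields) >= 2:
            versions[fields[0]] = fields[1]
    return versions


# Two lines copied from the conda list output above.
SAMPLE = """\
# Name Version Build Channel
numpy 1.19.2 py38hf89b668_1 conda-forge
numpy-base 1.24.3 py38h31eccc5_0
"""

pkgs = parse_conda_list(SAMPLE)
mismatch = pkgs["numpy"] != pkgs["numpy-base"]
print(pkgs["numpy"], pkgs["numpy-base"], "mismatch" if mismatch else "consistent")
```

The full output of conda list can be passed to the same function, for example after capturing it with subprocess.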
Hi @LetiP
After a thorough investigation, I found that the models other than ALBEF run as expected without any issues.
Given this, I suggest we close this ALBEF issue for now. The problem appears to be specific to ALBEF, and since the other models work correctly, I would rather focus on getting other models, such as LLaVA, to work.
Since I would like to apply MM-SHAP to LLaVA, I will open a new issue about that.
Hello @LetiP,
It's me again :P Thank you for your patience and time.
The GPUs I am using: 4x NVIDIA GTX 1080 Ti (Pascal, 11 GB memory), in a server with 24 cores / 48 threads and 256 GB of memory.
Here are my settings at the beginning of mm-shap_albef_dataset.py:
I googled for solutions to this issue, and it is usually related to:
However, neither of these two issues sounds like my case. Have you encountered a similar problem?
Here is the OOM: