e-bug / volta

[TACL 2021] Code and data for the framework in "Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs"
https://aclanthology.org/2021.tacl-1.58/
MIT License

RuntimeError: CUDA error: no kernel image is available for execution on the device #22

Closed (mtkinit closed 1 year ago)

mtkinit commented 1 year ago

Hi all,

I tried to run the code on three different setups, but I always get the same error:

Traceback (most recent call last):
File "train_task.py", line 34, in <module>
  from volta.task_utils import LoadDataset, LoadLoss, ForwardModelsTrain, ForwardModelsVal
File "/data/volta/task_utils.py", line 19, in <module>
  from volta.datasets import DatasetMapTrain, DatasetMapEval
File "/data/volta/datasets/__init__.py", line 23, in <module>
  from .SVO_Probes_dataset import SVO_ProbesClassificationDataset
File "/data/volta/datasets/SVO_Probes_dataset.py", line 20, in <module>
  p = Pipeline(lang='english', gpu = True, cache_dir = './cache')
File "/root/anaconda3/envs/volta/lib/python3.6/site-packages/trankit/pipeline.py", line 85, in __init__
  self._embedding_layers.half()
File "/root/anaconda3/envs/volta/lib/python3.6/site-packages/torch/nn/modules/module.py", line 757, in half
  return self._apply(lambda t: t.half() if t.is_floating_point() else t)
File "/root/anaconda3/envs/volta/lib/python3.6/site-packages/torch/nn/modules/module.py", line 570, in _apply
  module._apply(fn)
File "/root/anaconda3/envs/volta/lib/python3.6/site-packages/torch/nn/modules/module.py", line 570, in _apply
  module._apply(fn)
File "/root/anaconda3/envs/volta/lib/python3.6/site-packages/torch/nn/modules/module.py", line 570, in _apply
  module._apply(fn)
File "/root/anaconda3/envs/volta/lib/python3.6/site-packages/torch/nn/modules/module.py", line 593, in _apply
  param_applied = fn(param)
File "/root/anaconda3/envs/volta/lib/python3.6/site-packages/torch/nn/modules/module.py", line 757, in <lambda>
  return self._apply(lambda t: t.half() if t.is_floating_point() else t)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

I followed the Repository setup steps in the README file (after the setup, I also had to install nltk through Anaconda and trankit through pip). These setups were:

These are the exact commands I executed on both the VM and inside the Docker containers:

conda create -n volta python=3.6
conda activate volta
pip install -r requirements.txt
conda install pytorch=1.4.0 torchvision cudatoolkit=10.1 -c pytorch  # torchvision version pin removed because 0.5 is not available

apt install git
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir ./

cd ..
cd tools/refer; make

cd ../..
python setup.py develop

conda install nltk
pip install trankit

Then I ran the following command, which produced the error above:

python train_task.py \
        --config_file config/vilbert_base.json --from_pretrained ctrl_vilbert.bin \
        --tasks_config_file config_tasks/vilbert_tasks.yml --task 20 \
        --adam_epsilon 1e-6 --adam_betas 0.9 0.999 --weight_decay 0.01 --warmup_proportion 0.1 --clip_grad_norm 0.0 \
        --logdir logs/SVO-Probes/ --task_specific_tokens

Any suggestions? Thanks!

e-bug commented 1 year ago

Hi!

From the message above:

RuntimeError: CUDA error: no kernel image is available for execution on the device

it seems to be just a problem with your GPU or PyTorch installation. Any luck so far?
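
One way to narrow this down is to compare the GPU's compute capability with the architectures the installed PyTorch binary was built for, since "no kernel image is available" is typically raised when the two do not match. The following is a minimal sketch, not something from the volta repository: `parse_arch`, `is_arch_supported`, and `report` are hypothetical helper names, and the check assumes a PyTorch version that provides `torch.cuda.get_arch_list()` (newer releases do; the 1.4.0 build used above may not, in which case only the capability is printed).

```python
import re


def parse_arch(tag):
    """Split a tag such as 'sm_75' or 'compute_70' into (kind, (major, minor))."""
    m = re.fullmatch(r"(sm|compute)_(\d+)(\d)[a-z]?", tag)
    if m is None:
        return None
    return m.group(1), (int(m.group(2)), int(m.group(3)))


def is_arch_supported(capability, arch_list):
    """Return True if a GPU with `capability` can run a build compiled for `arch_list`.

    Assumed rules (standard CUDA compatibility):
      * 'sm_XY' binaries run on devices with the same major version and minor >= Y;
      * 'compute_XY' PTX can be JIT-compiled for any device with capability >= X.Y.
    """
    capability = tuple(capability)
    for tag in arch_list:
        parsed = parse_arch(tag)
        if parsed is None:
            continue  # ignore tags this sketch does not understand
        kind, (major, minor) = parsed
        if kind == "sm" and major == capability[0] and minor <= capability[1]:
            return True
        if kind == "compute" and (major, minor) <= capability:
            return True
    return False


def report():
    """Print the comparison for the first visible GPU, if PyTorch and CUDA are usable."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment.")
        return
    if not torch.cuda.is_available():
        print("PyTorch cannot see a CUDA device.")
        return
    capability = torch.cuda.get_device_capability(0)
    print("PyTorch:", torch.__version__, "| CUDA build:", torch.version.cuda)
    print("GPU:", torch.cuda.get_device_name(0), "| capability:", capability)
    if hasattr(torch.cuda, "get_arch_list"):
        archs = torch.cuda.get_arch_list()
        print("Compiled architectures:", archs)
        print("Supported by this build:", is_arch_supported(capability, archs))
    else:
        print("This PyTorch version does not expose get_arch_list().")


if __name__ == "__main__":
    report()
```

If the script reports that the capability is not covered by the compiled architectures, installing a PyTorch build (and matching CUDA toolkit) that targets the GPU would be the next thing to try.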