facebookresearch / DrQA

Reading Wikipedia to Answer Open-Domain Questions

cuda runtime error (77) #37

Closed igoingdown closed 6 years ago

igoingdown commented 7 years ago

I tried to run the demo on my local machine (Ubuntu 16.04.1 LTS (GNU/Linux 4.4.0-89-generic x86_64), 64 GB RAM, 2 TITAN X (Pascal) GPUs) using the following command:

python3 scripts/pipeline/interactive.py  

The command above succeeded. Following the instructions shown in the interactive environment:

Interactive DrQA
>> process(question, candidates=None, top_n=1, n_docs=5)
>> usage()

I input:

process("who is bob dylan?")

and then I hit the following exception:

09/07/2017 03:06:52 PM: [ Processing 1 queries... ]
09/07/2017 03:06:52 PM: [ Retrieving top 5 docs... ]
09/07/2017 03:07:14 PM: [ Reading 459 paragraphs... ]
THCudaCheck FAIL file=/b/wheel/pytorch-src/torch/lib/THC/THCCachingHostAllocator.cpp line=258 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
  File "/usr/lib/python3.5/code.py", line 91, in runcode
    exec(code, self.locals)
  File "<console>", line 1, in <module>
  File "scripts/pipeline/interactive.py", line 81, in process
    question, candidates, top_n, n_docs, return_context=True
  File "/home/zmx/facebook_mc/DrQA/drqa/pipeline/drqa.py", line 184, in process
    top_n, n_docs, return_context
  File "/home/zmx/facebook_mc/DrQA/drqa/pipeline/drqa.py", line 252, in process_batch
    for batch in self._get_loader(examples, num_loaders):
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 192, in __next__
    batch = pin_memory_batch(batch)
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 124, in pin_memory_batch
    return [pin_memory_batch(sample) for sample in batch]
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 124, in <listcomp>
    return [pin_memory_batch(sample) for sample in batch]
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 118, in pin_memory_batch
    return batch.pin_memory()
  File "/usr/local/lib/python3.5/dist-packages/torch/tensor.py", line 78, in pin_memory
    return type(self)().set_(storage.pin_memory()).view_as(self)
  File "/usr/local/lib/python3.5/dist-packages/torch/storage.py", line 84, in pin_memory
    return type(self)(self.size(), allocator=allocator).copy_(self)
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /b/wheel/pytorch-src/torch/lib/THC/THCCachingHostAllocator.cpp:258

I noticed that system RAM was almost exhausted while less than 600 MB of GPU memory was in use, so I tried reducing the n_docs parameter:

process("who is bob dylan?", candidates=None, top_n=1, n_docs=1)

But that didn't work either.

However, when I passed --no-cuda, it finally worked:

python3 scripts/pipeline/interactive.py --no-cuda

The interaction is as follows:

09/07/2017 03:51:01 PM: [ Running on CPU only. ]
09/07/2017 03:51:01 PM: [ Initializing pipeline... ]
09/07/2017 03:51:01 PM: [ Initializing document ranker... ]
09/07/2017 03:51:01 PM: [ Loading /home/zmx/facebook_mc/DrQA/data/wikipedia/docs-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz ]
09/07/2017 03:52:12 PM: [ Initializing document reader... ]
09/07/2017 03:52:12 PM: [ Loading model /home/zmx/facebook_mc/DrQA/data/reader/multitask.mdl ]
09/07/2017 03:52:18 PM: [ Initializing tokenizers and document retrievers... ]

Interactive DrQA
>> process(question, candidates=None, top_n=1, n_docs=5)
>> usage()

>>> process("who is bob dylan?")
09/07/2017 03:52:40 PM: [ Processing 1 queries... ]
09/07/2017 03:52:40 PM: [ Retrieving top 5 docs... ]
09/07/2017 03:52:42 PM: [ Reading 459 paragraphs... ]
09/07/2017 03:52:54 PM: [ Processed 1 queries in 14.5980 (s) ]
Top Predictions:
+------+------------+---------------------------+--------------+-----------+
| Rank |   Answer   |            Doc            | Answer Score | Doc Score |
+------+------------+---------------------------+--------------+-----------+
|  1   | songwriter | Another Side of Bob Dylan |    1169.7    |   254.78  |
+------+------------+---------------------------+--------------+-----------+

Contexts:
[ Doc = Another Side of Bob Dylan ]
Another Side of Bob Dylan is the fourth studio album by American singer and songwriter Bob Dylan, released on August 8, 1964 by Columbia Records.

Can anybody help me solve this? --no-cuda is only a temporary workaround, because it is too slow for interactive use.

ajfisch commented 7 years ago

I'm not sure. You can try debugging with the smaller reader-only script, scripts/reader/interactive.py, and see whether you get the same CUDA errors there. That might help narrow down the problem.
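A further isolation step, sketched here under assumptions rather than taken from DrQA: the traceback above fails inside `pin_memory()`, so pinning a small tensor outside the pipeline can show whether that step fails on its own. CUDA reports kernel errors asynchronously, so an illegal memory access may surface at a later call than the one that caused it; setting the `CUDA_LAUNCH_BLOCKING` environment variable before CUDA is initialized makes launches synchronous. The helper name `check_pin_memory` is hypothetical.

```python
import os

# Make CUDA kernel launches synchronous so an error is reported at the call
# that triggered it. This has to be set before CUDA is initialized.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")


def check_pin_memory():
    """Pin a small host tensor, the step that fails in the traceback.

    Hypothetical helper for illustration; not part of DrQA or PyTorch.
    """
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if not torch.cuda.is_available():
        return "cuda not available"
    try:
        # Same call the DataLoader makes on each batch when pin_memory=True.
        torch.zeros(1024).pin_memory()
    except RuntimeError as err:
        return "pin_memory failed: {}".format(err)
    return "pin_memory ok"


print(check_pin_memory())
```

If this succeeds in a fresh interpreter but the pipeline still fails, the illegal access more likely originates in an earlier GPU operation and is only reported at the pinning step.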

ajfisch commented 6 years ago

Closing due to lack of response. Feel free to reopen.

JunjieHu commented 6 years ago

I have the same illegal memory access problem during training. Here is the error message:

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1512378422383/work/torch/lib/THC/THCCachingHostAllocator.cpp line=258 error=77 : an illegal memory access was encountered

Does anyone have a solution to this error?

PS: This problem appeared when I updated PyTorch to 0.3. The program works with PyTorch 0.2.

ajfisch commented 6 years ago

Hi @JunjieHu, I haven't tried updating DrQA to PyTorch 0.3 yet -- maybe there's something not backwards compatible or you have a corrupted build of some sort. I'll check in a bit.
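As a rough way to check the installed build, the sketch below reports the PyTorch version, the CUDA version it was built against (when the build exposes it), and whether a trivial GPU operation completes. The function name `describe_torch_build` and the dictionary keys are hypothetical, and a passing check would not rule out an incompatibility in DrQA itself.

```python
def describe_torch_build():
    """Collect basic facts about the installed PyTorch build.

    Hypothetical helper for illustration; not part of DrQA or PyTorch.
    """
    try:
        import torch
    except ImportError:
        return {"torch": None}

    info = {
        "torch": torch.__version__,
        # torch.version.cuda may be absent on old builds, hence getattr.
        "cuda_build": getattr(getattr(torch, "version", None), "cuda", None),
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device_count"] = torch.cuda.device_count()
        try:
            # float() forces the result back to the host, which synchronizes.
            float(torch.ones(8).cuda().sum())
            info["gpu_smoke_test"] = "ok"
        except RuntimeError as err:
            info["gpu_smoke_test"] = "failed: {}".format(err)
    return info


print(describe_torch_build())
```

Comparing this output between the PyTorch 0.2 and 0.3 environments would show whether the two installs differ in anything besides the PyTorch version.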