alexmojaki / snoop

A powerful set of Python debugging tools, based on PySnooper
MIT License

Somehow wrapping my pytorch tensor in pp breaks it when training on gpu #31

Closed ngxingyu closed 3 years ago

ngxingyu commented 3 years ago

I added a line similar to pp(my_tensor) to view the contents of my PyTorch tensor while it is on the GPU, but it fails with the following error:

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: initialization error
Exception raised from insert_events at /opt/conda/conda-bld/pytorch_1607369981906/work/c10/cuda/CUDACachingAllocator.cpp:717 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f11b91498b2 in /home/nxingyu2/miniconda3/envs/NLP/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1070 (0x7f11b939bf20 in /home/nxingyu2/miniconda3/envs/NLP/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f11b9134b7d in /home/nxingyu2/miniconda3/envs/NLP/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x5f9e52 (0x7f11fa901e52 in /home/nxingyu2/miniconda3/envs/NLP/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>

I'm not sure whether this belongs here, but I'd like to raise it in case anyone else faces a similar issue.

alexmojaki commented 3 years ago

Interesting. I don't know how pytorch works. Did you use pp inside a function that is called by C++ or the GPU?

ngxingyu commented 3 years ago

Yes, I used it in one of the callback functions in PyTorch Lightning, which is called by the GPU. Actually, I believe this isn't a problem with snoop or PyTorch: it works when using the CPU, and when I tried the same thing with the gruns/icecream library it gave the same error. So I suppose the way PyTorch optimises training on the GPU prevents it from working. Shall I close this issue?

alexmojaki commented 3 years ago

icecream uses the same underlying library, which I wrote, so I'm equally responsible for that. I'm just wondering whether it's impossible to access source code in these circumstances, or whether something else is broken. What else is different within these function calls? Can you read files? Can you use with snoop? Do other exceptions also lead to such cryptic errors? What happens if you do this?

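import traceback

from snoop import pp

def foo():
    try:
        pp(...)  # placeholder: pass the tensor here
    except Exception:
        # Show the Python traceback if pp raises an ordinary exception
        traceback.print_exc()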
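
The other questions above (can files be read, can source code be reached) could be checked with a small probe along the following lines. This is only a sketch: `probe_environment` is a hypothetical helper, not part of snoop, and it assumes the failure has something to do with source lookup, which nothing in this thread confirms.

```python
import inspect
import traceback


def probe_environment(func):
    """Report whether file reads and source lookup work for `func`.

    Hypothetical helper (not part of snoop). Call it from inside the
    function where pp() fails and print the returned dict.
    """
    report = {"can_read_file": False, "can_get_source": False, "errors": []}

    # Can the file that defines `func` be located and opened?
    try:
        path = inspect.getsourcefile(func)
        with open(path, encoding="utf-8") as f:
            f.read(1)
        report["can_read_file"] = True
    except Exception:
        report["errors"].append(traceback.format_exc())

    # Can the source text of `func` be retrieved?
    try:
        inspect.getsource(func)
        report["can_get_source"] = True
    except Exception:
        report["errors"].append(traceback.format_exc())

    return report
```

Calling `print(probe_environment(foo))` from inside the callback and again from ordinary CPU-side code would show whether the two environments differ in either respect.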

alexmojaki commented 3 years ago

@zasdfgbnm do you know anything about this?

zasdfgbnm commented 3 years ago

I don't know; it looks more like a PyTorch problem. Do you see the same error on the latest version of PyTorch?

ngxingyu commented 3 years ago

Oh I upgraded my pytorch-lightning from 1.1.2 to 1.1.5 and the problem was fixed. Thanks!
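
For anyone hitting the same crash, a quick way to see whether the installed pytorch-lightning is at least the version that worked here is sketched below. The helper names `version_tuple` and `installed_at_least` are made up for illustration, and the thread only establishes that 1.1.2 failed and 1.1.5 worked, not which release in between contained the fix.

```python
from importlib import metadata


def version_tuple(version):
    """Turn '1.1.5' into (1, 1, 5), ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def installed_at_least(package, minimum):
    """Return True/False, or None if `package` is not installed."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    return version_tuple(installed) >= version_tuple(minimum)


# 1.1.5 is the version reported as working in this thread.
print(installed_at_least("pytorch-lightning", "1.1.5"))
```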