LuoweiZhou / densecap

Dense video captioning in PyTorch
BSD 3-Clause "New" or "Revised" License

exception on loading dataset #3

Closed: friskit-china closed this issue 5 years ago

friskit-china commented 5 years ago

Hi

I encountered an exception on the data loading process: https://github.com/LuoweiZhou/densecap/blob/master/data/anet_dataset.py#L146

I got this:

Traceback (most recent call last):
  File "/s1_md0/v-botsh/anaconda/py3.6_torch0.4.0/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/s1_md0/v-botsh/anaconda/py3.6_torch0.4.0/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/s1_md0/v-botsh/anaconda/py3.6_torch0.4.0/lib/python3.6/multiprocessing/pool.py", line 108, in worker
    task = get()
  File "/s1_md0/v-botsh/anaconda/py3.6_torch0.4.0/lib/python3.6/multiprocessing/queues.py", line 337, in get
    return _ForkingPickler.loads(res)
  File "/s1_md0/v-botsh/anaconda/py3.6_torch0.4.0/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 95, in rebuild_storage_cuda
    torch.cuda._lazy_init()
  File "/s1_md0/v-botsh/anaconda/py3.6_torch0.4.0/lib/python3.6/site-packages/torch/cuda/__init__.py", line 159, in _lazy_init
    "Cannot re-initialize CUDA in forked subprocess. " + msg)
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

Any idea how to fix it?

friskit-china commented 5 years ago

I tried to replace "import multiprocessing" with "from torch import multiprocessing", but I still got the same exception. :(

LuoweiZhou commented 5 years ago

Hi @friskit-china, can you first try setting --num_workers 0? In my experience, the multiprocessing package does not always play well with multiple workers. Running with a single worker might slow things down, so you may want to set pin_memory=True in your dataloader to compensate. You can also try --num_workers 1 to see if that works.
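(For reference, a minimal sketch of what those two settings look like on a plain PyTorch DataLoader; this is a generic example, not the repo's actual loader.)

import torch
from torch.utils.data import DataLoader, TensorDataset

# toy dataset standing in for the real one
dataset = TensorDataset(torch.randn(100, 16), torch.randint(0, 2, (100,)))

loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=0,    # load data in the main process, avoiding the fork/CUDA clash
    pin_memory=True,  # page-locked host memory makes host-to-GPU copies faster
)

for feats, labels in loader:
    if torch.cuda.is_available():
        feats = feats.cuda(non_blocking=True)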

Besides, judging from the error message, I'd suggest you take a look at this: https://github.com/pytorch/pytorch/issues/1494#issuecomment-305993854
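(The linked issue boils down to starting worker processes with the 'spawn' method instead of the default 'fork'. A minimal sketch of that pattern, assuming a guarded entry point; this is not the repo's code.)

import torch
import torch.multiprocessing as mp

def worker(rank):
    # each spawned process can safely initialize its own CUDA context
    if torch.cuda.is_available():
        _ = torch.zeros(1).cuda()
    print('worker', rank, 'ok')

if __name__ == '__main__':
    mp.set_start_method('spawn', force=True)  # force=True avoids "context already set" errors
    procs = [mp.Process(target=worker, args=(i,)) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()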

friskit-china commented 5 years ago

Hi Luowei, I still get the same exception when I use '--num_workers 0'. I am trying to replace the data-processing code in "https://github.com/LuoweiZhou/densecap/blob/master/data/anet_dataset.py#L139-L153" with a non-multiprocessing version. I will report the result later. Thanks :)
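Roughly, I mean swapping the pool call for a plain loop, something like this (process_video here is just a stand-in for the actual per-video sampling code, not the repo's function):

# hypothetical sketch: replace a multiprocessing.Pool map with a plain loop
def process_video(video_id):
    # ... compute positive/negative segments for one video ...
    return video_id, []

video_ids = ['vid_0001', 'vid_0002']

# multiprocessing version (roughly what the dataset does now):
#   with multiprocessing.Pool(num_workers) as pool:
#       results = pool.map(process_video, video_ids)

# single-process version: same results, no forked CUDA context
results = [process_video(vid) for vid in video_ids]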

friskit-china commented 5 years ago

Hi @LuoweiZhou I followed the instructions in https://stackoverflow.com/questions/48822463/how-to-use-pytorch-multiprocessing and replaced "import multiprocessing" with

from torch import multiprocessing
multiprocessing.set_start_method('spawn')

And now I have another exception.

pos anc: 79, neg anc: 5412
video: nfYzqyureLo
video: skrWT6xHVoI
Traceback (most recent call last):
  File "scripts/train.py", line 549, in <module>
    main(args)
  File "scripts/train.py", line 241, in main
    train_loader, valid_loader, text_proc, train_sampler = get_dataset(args)
  File "scripts/train.py", line 154, in get_dataset
    sample_listpath=args.train_samplelist_path,
  File "/s1_md0/v-botsh/Research/Repo/densecap/data/anet_dataset.py", line 155, in __init__
    results[i] = r.get()
  File "/s1_md0/v-botsh/anaconda/py3.6_torch0.4.0/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '('/s1_md0/v-botsh/Research/Repo/densecap/data/yc2/training/GLd3aX16zBg', 483, defaultdict(<class 'list'>, {2: [(2172, array([0.75006209]), 0.2875992944210204, 0.16418342852412277, tensor([ 2, 24, 4, 44, 76, 11, 36, 12, 293, 211, 5, 24, 9, 8, 58, 10, 3, 1, 1, 1], device='cuda:0')), (2173, array([0.75006209]), 0.2875992944210204, 0.05307231741301166, (Truncated)

I think this is because multiprocessing cannot pickle the processed CUDA tensors and send them between processes. Do you have any ideas? Thanks~
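(One workaround I am considering is to keep everything returned by the pool workers on the CPU and move it to the GPU only in the main process, along the lines of the sketch below; build_samples is a hypothetical stand-in, not the repo's code.)

# sketch: pool workers return CPU data only; the parent moves it to the GPU
import multiprocessing
import torch

def build_samples(video_id):
    # build the sample tensor on the CPU so it pickles cleanly
    sentence_ids = torch.tensor([2, 24, 4, 44, 76, 3, 1, 1])  # stays on CPU
    return video_id, sentence_ids

if __name__ == '__main__':
    with multiprocessing.Pool(4) as pool:
        results = pool.map(build_samples, ['vid_a', 'vid_b'])

    # only the main process touches CUDA
    if torch.cuda.is_available():
        results = [(vid, ids.cuda()) for vid, ids in results]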

LuoweiZhou commented 5 years ago

@ybzhou, any insights?

LuoweiZhou commented 5 years ago

Also, @friskit-china could you post your environment configuration here? For example, the CUDA/PyTorch versions reported by the torch package, and your Ubuntu version.
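Something like this prints everything I need (standard torch introspection calls):

# prints the environment details requested above
import platform
import torch

print('python        :', platform.python_version())
print('os            :', platform.platform())
print('pytorch       :', torch.__version__)
print('cuda (runtime):', torch.version.cuda)
print('cudnn         :', torch.backends.cudnn.version())
if torch.cuda.is_available():
    print('gpu           :', torch.cuda.get_device_name(0))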

ybzhou commented 5 years ago

May I ask the reason for using multiprocessing? Multiprocessing with PyTorch will usually lead to undesirable behavior.

LuoweiZhou commented 5 years ago

@ybzhou we generate the positive/negative segments for all the videos in the dataset (inside the __init__ function), which involves a lot of computation. Hence, we decided to use multiprocessing to parallelize it. Besides, this only needs to be done once, since we have the options --save_train_samplelist and --load_train_samplelist to save/load the segments.
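Conceptually, those two flags follow a simple cache-to-disk pattern, along these lines (the file name and builder function here are placeholders, not the repo's implementation):

# sketch of the save-once / load-later pattern behind the samplelist flags
import os
import pickle

def build_sample_list():
    # stand-in for the expensive positive/negative segment generation
    return {'vid_a': [(0, 10)], 'vid_b': [(5, 20)]}

sample_list_path = 'train_samplelist.pkl'  # hypothetical path

if os.path.isfile(sample_list_path):       # --load_train_samplelist
    with open(sample_list_path, 'rb') as f:
        sample_list = pickle.load(f)
else:                                      # --save_train_samplelist
    sample_list = build_sample_list()
    with open(sample_list_path, 'wb') as f:
        pickle.dump(sample_list, f)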

mbrei commented 5 years ago

Hi, I am having the same issue. Has anyone managed to resolve the problem?

My PyTorch version is 0.4.1.

And here are some details about my GPU: (screenshot attached)

Thanks in advance! Marie

LuoweiZhou commented 5 years ago

@mbrei thanks for the info. @ybzhou and I are looking into this issue and will get back to you soon. I suspect this results from the PyTorch 0.4 upgrade.

mbrei commented 5 years ago

Thank you very much!

LuoweiZhou commented 5 years ago

@mbrei could you tell me more about your OS version, graphics card type, CUDA version, cuDNN version, and PyTorch version?

mbrei commented 5 years ago

Sure!

This is my OS version:

Distributor ID: Debian
Description:    Debian GNU/Linux 9.9 (stretch)
Release:        9.9
Codename:       stretch

And here, again, is the information about my graphics card: (screenshot attached)

CUDA version 10.1

But PyTorch is installed with CUDA version '9.2.148' (torch.version.cuda) and cuDNN 7104 (torch.backends.cudnn.version()). Could that cause a problem?

PyTorch is at version 0.4.1, together with

torchtext 0.3.1
torchvision 0.2.2.post3

LuoweiZhou commented 5 years ago

@mbrei 10.1 is your CUDA driver API version and 9.2 is the runtime API version; that combination looks fine. Could you try running the code on other graphics cards such as a 1080 or Titan X(p)? I haven't tested the code on Kepler GPUs (e.g., K80), and we need to debug to see whether this issue is GPU-architecture-related.
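To check which architecture generation a card belongs to, you can read its compute capability directly from torch (a Kepler K80 reports 3.7, a Pascal 1080 reports 6.1):

# check the GPU architecture generation via its compute capability
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print('device            :', torch.cuda.get_device_name(0))
    print('compute capability:', '{}.{}'.format(major, minor))  # 3.x = Kepler, 6.x = Pascal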