I tried to replace `import multiprocessing` with `from torch import multiprocessing`, but I still got the same exception. :(
Hi @friskit-china, can you first try setting `--num_workers 0`? In my experience, the `multiprocessing` package does not always go well with multi-worker data loading. This might slow things down, so you might want to set `pin_memory=True` in your dataloader. You can also try `--num_workers 1` to see if it works.
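For reference, a minimal sketch of those settings, using a stand-in `TensorDataset` rather than the repo's actual dataset class:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; substitute the real one from this repo.
dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))

loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=0,    # no worker subprocesses: rules out dataloader multiprocessing issues
    pin_memory=True,  # page-locked host memory to offset the single-worker slowdown
)

for features, labels in loader:
    pass  # training step would go here
```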
Besides, judging from the error message, I'd suggest you look at this: https://github.com/pytorch/pytorch/issues/1494#issuecomment-305993854
Hi Luowei, I still got the same exception when I used `--num_workers 0`. I am trying to replace the data processing code in https://github.com/LuoweiZhou/densecap/blob/master/data/anet_dataset.py#L139-L153 with a non-multiprocessing version. I will report the result later. Thanks :)
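(For anyone attempting the same workaround, a serial fallback might look like the sketch below; `process_video` and `video_prefixes` are hypothetical stand-ins for the per-video sampling function and video list in `anet_dataset.py`.)

```python
def process_video(video_prefix):
    # Hypothetical stand-in for the real per-video positive/negative
    # segment sampling done in anet_dataset.py.
    return video_prefix, []

video_prefixes = ["videoA", "videoB"]  # stand-in for the dataset's video list

# Serial version: same results as the multiprocessing pool, but computed in
# one process, so no tensors are pickled across process boundaries.
results = [process_video(v) for v in video_prefixes]
```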
Hi @LuoweiZhou, I followed the instructions in https://stackoverflow.com/questions/48822463/how-to-use-pytorch-multiprocessing and replaced

```python
import multiprocessing
```

with

```python
from torch import multiprocessing
multiprocessing.set_start_method('spawn')
```

And now I get another exception:
```
pos anc: 79, neg anc: 5412
video: nfYzqyureLo
video: skrWT6xHVoI
Traceback (most recent call last):
  File "scripts/train.py", line 549, in <module>
    main(args)
  File "scripts/train.py", line 241, in main
    train_loader, valid_loader, text_proc, train_sampler = get_dataset(args)
  File "scripts/train.py", line 154, in get_dataset
    sample_listpath=args.train_samplelist_path,
  File "/s1_md0/v-botsh/Research/Repo/densecap/data/anet_dataset.py", line 155, in __init__
    results[i] = r.get()
  File "/s1_md0/v-botsh/anaconda/py3.6_torch0.4.0/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '('/s1_md0/v-botsh/Research/Repo/densecap/data/yc2/training/GLd3aX16zBg', 483, defaultdict(<class 'list'>, {2: [(2172, array([0.75006209]), 0.2875992944210204, 0.16418342852412277, tensor([ 2, 24, 4, 44, 76, 11, 36, 12, 293, 211, 5, 24, 9, 8, 58, 10, 3, 1, 1, 1], device='cuda:0')), (2173, array([0.75006209]), 0.2875992944210204, 0.05307231741301166, (Truncated)
```
I think it is because multiprocessing does not support encoding (pickling) the processed CUDA tensor data and sending it between processes. Do you have any ideas? Thanks~
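(If that diagnosis is right, one possible workaround is to keep worker results on the CPU and move them to the GPU only in the parent process; a minimal sketch with a hypothetical worker function:)

```python
import torch

def make_sample(seg_idx):
    # Hypothetical worker: build the sentence tensor on the CPU. CPU tensors
    # pickle cleanly between processes; CUDA tensors (like the one in the
    # traceback above, device='cuda:0') need CUDA re-initialized per child.
    sentence = torch.tensor([2, 24, 4, 44, 76, 11])
    return seg_idx, sentence

# In the parent process, after collecting worker results:
seg_idx, sentence = make_sample(2172)
device = "cuda" if torch.cuda.is_available() else "cpu"
sentence = sentence.to(device)  # move to GPU only once back in the parent
```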
@ybzhou, any insights?
Also, @friskit-china, could you post your environment configs here? E.g., the CUDA/PyTorch versions returned by the torch package, and your Ubuntu version.
May I know the reason for using multiprocessing? Multiprocessing with PyTorch usually leads to undesirable behavior.
@ybzhou we generate the positive/negative segments for all the videos in the dataset (inside the `__init__` function), and this means a lot of computation. Hence, we decided to use `multiprocessing` for parallel computing. Besides, this only needs to be done once, since we have the options `--save_train_samplelist` and `--load_train_samplelist` to save/load the segments.
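(For context, the pattern in question is roughly the sketch below; `sample_video` is a hypothetical stand-in for the actual per-video function in `anet_dataset.py`.)

```python
import multiprocessing

def sample_video(video_prefix):
    # Stand-in for generating positive/negative segments for one video.
    return video_prefix, {"pos": [], "neg": []}

if __name__ == "__main__":
    videos = ["videoA", "videoB", "videoC"]
    with multiprocessing.Pool(processes=4) as pool:
        async_results = [pool.apply_async(sample_video, (v,)) for v in videos]
        # r.get() re-raises worker-side errors in the parent; this is where
        # the MaybeEncodingError in the traceback above surfaces.
        results = [r.get() for r in async_results]
```

Saving the collected samples once (as `--save_train_samplelist` does) then avoids re-running this step on every launch.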
Hi, I am having the same issue. Could someone resolve the problem?
My PyTorch is version 0.4.1, and here are some details about my GPU:

*(screenshot with GPU details)*

Thanks in advance! Marie
@mbrei thanks for the info. @ybzhou and I are looking into this issue and will get back to you soon. I suspect this results from the PyTorch 0.4 upgrade.
Thank you very much!
@mbrei could you tell me more about your OS version, graphics card type, CUDA version, cuDNN version, and PyTorch version?
Sure!

This is my OS version:

```
Distributor ID: Debian
Description:    Debian GNU/Linux 9.9 (stretch)
Release:        9.9
Codename:       stretch
```
And here again the information about my graphics card: CUDA version 10.1.

But PyTorch is installed with CUDA version '9.2.148' (`torch.version.cuda`) and cuDNN 7104 (`torch.backends.cudnn.version()`). Could that cause a problem?
PyTorch is at version 0.4.1, together with torchtext 0.3.1 and torchvision 0.2.2.post3.
@mbrei 10.1 is your CUDA driver API version and 9.2 is the runtime API version, and they seem fine. Could you try running the code on other graphics cards, such as a 1080 or Titan X(p)? I haven't tested the code on Kepler GPUs (e.g., K80), and we need to debug to see whether this issue is GPU-architecture-related.
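(To gather these details in one place, standard PyTorch calls like the following should work; a minimal sketch:)

```python
import torch

print(torch.__version__)               # PyTorch version, e.g. 0.4.1
print(torch.version.cuda)              # CUDA runtime version PyTorch was built with
print(torch.backends.cudnn.version())  # cuDNN version, e.g. 7104
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))        # graphics card type
    print(torch.cuda.get_device_capability(0))  # compute capability (architecture)
```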
Hi,

I encountered an exception in the data loading process: https://github.com/LuoweiZhou/densecap/blob/master/data/anet_dataset.py#L146

I got this:
```
Traceback (most recent call last):
  File "/s1_md0/v-botsh/anaconda/py3.6_torch0.4.0/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/s1_md0/v-botsh/anaconda/py3.6_torch0.4.0/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/s1_md0/v-botsh/anaconda/py3.6_torch0.4.0/lib/python3.6/multiprocessing/pool.py", line 108, in worker
    task = get()
  File "/s1_md0/v-botsh/anaconda/py3.6_torch0.4.0/lib/python3.6/multiprocessing/queues.py", line 337, in get
    return _ForkingPickler.loads(res)
  File "/s1_md0/v-botsh/anaconda/py3.6_torch0.4.0/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 95, in rebuild_storage_cuda
    torch.cuda._lazy_init()
  File "/s1_md0/v-botsh/anaconda/py3.6_torch0.4.0/lib/python3.6/site-packages/torch/cuda/__init__.py", line 159, in _lazy_init
    "Cannot re-initialize CUDA in forked subprocess. " + msg)
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
```
Any ideas on how to fix it?
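(As the error message itself suggests, the usual fix is to select the 'spawn' start method before any worker is created; a minimal sketch, assuming the pool setup can be moved under a `__main__` guard:)

```python
import torch.multiprocessing as mp

def worker(i):
    # Under 'spawn', each child starts a fresh interpreter and can
    # initialize CUDA itself instead of inheriting a forked CUDA context.
    return i * i

if __name__ == "__main__":
    mp.set_start_method("spawn")  # must run before any Pool/Process is created
    with mp.Pool(processes=2) as pool:
        print(pool.map(worker, range(4)))
```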