NVIDIA / vid2vid

Pytorch implementation of our method for high-resolution (e.g. 2048x1024) photorealistic video-to-video translation.

train.py: THCudaCheck FAIL file=/pytorch/aten/src/THC/THCTensorCopy.cu line=102 error=77 : an illegal memory access was encountered #114

Open linkAmy opened 5 years ago

I am trying to train a model on the FaceForensics dataset and, unfortunately, I get the following error.

-------------- End ----------------
CustomDatasetDataLoader
dataset [FaceDataset] was created
295
295
#training videos = 295
vid2vid
---------- Networks initialized -------------
-----------------------------------------------
---------- Networks initialized -------------
-----------------------------------------------
create web directory xxxx/vid2vid-checkpoints/edge2face_512_0626/web...
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCTensorCopy.cu line=102 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
  File "train.py", line 329, in <module>
    train()
  File "train.py", line 104, in train
    fake_B, fake_B_raw, flow, weight, real_A, real_Bp, fake_B_last = modelG(input_A, input_B, inst_A, fake_B_last)
  File "xxxx/venvpy3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "xxxx/venvpy3/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 121, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "xxxx/venvpy3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "xxxx/vid2vid/models/vid2vid_model_G.py", line 128, in forward
    fake_B, fake_B_raw, flow, weight = self.generate_frame_train(netG, real_A_all, fake_B_prev, start_gpu, is_first_frame)
  File "xxxx/vid2vid/models/vid2vid_model_G.py", line 184, in generate_frame_train
    fake_B_pyr[si] = self.concat([fake_B_pyr[si], fake_B.unsqueeze(1).cuda(dest_id)], dim=1)
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/aten/src/THC/THCTensorCopy.cu:102
terminate called after throwing an instance of 'at::Error'
  what(): CUDA error: invalid device pointer (CudaCachingDeleter at /pytorch/aten/src/THC/THCCachingAllocator.cpp:498)
frame #0: THStorage_free + 0x44 (0x7f209dead0d4 in xxxx/venvpy3/lib/python3.5/site-packages/torch/lib/libcaffe2.so)
frame #1: THTensor_free + 0x2f (0x7f209df4c7df in xxxx/venvpy3/lib/python3.5/site-packages/torch/lib/libcaffe2.so)
frame #2: at::CUDAFloatTensor::~CUDAFloatTensor() + 0x9 (0x7f1feb255579 in xxxx/venvpy3/lib/python3.5/site-packages/torch/lib/libcaffe2_gpu.so)
frame #3: torch::autograd::Variable::Impl::~Impl() + 0x291 (0x7f20a5fb9411 in xxxx/venvpy3/lib/python3.5/site-packages/torch/_C.cpython-35m-x86_64-linux-gnu.so)
frame #4: torch::autograd::Variable::Impl::~Impl() + 0x9 (0x7f20a5fb9589 in xxxx/venvpy3/lib/python3.5/site-packages/torch/_C.cpython-35m-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x777989 (0x7f20a5fd2989 in xxxx/venvpy3/lib/python3.5/site-packages/torch/_C.cpython-35m-x86_64-linux-gnu.so)
frame #6: <unknown function> + 0x777a34 (0x7f20a5fd2a34 in xxxx/venvpy3/lib/python3.5/site-packages/torch/_C.cpython-35m-x86_64-linux-gnu.so)
frame #7: python() [0x58170a]
frame #8: python() [0x57ffda]
frame #9: python() [0x57ffda]
frame #10: python() [0x4e96b7]
frame #11: python() [0x557927]
frame #12: python() [0x55793d]
frame #13: python() [0x55793d]
frame #14: python() [0x55793d]
<omitting python frames>
frame #21: __libc_start_main + 0xf0 (0x7f20b0926830 in /lib/x86_64-linux-gnu/libc.so.6)

The command I am using:

python train.py --checkpoints_dir ~/vid2vid-checkpoints --name edge2face_512_0626 --dataroot ~/face/ --dataset_mode face --input_nc 15 --loadSize 512 --num_D 3 --gpu_ids 0,1,2,3 --n_gpus_gen 2 --n_frames_total 24

Has anyone run into this problem before? It seems to happen randomly: the last time I trained the model, it worked fine. Thanks for any reply.
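
Because CUDA reports errors asynchronously, the Python line named in the traceback may not be the operation that actually failed. One way to narrow this down might be to rerun the exact command above with the `CUDA_LAUNCH_BLOCKING=1` environment variable set, which makes kernel launches synchronous. The sketch below only prepares such a rerun; the helper name `build_debug_run` is hypothetical and not part of vid2vid, and whether this localizes the fault for this particular error is an assumption.

```python
import os
import shlex

# The training command quoted in this report, unchanged.
TRAIN_CMD = (
    "python train.py --checkpoints_dir ~/vid2vid-checkpoints "
    "--name edge2face_512_0626 --dataroot ~/face/ --dataset_mode face "
    "--input_nc 15 --loadSize 512 --num_D 3 --gpu_ids 0,1,2,3 "
    "--n_gpus_gen 2 --n_frames_total 24"
)


def build_debug_run(cmd=TRAIN_CMD, base_env=None):
    """Hypothetical helper: return (argv, env) for a synchronous-launch rerun.

    CUDA_LAUNCH_BLOCKING=1 is a standard CUDA environment variable; it slows
    training down, so it is meant for debugging runs only.
    """
    # Copy the environment so the caller's os.environ is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    # shlex.split does not expand "~", so expand it for each token.
    argv = [os.path.expanduser(token) for token in shlex.split(cmd)]
    return argv, env
```

From the repository root, `argv, env = build_debug_run()` followed by `subprocess.run(argv, env=env)` would launch the rerun; if the error reappears, the traceback should then point closer to the failing operation.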