bkainz / fetalReconstruction

GPU accelerated source code for motion compensation of multi-stack MRI data

CUDA Error in SimulateSlices__(), line 2652: an illegal memory access was encountered #13

Open kaltu opened 6 years ago

kaltu commented 6 years ago

I am trying to use your SVRreconstructionGPU to reconstruct an isotropic volume from 3 t2-tse volumes. The data is from the Cancer Imaging Archive, PROSTATEx challenge.

For some patients, for example number 000, if I pass the sagittal stack as the first argument to -i, I get the CUDA memory access error mentioned in the title:

CUDA Error in SimulateSlices__(), line 2652: an illegal memory access was encountered

If I set the --debug_gpu flag, it becomes:

CUDA error at /homes/bkainz/cudarecon/source/reconstructionGPU2/reconstruction_cuda2.cu:5110 code=77(cudaErrorIllegalAddress) "cudaMemcpy(addon, dev_addon_[0].data, dev_addon_[0].size.x*dev_addon_[0].size.y*dev_addon_[0].size.z*sizeof(float), cudaMemcpyDeviceToHost)"

With one of the other planes, namely transverse or coronal, as the first argument, it works fine.

More precisely, if I run

~/fetalReconstruction/bin/linux64/SVRreconstructionGPU -o sag.nii -i ./3-t2tsesag-87368_t2_tse_sag_20110707114731_3.nii ./4-t2tsetra-00702_t2_tse_tra_20110707114731_4.nii ./5-t2tsecor-03471_t2_tse_cor_20110707114731_5.nii --debug_gpu --debug 1 &> ConsoleLog.txt

the output is: ConsoleLog.txt

Other logs are: log-registration-error.txt, log-reconstruction.txt, log-registration.txt

The whole working directory, including input and intermediate output, is compressed as: PROSTATEx_000.tar.gz

bkainz commented 6 years ago

Hmm, good question that might be difficult to answer. Sorry! Did you check if you perhaps run out of GPU memory? It looks like you are using the binaries without recompiling. There have been many CUDA and NVIDIA driver versions since we published this, so I would try to recompile the code on CUDA 9 or 10 and start debugging from there. Unfortunately, we are lacking the resources to do proper code maintenance. It would also be good to find somebody who would kindly volunteer to assemble a Docker container to make this implementation of the algorithm more future-proof.

kaltu commented 6 years ago

> Did you check if you perhaps run out of GPU memory?

Yes, I checked. I have a GTX 1080 Ti with 11 GB of VRAM, and SVR typically uses at most 1 GB of it.
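A quick way to double-check memory use while the reconstruction runs is to poll nvidia-smi. The helper below is only a sketch: it assumes nvidia-smi is on the PATH, and the function name gpu_memory_csv is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print name, total memory and used memory
# for every visible GPU as one CSV line per device.
# Returns 127 if nvidia-smi is not installed.
gpu_memory_csv() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found on PATH" >&2
        return 127
    fi
    nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv,noheader
}

# Example: poll once a second in a second terminal while
# SVRreconstructionGPU is running (stop with Ctrl-C):
#   while sleep 1; do gpu_memory_csv; done
```

If the used-memory column stays far below the total right up to the crash, running out of GPU memory is an unlikely cause.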

> It looks like you are using the binaries without recompiling. There have been many CUDA and NVIDIA driver versions since we published this, so I would try to recompile the code on CUDA 9 or 10 and start debugging from there.

I have now recompiled the source code provided in the source directory with CUDA 10.0 and NVIDIA driver 410.73 on Ubuntu 18.04. The problem still persists.

> Unfortunately, we are lacking the resources to do proper code maintenance. It would also be good to find somebody who would kindly volunteer to assemble a Docker container to make this implementation of the algorithm more future-proof.

It looks like a bug in the implementation itself. I recompiled it and the behavior did not change, so I am not sure a Docker container would help in this case. The errors are now:

CUDA Error in SimulateSlices__(), line 2656: an illegal memory access was encountered

and, with --debug_gpu:

CUDA error at /home/ka/projects/fetalReconstruction-master/source/reconstructionGPU2/reconstruction_cuda2.cu:5114 code=77(cudaErrorIllegalAddress) "cudaMemcpy(addon, dev_addon_[0].data, dev_addon_[0].size.x*dev_addon_[0].size.y*dev_addon_[0].size.z*sizeof(float), cudaMemcpyDeviceToHost)"
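The cudaMemcpy named in the --debug_gpu output is where the error is reported, not necessarily where it happened. CUDA kernels run asynchronously, so an illegal access inside an earlier kernel usually surfaces at the next synchronizing call. A sketch of how one might narrow this down, assuming a memory checker from the CUDA toolkit is installed (cuda-memcheck ships with CUDA 9 and 10, compute-sanitizer with newer toolkits); the function names find_memcheck and run_checked are made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: print the name of whichever CUDA memory
# checker is installed, or return 1 if neither is found.
find_memcheck() {
    for tool in compute-sanitizer cuda-memcheck; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            return 0
        fi
    done
    return 1
}

# Hypothetical helper: run a command with synchronous kernel launches,
# wrapped in the memory checker when one is available.
run_checked() {
    checker=$(find_memcheck) || checker=""
    # CUDA_LAUNCH_BLOCKING=1 makes every kernel launch synchronous, so an
    # error is reported at the faulty launch instead of a later call.
    # $checker is left unquoted on purpose: if empty, it expands to nothing.
    CUDA_LAUNCH_BLOCKING=1 $checker "$@"
}

# Example, with the input stacks in the same order as before:
#   run_checked ~/fetalReconstruction/bin/linux64/SVRreconstructionGPU \
#       -o sag.nii -i sag.nii tra.nii cor.nii --debug_gpu
```

The checker slows execution considerably, but it typically names the kernel and the thread that performed the out-of-bounds access.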

bkainz commented 6 years ago

yes, a docker container would be great! I'll try to convert one of our old servers into a container as soon as I find time.

kaltu commented 6 years ago

I ran into the bug on another computer with a 1080 Ti and CUDA 9.2. Then, in frustration, I pulled out the SSD and put it into my old laptop with a 950M. And voilà! Same OS, same driver, same CUDA, same data and same command; only the card changed from the 1080 Ti to the good old 950M. It works, much more slowly, but it works.

I am guessing the 11 GB of VRAM on the 1080 Ti triggers a memory address calculation bug in SVR, which isn't an issue with older cards that have 4, 6 or 8 GB of VRAM.

Can you check whether cards with an odd number of gigabytes of VRAM are handled correctly? The 1080 Ti is the only such card I have. I hope someone with a 2080 Ti or the 3 GB version of the 1060 can confirm whether this is the case.

If that is the case, then creating a docker container may not help at all.

bkainz commented 6 years ago

great! thanks for sharing! I guess the problem is rather caused by different compute architectures. I'll also put this on the TODO list.

lindehesse commented 5 years ago

Is there already a (temporary) solution for this problem? I encountered the same problem when using the AppImage file. I have a Titan XP with 12 GB.

bkainz commented 5 years ago

no, sorry, I haven't had time to look into this yet. any help would be appreciated. Did you find a workaround in the meantime?

dittothat commented 5 years ago

I thought that a containerized version of this code would expedite getting it set up for use at our site and others. I wrote a Dockerfile starting from an NVIDIA Docker runtime container and compile the code therein. This is the current image on Docker Hub. However, I get the error that started this thread when I run SVR... and PVR... also throws an error. I felt like I was really close to getting a more portable version of this software up and running, but it seems that newer compute architectures give it grief even when the code is containerized. I thought an older CUDA version (starting with a CUDA 8.0 container) might work, but it did not help with the error. I am really interested in using this software, and if we can get a Docker image running it would be much more portable for anyone who wants to run it. I missed the discussion of the AppImage version. I would be happy with anything that is portable and that works! Thanks.

bkainz commented 5 years ago

Thank you very much for your great effort! Did you update the CMakeLists.txt file with the new compute capability (CP) archs? If so and it still doesn't work, it might well be that some parts unintentionally exploited some (old) CP-specific features. I am planning a student project about this for next term, properly combining it with SVRNet. However, there are no real resources for maintenance in academia, sorry.
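For anyone trying the CMakeLists.txt route: nvcc selects target architectures with -gencode flags. The cards in this thread are compute capability 6.1 (GTX 1080 Ti, Titan Xp) and 5.0 (GTX 950M). The helper below is a sketch that only builds the flag string; gencode_flag is a made-up name, and passing the result through CUDA_NVCC_FLAGS assumes the build uses CMake's FindCUDA module, which has not been verified against this repository.

```shell
#!/bin/sh
# Hypothetical helper: turn a compute capability such as "6.1"
# into the matching nvcc -gencode flag.
gencode_flag() {
    cc=$(printf '%s' "$1" | tr -d '.')
    printf '%s arch=compute_%s,code=sm_%s\n' "-gencode" "$cc" "$cc"
}

gencode_flag 6.1   # prints: -gencode arch=compute_61,code=sm_61
gencode_flag 5.0   # prints: -gencode arch=compute_50,code=sm_50

# Assumed usage from a build directory (unverified for this project):
#   cmake -DCUDA_NVCC_FLAGS="$(gencode_flag 6.1)" ..
```

Since the same build ran on the 950M and crashed on the 1080 Ti, comparing builds targeted at sm_50 and sm_61 may help show whether the crash follows the architecture.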