andreabassi78 / napari-sim-processor

Napari plugin for the reconstruction of Structured Illumination microscopy (SIM) data
BSD 3-Clause "New" or "Revised" License

Problem with the plugin after Torch memory error #12

Closed andreabassi78 closed 2 years ago

andreabassi78 commented 2 years ago

I had problems when processing large stacks with Pytorch. This is unavoidable because my GPU has little memory. However, when I go back to Numpy processing (selecting the Combo_box) I still get an error related to Torch (see the traceback below). Apparently, some Torch functions are also called when Numpy is used. I believe this happens because Pytorch is available on my system.

File "C:\Users\andrea\OneDrive - Politecnico di Milano\Documenti\PythonProjects\NapariAppsDeployed\napari-sim-processor\src\napari_sim_processor_sim_widget.py", line 769, in calibration self.h.calibrate_pytorch(imRaw,self.find_carrier.val) File "C:\Users\andrea\OneDrive - Politecnico di Milano\Documenti\PythonProjects\NapariAppsDeployed\napari-sim-processor\src\napari_sim_processor\baseSimProcessor.py", line 118, in calibrate_pytorch self._calibrate(img, findCarrier, useTorch=True) File "C:\Users\andrea\OneDrive - Politecnico di Milano\Documenti\PythonProjects\NapariAppsDeployed\napari-sim-processor\src\napari_sim_processor\baseSimProcessor.py", line 124, in _calibrate self._allocate_arrays() File "C:\Users\andrea\OneDrive - Politecnico di Milano\Documenti\PythonProjects\NapariAppsDeployed\napari-sim-processor\src\napari_sim_processor\baseSimProcessor.py", line 97, in _allocate_arrays self._carray_torch = torch.zeros((self._nsteps, 2 * self.N, self.N + 1), dtype=torch.complex64, device=self.tdev) RuntimeError: CUDA out of memory. Tried to allocate 30.00 MiB (GPU 0; 2.00 GiB total capacity; 437.24 MiB already allocated; 0 bytes free; 480.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

pptman commented 2 years ago

It looks like you are still calling calibrate_pytorch() even though pytorch is not selected. Looking at the code, I cannot see why this should happen.

However, the CUDA out of memory error occurs because torch has already allocated a lot of GPU memory and not released it properly. One of the problems with both torch and cupy is that once they have allocated GPU memory they hang onto it, even after the arrays involved have gone out of scope and been garbage collected. They also use some (or most!) of this memory for FFT plans, including work arrays, which cupy in particular likes to hold on to. There is a method in baseSimProcessor called empty_cache() that does its best to clear up all this memory usage. The question is when to call it.
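For reference, a minimal sketch of the kind of cleanup such an empty_cache() method can perform; this is illustrative only, not the actual baseSimProcessor code:

    import gc

    def empty_cache():
        gc.collect()  # drop Python-side references first
        try:
            import torch
            torch.cuda.empty_cache()  # release cached blocks held by PyTorch's allocator
        except ImportError:
            pass
        try:
            import cupy
            cupy.get_default_memory_pool().free_all_blocks()         # CuPy device memory pool
            cupy.get_default_pinned_memory_pool().free_all_blocks()  # pinned host memory pool
            cupy.fft.config.get_plan_cache().clear()                 # cached FFT plans and their work areas
        except ImportError:
            pass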

At the moment it is called at the beginning of both the batchreconstructcompact_torch() and batchreconstructcompact_cupy() methods, to make space for them, since they are the ones that need lots of memory. Because emptying the cache takes a significant number of milliseconds, it is not really a good idea to call it everywhere, so there are a number of options:

  1. Call it as part of the calibration routine. Your crash was actually in _allocate_arrays(), which always pre-allocates some small working arrays on the GPU.
  2. Call it at the end of the batch reconstruct methods as well as at the beginning, although this only helps if the method completes successfully.
  3. Try to catch the CUDA out-of-memory exception and empty the arrays then, either within baseSimProcessor or within _Sim_Widget.

I think that I will try a combination of all three, using a try/except in the batch reconstruction methods to clear the cache on the exception and then re-raise it, so that _Sim_Widget knows an error occurred.
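A rough sketch of that plan; the method and attribute names here are illustrative rather than the exact ones in the repository:

    def batchreconstructcompact_torch(self, img):
        self.empty_cache()                          # current behaviour: free memory up front
        try:
            result = self._reconstruct_torch(img)   # hypothetical inner reconstruction call
        except Exception:
            self.empty_cache()                      # option 3: clear the cache on failure
            raise                                   # re-raise so _Sim_Widget sees the error
        self.empty_cache()                          # option 2: also clear after a successful run
        return result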

At the same time it would be good to work out why calibrate_pytorch() was called in the first place even though numpy was selected.

andreabassi78 commented 2 years ago

I checked again, and it was my mistake: I had made a change to the calibration method that excluded numpy. There is no problem with the current version on GitHub. I agree that it would be useful to clear the memory. Catching the RuntimeError should work fine; I would do it in baseSimProcessor to keep the widget a bit more readable. In any case, I am changing the label of this issue to 'Enhancement'. I was actually the one causing the bug, very sorry!

pptman commented 2 years ago

It is probably still worth fixing, as we are currently not doing anything to release the GPU memory once we are done with it.

I have created a branch called gpumemoryclear to look at this. I am trapping the exceptions in baseSimProcessor, but there are still problems clearing the GPU memory after the exception. It works relatively well with Pytorch, but with Cupy the allocated GPU memory does not seem to be recoverable after the error, and the only solution seems to be to restart.

Interestingly, the two libraries raise different exceptions: Pytorch raises RuntimeError, while Cupy raises OutOfMemoryError.
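For illustration, catching them separately looks something like this (not code from the repository):

    import cupy

    def run_gpu_step(fn, *args):
        try:
            return fn(*args)
        except RuntimeError as err:                       # what PyTorch raises on "CUDA out of memory"
            print(f"PyTorch GPU error: {err}")
        except cupy.cuda.memory.OutOfMemoryError as err:  # CuPy raises its own OutOfMemoryError
            print(f"CuPy GPU error: {err}")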

pptman commented 2 years ago

I have raised an issue on the CuPy github to see if there is any answer to the CuPy memory problem.

pptman commented 2 years ago

I believe I now have a solution. Some suggestions were made in the CuPy discussion mentioned above, but that issue is still unresolved. Basically, I could not clear the CUDA memory properly while an exception was in progress. I have worked around this by encapsulating the batchreconstructcompact methods in separate wrapper functions: the inner method returns a message when an exception has occurred, and the wrapper then raises a new exception (actually an assert) once the CUDA memory has been cleared. Things should fail more gracefully now. I have merged the changes back into the main branch and will close this issue.
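A minimal sketch of that wrapper pattern, with hypothetical names; the actual code in the repository may differ:

    def _batchreconstructcompact_torch_inner(self, img):
        # Original method body: catch the failure and report it instead of raising,
        # so no exception is in progress while the GPU memory is cleared.
        try:
            return self._reconstruct_torch(img), None     # hypothetical inner call
        except RuntimeError as err:
            return None, str(err)

    def batchreconstructcompact_torch(self, img):
        result, error = self._batchreconstructcompact_torch_inner(img)
        if error is not None:
            self.empty_cache()                            # memory can now be released cleanly
            assert False, f"GPU reconstruction failed: {error}"  # surface the failure to the widget
        return result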