DiamondLightSource / httomo

High-throughput tomography pipeline
https://diamondlightsource.github.io/httomo/

`DezingingWrapper` keeping projections in GPU memory while transferring darks/flats to GPU memory causes CUDA OOM #245

Closed yousefmoazzam closed 6 months ago

yousefmoazzam commented 6 months ago

On GPUs whose memory is small, yet large enough to hold all darks plus all flats, but not necessarily while also holding a block (depending on the block size), the `remove_outlier` method fails with a CUDA OOM error.

With 20GB of data, pc0074 (which has a GPU with 2GB of memory) is able to hold all darks and all flats in GPU memory. However, if the block size doesn't take the size of the darks/flats into account, the block will be made too large for all the darks and all the flats to fit in GPU memory alongside it.
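
To make the failure mode concrete, here is a purely illustrative back-of-envelope check (all sizes are hypothetical, not measured from the actual run):

```python
gpu_mem = 2.0  # GB available on pc0074's GPU (from the description above)

# Hypothetical sizes, for illustration only
darks_flats = 0.4  # GB needed to hold all darks + all flats
block = 1.8        # GB for a block sized against gpu_mem alone

# The block alone fits, and the darks/flats alone fit...
assert block <= gpu_mem and darks_flats <= gpu_mem
# ...but not all of them together -> CUDA OOM inside the wrapper
assert block + darks_flats > gpu_mem
```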

In `feature/transparent-file-store`, the `DezingingWrapper` currently keeps a block, all darks, and all flats in GPU memory before execution is returned to the task runner (which would transfer data to CPU when it needs to be written to the data store): https://github.com/DiamondLightSource/httomo/blob/d9afc1f04e29a375cee17d33033d75f47a99da34/httomo/method_wrappers/dezinging.py#L57-L62
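
For reference, a paraphrased sketch of that behaviour (not the actual wrapper code; `remove_outlier` stands in for the wrapped httomolibgpu method):

```python
# Paraphrase of the linked DezingingWrapper execution: projections, darks,
# and flats are all dezingered while resident on the GPU, so all three
# (plus the method's outputs) must fit in GPU memory simultaneously.
def execute(self, block):
    block.data = remove_outlier(block.data)    # projections, on GPU
    block.darks = remove_outlier(block.darks)  # darks pulled onto the GPU too
    block.flats = remove_outlier(block.flats)  # flats pulled onto the GPU too
    return block  # only now can the runner move anything back to CPU
```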

This seems to disagree with the current memory estimation for `remove_outlier` (which was recently changed in #239), where accounting for darks/flats appears to be absent: https://github.com/DiamondLightSource/httomo/blob/d9afc1f04e29a375cee17d33033d75f47a99da34/httomo/methods_database/packages/external/httomolibgpu/1.2/httomolibgpu.yaml#L12-L20

It seems that either:

  1. the memory estimation for `remove_outlier` needs to account for the darks/flats, or
  2. the projections need to be transferred to CPU before the darks are transferred to GPU, and the darks need to be transferred back to CPU (after processing) before the flats are transferred to GPU (see the sketch after this list)
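
A rough sketch of what option 2 could look like inside the wrapper (hypothetical code, using CuPy directly rather than the real block API):

```python
import cupy as cp

def execute(self, block):
    # dezinger projections while they are on the GPU, then move the result
    # to CPU to make room for the darks
    data_cpu = cp.asnumpy(remove_outlier(block.data))

    # dezinger darks, move the result to CPU, release the GPU copy
    darks_cpu = cp.asnumpy(remove_outlier(cp.asarray(block.darks)))
    cp.get_default_memory_pool().free_all_blocks()

    # dezinger flats last, now that the darks no longer occupy GPU memory
    flats_cpu = cp.asnumpy(remove_outlier(cp.asarray(block.flats)))
    ...
```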
yousefmoazzam commented 6 months ago

The second option may be possible in `DezingingWrapper`. However, naively transferring the block to CPU via `block.to_cpu()` after the projections have been processed (in order to move the projections off the GPU before the darks are processed) causes the block as a whole to be seen as "on the CPU", because the `block.is_gpu` getter is defined in terms of the `block.is_cpu` getter: https://github.com/DiamondLightSource/httomo/blob/d9afc1f04e29a375cee17d33033d75f47a99da34/httomo/runner/dataset.py#L108-L110

which in turn checks whether `self._data` has a `device` attribute: https://github.com/DiamondLightSource/httomo/blob/d9afc1f04e29a375cee17d33033d75f47a99da34/httomo/runner/dataset.py#L104-L106

`self._data` holds the projections, and these are what `block.to_cpu()` transfers to CPU.

This in turn causes the `block.darks` getter to not transfer the darks to GPU, because it uses `block.is_gpu` internally to decide whether or not to transfer, and `block.is_gpu` is `False` after the `block.to_cpu()` call that moved the projections to CPU.
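
A minimal paraphrase of the interplay described above (property bodies reconstructed from the description, not copied from `dataset.py`):

```python
import cupy as cp

class Block:
    # Paraphrase of the relevant parts of the block/dataset object
    def __init__(self, data, darks, flats):
        self._data = data  # projections
        self._darks = darks
        self._flats = flats

    @property
    def is_cpu(self) -> bool:
        # the whole block is judged by the projections array alone:
        # NumPy arrays have no `device` attribute, CuPy arrays do
        return not hasattr(self._data, "device")

    @property
    def is_gpu(self) -> bool:
        return not self.is_cpu

    @property
    def darks(self):
        # darks are migrated to the GPU only when the block counts as
        # "on GPU"; after to_cpu() has moved the projections, they stay put
        if self.is_gpu:
            self._darks = cp.asarray(self._darks)
        return self._darks

    def to_cpu(self):
        # moves only the projections, yet is_gpu now reports False for the
        # whole block, so the darks getter above stops transferring
        self._data = cp.asnumpy(self._data)
```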

yousefmoazzam commented 6 months ago

Note that even with 9f60287, an experimental attempt to transfer and process the projections, darks, and flats one-by-one in sequence, the `remove_outlier` method still requires enough GPU memory to hold both the unprocessed input array and its processed output at the same time, so the available GPU memory needs to be able to hold double the size of the darks or flats.
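
The doubling would come from the method allocating a separate output array rather than working in place (an assumption consistent with the "double the size" observation above):

```python
import cupy as cp

darks_gpu = cp.asarray(darks)        # 1x the size of the darks on the GPU
cleaned = remove_outlier(darks_gpu)  # output allocated too: peak usage is 2x
del darks_gpu                        # the input can only be freed afterwards
```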

dkazanc commented 6 months ago

The solution taken for this is in this commit: the darks and flats are processed down to 2D arrays on the CPU, and these much smaller arrays can then be transferred to the dezinger on the GPU at the same time as the blocks to be processed.
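
A sketch of that approach, assuming a simple mean over the frames axis (the actual commit may do something different):

```python
import numpy as np

# Reduce the 3D stacks of dark/flat frames to single 2D images on the CPU;
# only these small arrays then need to travel to the GPU with each block.
darks_2d = np.mean(darks, axis=0, dtype=np.float32)
flats_2d = np.mean(flats, axis=0, dtype=np.float32)
```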