Closed: yousefmoazzam closed 3 months ago
This has been addressed in #393. Specifically, I've been tracing all individual allocations and dug into the astra toolbox as well. The estimator code has been adjusted to better reflect the original code structure. The full details are given below for a test case with input data of size (1801, 5, 2560), i.e. 5 slices in sinogram view. The recorded allocations are as follows:
In the `filtersinc` function, these are the allocations that happen:

- `swapaxis`, which allocates `input_size` again: 92,211,200 bytes
- a `cudaArray` of the same size as the input, for texture access. This is an allocation of around the input size (give or take a few bytes), and we assume that is the case because it could not be tracked with the memory hook
- a `1200 x 1200 int64` array, which we'll use for the estimates as a fixed cost independent of slices
- 28,800,000 bytes, i.e. 1200 x 1200 x 5 x float32
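For reference, a minimal sketch of how those numbers follow from the array shapes quoted above (the function name `fbp_alloc_breakdown` is hypothetical and only does the byte arithmetic):

```python
import numpy as np

def fbp_alloc_breakdown(angles=1801, slices=5, det_x=2560, recon_size=1200):
    """Hypothetical bookkeeping of the allocations listed above, in bytes."""
    f32 = np.dtype(np.float32).itemsize
    i64 = np.dtype(np.int64).itemsize
    # `swapaxis` copy of the float32 input: 1801 x 5 x 2560 x 4 = 92,211,200
    input_copy = angles * slices * det_x * f32
    # cudaArray for texture access: assumed to be roughly the input size again
    cuda_array = input_copy
    # fixed-cost 1200 x 1200 int64 array, independent of the slice count
    fixed_cost = recon_size * recon_size * i64
    # reconstruction output: 1200 x 1200 x 5 x float32 = 28,800,000
    output = recon_size * recon_size * slices * f32
    return {"input_copy": input_copy, "cuda_array": cuda_array,
            "fixed_cost": fixed_cost, "output": output}

print(fbp_alloc_breakdown())
```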
As part of #361, the FBP memory estimation was improved to more accurately represent the memory allocations that the FBP method does. However, there is still some memory being allocated, and it is not yet known where in the method this happens.
For now, a multiplier of 2 on the output of the 1D RFFT has been added to bump up the memory estimate: https://github.com/DiamondLightSource/httomo/blob/0b5f020175b23388da9bea5940b1e246c6e80be0/httomo/methods_database/packages/external/httomolibgpu/supporting_funcs/recon/algorithm.py#L103-L112
This allows 80GB of data to be put through the method in httomo without issue.
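For context, a minimal sketch of the idea behind that multiplier (this is not the code at the link above; `estimate_rfft_bytes` and its parameters are made up for illustration, and it assumes the single-precision path where a float32 RFFT yields complex64 output):

```python
import numpy as np

def estimate_rfft_bytes(angles, slices, det_x, multiplier=2):
    """Bytes attributed to the 1D RFFT step, with a safety multiplier.

    A real-to-complex FFT along the detector axis of a float32 sinogram of shape
    (angles, slices, det_x) produces (angles, slices, det_x // 2 + 1) complex64
    values; the multiplier of 2 bumps the estimate to cover the extra allocation
    whose origin is not yet known.
    """
    rfft_output = angles * slices * (det_x // 2 + 1) * np.dtype(np.complex64).itemsize
    return multiplier * rfft_output

# For the (1801, 5, 2560) test case mentioned above:
print(estimate_rfft_bytes(1801, 5, 2560))
```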
However, from observations made with cupy's `LineProfileHook`, it was determined that the 1D RFFT is most likely not creating more than one array. Therefore, more investigation is needed to determine what is causing the extra memory to be allocated.
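As a rough illustration of that kind of check (shapes taken from the test case above; this is only a sketch, not the exact profiling code that was used), CuPy's `LineProfileHook` can be wrapped around the RFFT call like this:

```python
import cupy as cp
from cupy.cuda import memory_hooks

# A float32 block with the projection-data shape from the test case above.
data = cp.zeros((1801, 5, 2560), dtype=cp.float32)

hook = memory_hooks.LineProfileHook()
with hook:
    # 1D real-to-complex FFT along the detector axis, as in the filtering step.
    freq = cp.fft.rfft(data, axis=2)

# Prints a per-line breakdown of bytes allocated from the CuPy memory pool,
# which shows whether the RFFT call itself created more than one array.
hook.print_report()
```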