VolkerH / Lattice_Lightsheet_Deskew_Deconv

Open-source, GPU accelerated code for deskewing and deconvolving lattice light sheet data

Not enough GPU ram ... make a "slow" option #31

Closed VolkerH closed 5 years ago

VolkerH commented 5 years ago

Both gputools (which uses reikna for its FFTs) and flowdec run out of GPU memory when trying to process a stack of size (151, 800, 600).

Depending on what exactly I am trying to do, the error message in tensorflow either shows up when initializing the batch cuFFT plan or later on when allocating space for a tensor.

One of the error messages I saw indicated that it was trying to allocate a tensor of size 256, 1024, 1024. When I crop the volume by 23 pixels in Z (this would correspond to 128 z slices), everything works fine. When I only crop 22 pixels, it fails.

flowdec rounds the sizes up to the next size at which the fastest FFT can be performed. This appears to be very generous rounding. It would be nice to be able to trade some speed for the ability to process such volumes. I should look into adding an option to round up to the next size for which an FFT can be performed at all, even if it is not optimal in terms of speed.

VolkerH commented 5 years ago

Also tested whether real_mode_fft requires less video RAM, but this is not the case.

Coincidentally, I just noticed this: https://github.com/tlambert03/pycudadecon/issues/7

So I guess I don't need to test with pycudadecon either.

VolkerH commented 5 years ago

This is the code I need to look at:

def optimize_dims(dims, mode) in https://github.com/hammerlab/flowdec/blob/master/python/flowdec/fft_utils_tf.py

One option would be to fall back to Bluestein's algorithm for arrays that are too large, as it works for arbitrary sizes without padding.

VolkerH commented 5 years ago

Turns out that was easy. The mode that gets passed into optimize_dims can be set when initializing the deconvolver class in flowdec using the named argument pad_mode. In the specific example mentioned above this allows deconvolution on the GPU. Haven't benchmarked it, but it doesn't seem much slower. However, I appear to get more artefacts at the image boundary.

Will add this as a command-line option.
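
For reference, a minimal sketch of what setting this looks like (assuming flowdec's Python API of that era; `img` and `psf` are placeholder 3D numpy arrays, not from this repo):

```python
# Minimal sketch: turn off flowdec's automatic padding via pad_mode.
from flowdec import data as fd_data
from flowdec import restoration as fd_restoration

algo = fd_restoration.RichardsonLucyDeconvolver(3, pad_mode='none').initialize()
result = algo.run(fd_data.Acquisition(data=img, kernel=psf), niter=10).data
```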

dmilkie commented 5 years ago

You should always pad by at least the PSF extent. The FFT essentially “wraps” the edges together: if you don’t pad, whatever is at the top edge will bleed through and appear at the bottom. The best is “mirror” padding; zero padding will create ringing artifacts.

-Dan
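
A minimal numpy sketch of that kind of mirror padding (illustrative only; `img` and `psf` are placeholders, and half the PSF extent per side is assumed as the pad width):

```python
import numpy as np

def pad_by_psf_extent(img, psf):
    # Mirror-pad each axis by half the PSF extent on both sides, so the
    # FFT's wrap-around sees a smooth continuation instead of a hard edge.
    pad = [(s // 2, s // 2) for s in psf.shape]
    return np.pad(img, pad, mode="reflect"), pad

def crop_back(result, pad):
    # Undo the padding after deconvolution.
    return result[tuple(slice(lo, result.shape[i] - hi)
                        for i, (lo, hi) in enumerate(pad))]
```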


VolkerH commented 5 years ago

@dmilkie Thanks for the comment and I agree: that is exactly the type of artefact that I was seeing when disabling padding.

I will need to check a few more things in the flowdec source code. I believe that if I provide images that already have optimal dimensions for the FFT, it will not perform any padding by default (but I might be wrong). So far, most of the stacks I put through had dimensions that were rounded up and padded very generously (I am not sure about the default fill strategy either; I will have to check the source code), so the output always looked quite artefact-free.

dmilkie commented 5 years ago

Nice. I believe there is a “mirror” pad option in there.

Regarding VRAM bloat: another thing to check is the precision of the calculations. If they are double, that’s too much.

A smaller gain would be to check the FFTs: if they are complex-to-complex (instead of complex-to-real and real-to-complex), that’ll cost you. Also, I just implemented a shared workArea in the cudaDecon code so the two FFT plans use the same space (since they execute serially), which saved roughly a single copy of the data (out of the 7 or so it needs). Small gain, but it’s something.

-Dan
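
To illustrate the complex-to-complex vs. real-to-complex point (a numpy illustration only, unrelated to the cuFFT plans flowdec/cudaDecon actually build):

```python
import numpy as np

vol = np.random.rand(64, 256, 256).astype(np.float32)

c2c = np.fft.fftn(vol)    # complex-to-complex: full-size complex output
r2c = np.fft.rfftn(vol)   # real-to-complex: last axis roughly halved

print(c2c.shape, c2c.nbytes)  # (64, 256, 256) -> every frequency stored
print(r2c.shape, r2c.nbytes)  # (64, 256, 129) -> redundant half dropped
```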


eric-czech commented 5 years ago

FWIW, the default behavior is to pad to the next highest power of 2 using the "reflect" mode shown here or to do nothing if the length along any one axis is already a power of 2. I think you probably saw this but you can also pass pad_mode='none' to just turn off any of the automatic padding/cropping if you wanted to take care of that externally.
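
A rough sketch of handling it externally with pad_mode='none' (placeholder code that just mimics the described next-power-of-2 reflect padding outside of flowdec):

```python
import numpy as np

def next_pow2(n):
    return 1 << (n - 1).bit_length()

def pad_to_pow2(img):
    # Reflect-pad every axis up to the next power of 2, remembering the
    # original shape so the result can be cropped back afterwards.
    pad = [(0, next_pow2(s) - s) for s in img.shape]
    return np.pad(img, pad, mode="reflect"), img.shape

def crop_to(result, orig_shape):
    return result[tuple(slice(0, s) for s in orig_shape)]
```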

VolkerH commented 5 years ago

> Nice. I believe there is a “mirror” pad option in there. Regarding VRAM bloat: another thing to check is the precision of the calculations. If they are double, that’s too much. A smaller gain would be to check the FFTs: if they are complex-to-complex (instead of complex-to-real and real-to-complex), that’ll cost you. Also, I just implemented a shared workArea in the cudaDecon code so the two FFT plans use the same space (since they execute serially), which saved roughly a single copy of the data (out of the 7 or so it needs). Small gain, but it’s something.

I am impressed by the number of nice tweaks that are implemented in cudaDecon. I had already looked at whether it is possible to use lower precision (such as float16/complex32) to save VRAM. However, tensorflow does not provide that level of control: it is either complex64 or complex128 (see https://www.tensorflow.org/api_docs/python/tf/signal/fft3d). Also, tensorflow doesn't give much control over FFT plan creation/storage.

I was wondering whether tensorflow could automatically fall back to storing arrays in CPU memory when there is not enough VRAM. From what I understand, this happens for operations which are not supported on the GPU if the option allow_soft_placement is given, but apparently tensorflow does not do this based on memory considerations.
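
For context, the option in question in the TF1-style API of that era (it only governs ops without a GPU kernel, not memory pressure, which matches the observation above):

```python
import tensorflow as tf

# Soft placement moves ops that have no GPU implementation to the CPU;
# it does not spill tensors to host RAM when VRAM runs out.
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
sess = tf.Session(config=config)
```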

> FWIW, the default behavior is to pad to the next highest power of 2 using the "reflect" mode shown here or to do nothing if the length along any one axis is already a power of 2. I think you probably saw this but you can also pass pad_mode='none' to just turn off any of the automatic padding/cropping if you wanted to take care of that externally.

Yes, I did see that, and setting pad_mode='none' allowed me to deconvolve a volume that otherwise gave error messages due to lack of VRAM. To ensure optimum quality whenever possible, one would probably need a hierarchy like in this pseudocode:

increase input size (origsize) by at least half the width of the PSF along each dimension -> (newsize)
increase newsize to the nearest size that is optimal for speed -> (optimalnewsize)
try allocating graph for optimalnewsize
catch vram_exception:
    try allocating graph for newsize
    catch vram_exception:
        try allocating graph for origsize (warn about potential artefacts)
        catch vram_exception:
            as a last resort, fall back to storing in main memory rather than VRAM

This gets complicated rather quickly, but I think I will implement the first part of it to ensure consistent quality regardless of the input size.
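
A minimal Python sketch of the first two levels of such a fallback, assuming flowdec's RichardsonLucyDeconvolver and TensorFlow's OOM exception ('log2' as the name of the default power-of-2 pad mode is an assumption here; `img` and `psf` are placeholders):

```python
import tensorflow as tf
from flowdec import data as fd_data
from flowdec import restoration as fd_restoration

def deconvolve_with_fallback(img, psf, niter=10):
    acq = fd_data.Acquisition(data=img, kernel=psf)
    # Try the speed-optimal padding first, then fall back to no padding.
    for pad_mode in ("log2", "none"):
        try:
            algo = fd_restoration.RichardsonLucyDeconvolver(img.ndim, pad_mode=pad_mode)
            return algo.initialize().run(acq, niter=niter).data
        except tf.errors.ResourceExhaustedError:
            print("Out of VRAM with pad_mode=%r, retrying with less padding" % pad_mode)
    raise MemoryError("Volume does not fit on the GPU even without padding")
```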

dmilkie commented 5 years ago

Yeah. I’ve thought about decon performance quite a bit. Another thing to check is whether these other methods are using the accelerated RL (Biggs and Andrews) as we do (and MATLAB does). The acceleration does take extra memory to compute, but the performance is worth it IMHO. That Biggs paper is pretty old and there may be a higher-performance flavor or something that uses less memory... or maybe the Biggs and Andrews workspace arrays might be able to share memory. I should look into that.

-Dan
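
For readers unfamiliar with it, the core of the Biggs and Andrews acceleration is a vector extrapolation around the plain Richardson-Lucy update. A rough numpy/scipy sketch of the idea (not the cudaDecon or flowdec implementation; the exact bookkeeping in the paper differs, and the extra arrays illustrate the memory cost mentioned above):

```python
import numpy as np
from scipy.signal import fftconvolve

def rl_step(estimate, image, psf, psf_mirror):
    # One plain Richardson-Lucy multiplicative update.
    blurred = fftconvolve(estimate, psf, mode="same")
    ratio = image / np.maximum(blurred, 1e-12)
    return estimate * fftconvolve(ratio, psf_mirror, mode="same")

def rl_accelerated(image, psf, niter=10):
    psf_mirror = psf[::-1, ::-1, ::-1]   # assumes 3D stacks
    x = image.astype(np.float32)
    x_prev = x.copy()
    g_prev = g_prev2 = None              # recent update vectors (extra memory)
    for _ in range(niter):
        # Extrapolate along the direction of recent change ("momentum").
        if g_prev is not None and g_prev2 is not None:
            alpha = np.sum(g_prev * g_prev2) / (np.sum(g_prev2 * g_prev2) + 1e-12)
            alpha = np.clip(alpha, 0.0, 1.0)
        else:
            alpha = 0.0
        y = np.maximum(x + alpha * (x - x_prev), 0)   # predicted estimate
        x_new = rl_step(y, image, psf, psf_mirror)
        g_prev2, g_prev = g_prev, x_new - y           # update change vectors
        x_prev, x = x, x_new
    return x
```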


VolkerH commented 5 years ago

The Biggs Thesis from the U of Auckland (https://researchspace.auckland.ac.nz/handle/2292/1760). I skimmed it but haven't found the time to read it.

However, now that your cudaDeconv code has been open-sourced, it becomes a bit more difficult to justify putting the effort in (it is fun, though, and the lessons learnt will definitely be useful in other projects). I think I need to lock down the feature set I want for now and create a usable and easily installable version.

Back to VRAM utilization: the approach I am using of deskewing on the raw data (by resampling and skewing the PSF) can potentially save considerable memory; I could also try that with cudaDeconv.
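
As a very rough illustration of what "skewing the PSF into the raw frame" could look like numerically (placeholder geometry and parameter names, not the repo's actual implementation; sign conventions and output-shape handling are glossed over):

```python
import numpy as np
from scipy.ndimage import affine_transform

def shear_matrix(angle_deg, dz_stage, dx):
    # Lateral shift of each z-plane, in pixels per slice, caused by the
    # stage-scan geometry (placeholder formula for illustration).
    shift_per_slice = np.cos(np.deg2rad(angle_deg)) * dz_stage / dx
    m = np.eye(3)              # axes assumed in (z, y, x) order
    m[2, 0] = shift_per_slice  # x depends linearly on z -> shear
    return m

def skew_psf_to_raw_frame(psf, angle_deg, dz_stage, dx):
    # Resample the measured PSF into the skewed (raw) acquisition frame so
    # deconvolution can run on the small raw stack; a real implementation
    # would also enlarge output_shape so nothing gets clipped.
    return affine_transform(psf, shear_matrix(angle_deg, dz_stage, dx), order=1)
```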

dmilkie commented 5 years ago

> a bit more difficult to justify putting the effort in

Cool! I'm happy to hear it. :)

> The Biggs Thesis

Give his paper a shot. It's pretty readable. https://doi.org/10.1364/AO.36.001766

> The approach I am using of deskewing on the raw data (by resampling and skewing the PSF) can potentially save considerable memory, I could also try that with cudaDeconv.

Right. I think you could initially give this a go by first running cudaDeconv with skewed data and a skewed PSF (be aware that somewhere, maybe in OTFgen.exe or cudaDeconv.exe, there is some rotational averaging, i.e. it assumes the PSF has some symmetry; there might be a command-line switch to turn this on/off). Then run cudaDeconv again with iterations=0 to just deskew the data (or use whatever transform tool you already have).

-Dan


VolkerH commented 5 years ago

Thanks for the link to the paper and the suggestions.

I will close this issue as I've addressed some of the discussed issues via this branch https://github.com/VolkerH/Lattice_Lightsheet_Deskew_Deconv/pull/33