bd-j / forcepho

Generative modeling galaxy photometry for JWST
https://forcepho.readthedocs.io

Half precision computations #29

Open lgarrison opened 5 years ago

lgarrison commented 5 years ago

CUDA supports efficient computation with half-precision (16-bit) floats. This is probably enough precision for the pixel data in our problem. We might do this with CUDA's half2 type, but this would require using SIMD intrinsics rather than plain math operators.

This will require care in two ways: making sure that our values stay within 6e-5 to 6e4, and making sure that our operation count is small enough that the build-up of round-off/truncation error is acceptable. The significand precision is 1/2048, or 5e-4.
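For concreteness, these limits can be checked against NumPy's `float16`, which is the same IEEE 754 binary16 format as CUDA's `half`:

```python
import numpy as np

# numpy.float16 is IEEE 754 binary16, the same format as CUDA's half type.
info = np.finfo(np.float16)
print(float(info.max))    # largest finite value: 65504.0 (~6e4)
print(float(info.tiny))   # smallest normal value: 2**-14 ~ 6.1e-5
print(float(info.eps))    # spacing at 1.0: 2**-10

# Half-ulp relative rounding error, i.e. the significand precision quoted above:
print(2.0**-11)           # 1/2048 ~ 4.9e-4
```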

deisenstein commented 5 years ago

Some further notes:

A patch may have a few hundred pixels; rounding off to 10% of a pixel would not be good. But the pixel values themselves are integers and hence have no round-off. I think we could express the center of the Gaussian with the nearest integer and then a fractional part; then we could do the dx,dy math with much less loss of precision.
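That loss of precision is easy to demonstrate with NumPy's binary16 (the coordinate 437.3 is an arbitrary example, not a forcepho value):

```python
import numpy as np

center = 437.3   # example Gaussian center in pixel coordinates

# Direct half storage: for magnitudes in [256, 512) the half ulp is 0.25 px,
# so the stored center can be off by up to 0.125 px (>10% of a pixel).
direct = float(np.float16(center))

# Split storage: a nearest-integer part (exact in half up to 2048) plus a
# fractional part, whose half ulp near 0.5 is only 2**-12 ~ 2.4e-4 px.
ci = round(center)
frac = float(np.float16(center - ci))
split = ci + frac

print(abs(direct - center))   # ~0.05 px
print(abs(split - center))    # ~5e-5 px
```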

It seems like the half2 library doesn't provide binary operators, but rather a ton of intrinsic functions. So this is not as simple as changing PixFloat; we have to recode the kernel. And there is an FMA, so we'd need to be thoughtful about that coding.

It seems that we'll need to store ImageGaussian items as a half2 doublet, so we don't save any space in there. I think we'd compute the convolution in float and then compress to half2 only when inserting into the ImageGaussian. That part looks easy to code.
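A minimal sketch of that storage path, modeling a half2 doublet as a length-2 float16 array (the function name is illustrative, not forcepho's):

```python
import numpy as np

def compress_to_half2(a, b):
    """Compute in float32 upstream; round to a half2-like pair only on insertion."""
    return np.array([a, b], dtype=np.float16)

# e.g. two adjacent pixel amplitudes from a convolution done in float32
pair = compress_to_half2(np.float32(1.2345678), np.float32(0.0012345))

# Each stored value is within half an ulp (relative error < 2**-11) of the input.
print(pair)
```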

The comparison operators require both halves to be true, which is fine for superpixel comparisons.

There are some differences in the derivative computation, but I'm not convinced it's particularly scary.

The accumulators could stay in float, but would start the reductions from __high2float(x) + __low2float(x). Easy.
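The benefit shows up quickly in a sketch like this (a NumPy stand-in, not forcepho code; the CUDA __high2float/__low2float conversions are modeled as plain casts):

```python
import numpy as np

rng = np.random.default_rng(0)
vals = rng.uniform(0.0, 1.0, size=400).astype(np.float16)
pairs = vals.reshape(-1, 2)   # stand-in for half2 doublets

exact = float(np.sum(vals.astype(np.float64)))

# Accumulating in half: round-off compounds on every add once the sum is large.
acc_half = np.float16(0.0)
for v in vals:
    acc_half = np.float16(acc_half + v)

# Accumulating in float, seeding each step with the high and low halves widened
# to float (the __high2float(x) + __low2float(x) pattern):
acc_float = np.float32(0.0)
for hi, lo in pairs:
    acc_float += np.float32(hi) + np.float32(lo)

print(abs(float(acc_half) - exact))    # large: half accumulator round-off
print(abs(float(acc_float) - exact))   # small: only float32-level round-off remains
```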

The pixel indexing in exposure_N and exposure_start would need to be adjusted, because we're dealing with pixel pairs. Easy to do on the GPU at the top of the pixel loop.
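A hypothetical sketch of that adjustment (assuming each exposure's pixel run is padded to an even start and count, so no half2 pair straddles an exposure boundary; the helper name is made up):

```python
# Convert pixel-based offsets/counts (exposure_start, exposure_N) into
# half2 pair-based ones. Assumes even starts and counts via padding.
def to_pair_indexing(exposure_start, exposure_N):
    assert all(s % 2 == 0 for s in exposure_start)
    assert all(n % 2 == 0 for n in exposure_N)
    return [s // 2 for s in exposure_start], [n // 2 for n in exposure_N]

starts, counts = to_pair_indexing([0, 1024, 2048], [1024, 1024, 512])
print(starts, counts)   # [0, 512, 1024] [512, 512, 256]
```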

I'm imagining that we'll have a set of preprocessor macros and definitions that get redefined between float and half2.

#ifdef HALF
#define MUL(a,b) __hmul2((a),(b))
#else
#define MUL(a,b) ((a)*(b))
#endif
