gambiteer opened this issue 6 months ago
Source: https://arxiv.org/abs/2112.08926
It was actually a place holder until I could derive something myself but I may just leave it as is.
There is also some interesting discussion in https://stackoverflow.com/questions/1659440/32-bit-to-16-bit-floating-point-conversion. Note that one answer just converts the three fields of the float (sign, exponent, mantissa) separately, capping the exponent and rounding or truncating the mantissa depending on the direction, then recombines them:
Half to float:
uint32_t u = ((uint32_t)(h&0x8000)<<16) | (((h&0x7c00)+0x1C000)<<13) | ((h&0x03FF)<<13);
float f; memcpy(&f, &u, sizeof f);  /* reinterpret the bits; assigning the integer to a float directly would convert its numeric value instead */
Float to half:
uint32_t x; memcpy(&x, &f, sizeof x);  /* reinterpret the bits; the pointer cast *((uint32_t*)&f) violates strict aliasing */
uint16_t h = ((x>>16)&0x8000) | ((((x&0x7f800000)-0x38000000)>>13)&0x7c00) | ((x>>13)&0x03ff);
which is in the same spirit as the above implementation but a little too simple: it doesn't handle denormalized values, infinities, etc.
Thanks. I used Gambit's C interface to test the code.
It seems that the code always rounds away from zero when a double lies exactly between two representable f16 values, instead of rounding to even.
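For reference, here is a minimal sketch (my own code, not the implementation under discussion) of a double->binary16 conversion that does round to nearest, ties to even; to keep it short, subnormal binary16 results are flushed to signed zero:

```c
#include <stdint.h>
#include <string.h>
#include <math.h>

/* Sketch: convert a double to an IEEE binary16 bit pattern with
   round-to-nearest-even.  Subnormal binary16 results are flushed to
   signed zero here, so this is not a complete converter. */
uint16_t double_to_f16_rne(double d) {
  uint64_t bits;
  memcpy(&bits, &d, sizeof bits);
  uint16_t sign = (uint16_t)((bits >> 48) & 0x8000);
  if (isnan(d)) return sign | 0x7E00;          /* one valid NaN: nonzero mantissa */
  if (isinf(d)) return sign | 0x7C00;          /* all-ones exponent, zero mantissa */
  if (d == 0.0) return sign;                   /* preserves -0.0 as 32768 */
  int e;
  double m = frexp(fabs(d), &e);               /* |d| = m * 2^e, m in [0.5, 1) */
  int he = e - 1 + 15;                         /* biased binary16 exponent */
  if (he <= 0) return sign;                    /* subnormal result: flushed (sketch) */
  if (he >= 31) return sign | 0x7C00;          /* overflow to infinity */
  double scaled = ldexp(m, 12);                /* 11 significand bits + 1 guard bit */
  uint32_t q = (uint32_t)scaled;               /* truncated, in [2048, 4096) */
  double sticky = scaled - q;                  /* any bits below the guard bit */
  uint32_t keep = q >> 1, guard = q & 1;
  if (guard && (sticky > 0.0 || (keep & 1)))   /* round to nearest, ties to even */
    keep++;
  if (keep == 2048) {                          /* rounding overflowed the significand */
    keep = 1024;
    if (++he >= 31) return sign | 0x7C00;      /* e.g. 65520.0 rounds up to +inf.0 */
  }
  return (uint16_t)(sign | (he << 10) | (keep & 0x3FF));
}
```

With this, the midpoint 1.00048828125 (exactly between the binary16 values with codes 15360 and 15361) encodes to the even code 15360, instead of rounding away from zero.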
It also gets the encoding of infinities and NaNs wrong; it should give
> (double->f16 +inf.0)
31744
> (double->f16 +nan.0)
32767
(the second encoding isn't unique, but the mantissa needs to be nonzero), while the code actually returns
> (double->f16 +inf.0)
32767
> (double->f16 +nan.0)
32768
> (double->f16 -0.0)
32768
> (f16->double 32768)
-0.
32767 is the encoding for +nan.0, and 32768 is the encoding for -0.0.
I've put a file with my test data at http://www.math.purdue.edu/~lucier/f16-data.txt .
It contains three lists.
The first is a list of doubles to be converted to f16 codes. The second is a list of the codes associated with those doubles. And the third is a list of reconstructed doubles from those codes.
If you negate all the data in the first list, the codes should just increase by 32768 (except perhaps for +nan.0).
I suspect that, because of the implicit double->float conversion in sexp_double_to_half, the code might be affected by double rounding, too.
The rounding I'm not too concerned about, but I'll fix the nan handling.
OK, here are some examples where the finiteness of your results and my results differ.
In each row, the last of the three numbers is the input, the first is your result (double->f16->double), the second is my result.
65520.0 should overflow to +inf.0 in this format. The second group has the sign changed with flcopysign from SRFI 144 (so that doesn't make a difference in the printed output of +nan.0).
finiteness_difference 131008.0 +inf.0 +inf.0
finiteness_difference -131008.0 +nan.0 +nan.0
finiteness_difference 65536.0 +inf.0 65520.0
finiteness_difference -131008.0 -inf.0 -inf.0
finiteness_difference -131008.0 +nan.0 +nan.0
finiteness_difference -65536.0 -inf.0 -65520.0
Sorry, you were asking how to test this and I didn't reply. The code is exposed in the C API, but from Scheme you have to go indirectly via uniform vectors, for example to test the round trip on +inf.0:
$ chibi-scheme -msrfi.160.mini -p'(f16vector-ref #f16(+inf.0) 0)'
131008.0
I figured that out, but it seemed that you could test only double -> f16 code -> double round trips, and if you found a problem you couldn’t tell whether the problem is in the coding or the decoding. That’s why I ended up testing the C code you’re using with Gambit’s C interface.
SRFI 231's sample implementation has functions for coding and decoding (as parameterized Scheme macros); you might find them of interest.
Your code to convert between f16 and f64 is a bit opaque to me. It appears that it may be much faster than my code (which uses flscalbn, flilogb, and many tests and jumps). I have test code for my f16->double and double->f16, where the f16 representation is an unsigned 16-bit int. So, two questions: are sexp_double_to_half and sexp_half_to_double exported to the Scheme level, or are they just used internally to support f16vector-ref and f16vector-set!?
Thanks.
Brad