gambiteer opened this issue 6 months ago
Source: https://arxiv.org/abs/2112.08926
It was actually a place holder until I could derive something myself but I may just leave it as is.
There is also some interesting discussion in https://stackoverflow.com/questions/1659440/32-bit-to-16-bit-floating-point-conversion. Note that one answer just converts the three fields of the float (sign, exponent, mantissa) separately, capping the exponent and rounding or truncating the mantissa depending on the direction, then recombines them:
Half to float:
uint32_t u = ((uint32_t)(h&0x8000)<<16) | (((h&0x7c00)+0x1C000)<<13) | ((h&0x03FF)<<13);
float f; memcpy(&f, &u, sizeof f);  /* reinterpret the bits; assigning the integer to a float directly would convert its numeric value instead */
Float to half:
uint32_t x; memcpy(&x, &f, sizeof x);  /* reinterpret the bits; the pointer cast *((uint32_t*)&f) violates strict aliasing */
uint16_t h = ((x>>16)&0x8000) | ((((x&0x7f800000)-0x38000000)>>13)&0x7c00) | ((x>>13)&0x03ff);
which is in the same spirit as the above implementation but a little too simple: it doesn't handle denormalized values, infinities, etc.
Thanks. I used Gambit's C interface to test the code.
It seems that the code always rounds away from zero when a double lies exactly between two representable f16 values, instead of rounding to even.
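For reference, here is a minimal sketch (my own code, not the implementation under discussion) of a double->binary16 conversion that does round to nearest, ties to even; to keep it short, subnormal binary16 results are flushed to signed zero:

```c
#include <stdint.h>
#include <string.h>
#include <math.h>

/* Sketch: convert a double to an IEEE binary16 bit pattern with
   round-to-nearest-even.  Subnormal binary16 results are flushed to
   signed zero here, so this is not a complete converter. */
uint16_t double_to_f16_rne(double d) {
  uint64_t bits;
  memcpy(&bits, &d, sizeof bits);
  uint16_t sign = (uint16_t)((bits >> 48) & 0x8000);
  if (isnan(d)) return sign | 0x7E00;          /* one valid NaN: nonzero mantissa */
  if (isinf(d)) return sign | 0x7C00;          /* all-ones exponent, zero mantissa */
  if (d == 0.0) return sign;                   /* preserves -0.0 as 32768 */
  int e;
  double m = frexp(fabs(d), &e);               /* |d| = m * 2^e, m in [0.5, 1) */
  int he = e - 1 + 15;                         /* biased binary16 exponent */
  if (he <= 0) return sign;                    /* subnormal result: flushed (sketch) */
  if (he >= 31) return sign | 0x7C00;          /* overflow to infinity */
  double scaled = ldexp(m, 12);                /* 11 significand bits + 1 guard bit */
  uint32_t q = (uint32_t)scaled;               /* truncated, in [2048, 4096) */
  double sticky = scaled - q;                  /* any bits below the guard bit */
  uint32_t keep = q >> 1, guard = q & 1;
  if (guard && (sticky > 0.0 || (keep & 1)))   /* round to nearest, ties to even */
    keep++;
  if (keep == 2048) {                          /* rounding overflowed the significand */
    keep = 1024;
    if (++he >= 31) return sign | 0x7C00;      /* e.g. 65520.0 rounds up to +inf.0 */
  }
  return (uint16_t)(sign | (he << 10) | (keep & 0x3FF));
}
```

With this, the midpoint 1.00048828125 (exactly between the binary16 values with codes 15360 and 15361) encodes to the even code 15360, instead of rounding away from zero.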
It also gets the encoding of infinities and NaNs wrong; it should give
> (double->f16 +inf.0)
31744
> (double->f16 +nan.0)
32767
(the second encoding isn't unique, but the mantissa needs to be nonzero), while the code actually returns
> (double->f16 +inf.0)
32767
> (double->f16 +nan.0)
32768
> (double->f16 -0.0)
32768
> (f16->double 32768)
-0.
32767 is the encoding for +nan.0, and 32768 is the encoding for -0.0.
I've put a file with my test data at http://www.math.purdue.edu/~lucier/f16-data.txt .
It contains three lists.
The first is a list of doubles to be converted to f16 codes. The second is a list of the codes associated with those doubles. And the third is a list of reconstructed doubles from those codes.
If you negate all the data in the first list, the codes should just increase by 32768 (except perhaps for +nan.0).
I suspect that, because of the implicit double->float conversion in sexp_double_to_half, the code might be affected by double rounding, too.
The rounding I'm not too concerned about, but I'll fix the nan handling.
OK, here are some examples where the finiteness of your results and my results differ.
In each row, the last of the three numbers is the input, the first is your result (double->f16->double), the second is my result.
65520.0 should overflow to +inf.0 in this format. The second group has the sign changed with flcopysign from SRFI 144 (so that doesn't make a difference in the printed output of +nan.0).
finiteness_difference 131008.0 +inf.0 +inf.0
finiteness_difference -131008.0 +nan.0 +nan.0
finiteness_difference 65536.0 +inf.0 65520.0
finiteness_difference -131008.0 -inf.0 -inf.0
finiteness_difference -131008.0 +nan.0 +nan.0
finiteness_difference -65536.0 -inf.0 -65520.0
Sorry, you were asking how to test this and I didn't reply. The code is exposed in the C API, but from Scheme you have to go indirectly via uniform vectors, for example to test the round trip on +inf.0:
$ chibi-scheme -msrfi.160.mini -p'(f16vector-ref #f16(+inf.0) 0)'
131008.0
I figured that out, but it seemed that you could test only double -> f16 code -> double round trips, and if you found a problem you couldn’t tell whether the problem is in the coding or the decoding. That’s why I ended up testing the C code you’re using with Gambit’s C interface.
SRFI 231's sample implementation has functions for coding and decoding (as parameterized Scheme macros); you might find them of interest.
Your code to convert between f16 and f64 is a bit opaque to me. It appears that it may be much faster than my code (which uses flscalbn, flilogb, and many tests and jumps). I have test code for my f16->double and double->f16, where the f16 representation is an unsigned 16-bit int. So, two questions: are sexp_double_to_half and sexp_half_to_double exported to the Scheme level, or are they just used internally to support f16vector-ref and f16vector-set!?
Thanks.
Brad