Float16 type - Githubissues

rwgardner commented 11 years ago

This is a request for support for half-precision floating point numbers (Float16s).

(If there has been any discussion about adding support for these, which I would expect there was, I did not find it.)

Although the precision is low, Float16s are still useful when you have a very large quantity of floating point numbers (which is what we have) and want to reduce memory footprint, cache impact, or disk storage. (Currently, we manually convert our half precision floats with bit manipulations and reinterpretation, but the code would be cleaner if Julia supported them natively.)

Thanks.

nolta commented 11 years ago

LLVM 3.1 added support for half floats, so this should be doable. Marking as 'up for grabs'.

StefanKarpinski commented 11 years ago

Since this is strictly a storage type, very few operations are needed – mostly conversion to and from larger float types.
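
As a rough illustration of the storage-only use case, here is a minimal sketch in current Julia syntax. That is an assumption: this thread predates a built-in Float16, and the variable names are made up for the example. Values are narrowed to 16 bits for storage and widened to Float32 before any arithmetic.

```julia
# Sample values chosen to be exactly representable in half precision.
samples32 = Float32[0.5, 1.25, 100.25, 3.0]

# Narrow to 16 bits for storage; each element now takes 2 bytes instead of 4.
samples16 = Float16.(samples32)
bytes16 = sizeof(samples16)
bytes32 = sizeof(samples32)

# Widen each element back to Float32 before accumulating, so the
# arithmetic itself never happens in half precision.
total = sum(Float32, samples16)
```

The only operations exercised on the 16-bit values are the two conversions, which is the small surface area described above.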

timholy commented 11 years ago

@rwgardner, my guess is that this will happen sooner if you submit a pull request. ("Up for grabs" is a good choice here, and it basically means "waiting for someone to do it." Since you want the feature...) It's good that you first submitted it as an issue, however, in case there were strong objections; since that doesn't seem to be the case, it looks like the way is clear for you to add this feature.

Some time in the not-too-distant past, support for Int128 was added. Perhaps a good start might be browsing the commit history (with git log and git show) to find out exactly how that was done; it might be a great model for this case.

StefanKarpinski commented 11 years ago

Float16 should be substantially easier than Int128. Up for grabs is more like "waiting for someone to do it and pretty nicely isolated and doable by a determined newcomer."

ViralBShah commented 11 years ago

The cool thing about Int128 was that it was done fully in julia. I believe that to get a fast Float16 implementation, one may need to leverage LLVM's Float16 capabilities in intrinsics.cpp and codegen.cpp.

I believe a first cut implementation can be done by leveraging bitshifts and such the way @rwgardner has already done, and it would be nice to receive that as a pull request as a starting point.

rwgardner commented 11 years ago

Sounds good. I'm not "grabbing" this yet, but I will if I really want it done. (Unfortunately, I don't get paid to work on Julia for the most part, which means I need to do this in my free time. That's something I'd love to do, but in short, a new first baby due any day has been and will be dominating that for a while.)

ViralBShah commented 11 years ago

Is it possible for you to isolate the code that you have already written for Float16 and submit that?

StefanKarpinski commented 11 years ago

Outline of what needs to be done:

[ ] Add intrinsics for floating point truncation and extension to and from 16-bit floats
- can be done either by adding specific intrinsics or generalizing the existing ones
[ ] Add convert methods to/from Float16 and other numeric types
[ ] Add promotion rules for Float16 and other numeric types

@JeffBezanson, any thoughts on whether it's better to add new specific intrinsics (fptrunc16 and fpext32) or generalize the existing ones? I was leaning towards generalizing the existing ones and renaming fptrunc32 => fptrunc and fpext64 => fpext.

ghost commented 11 years ago

If rwgardner is alright with it, I can try implementing Float16. I've wanted to find a way to get my hands dirty in Julia.

ViralBShah commented 11 years ago

@mattgallivan Please jump in. The more the merrier. @StefanKarpinski's outline is basically what needs to be done, and one can follow the Float32 implementation in src and base.

StefanKarpinski commented 11 years ago

Just to expand on what I mean by "generalizing the existing ones", this means turning the fptrunc and fpext intrinsics into versions that aren't specific to bit sizes but use type info to figure out the appropriate sizes and call the corresponding LLVM instructions. We've gradually been moving from specific versions with bit sizes in their names to more generic ones.
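
To make the "use type info" idea concrete, here is a small sketch in current Julia syntax. The names generic_fptrunc and generic_fpext are hypothetical and are not the actual intrinsics; the sketch just shows one entry point per direction that takes the target type as an argument and derives the sizes from the types, instead of encoding a bit width in the function name.

```julia
# Hypothetical helpers, for illustration only. The source and target widths
# come from the types themselves rather than from a suffix such as "32".
function generic_fptrunc(::Type{T}, x::S) where {T<:AbstractFloat, S<:AbstractFloat}
    sizeof(T) < sizeof(S) || throw(ArgumentError("target $T is not narrower than $S"))
    return T(x)   # stands in for the LLVM fptrunc instruction
end

function generic_fpext(::Type{T}, x::S) where {T<:AbstractFloat, S<:AbstractFloat}
    sizeof(T) > sizeof(S) || throw(ArgumentError("target $T is not wider than $S"))
    return T(x)   # stands in for the LLVM fpext instruction
end

half = generic_fptrunc(Float16, 1.5f0)   # Float32 -> Float16
wide = generic_fpext(Float64, half)      # Float16 -> Float64
```

With this shape, adding a new width means adding a type, not another pair of size-specific functions.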

ViralBShah commented 11 years ago

The Int stuff already does that and it would be nice to do so with FloatingPoint too. I wonder if we should take this opportunity to also add Float128 at the same time, assuming LLVM supports it.

Keno commented 11 years ago

Since there is no hardware support for quad-precision arithmetic, adding Float128 is quite a bit more complicated.

StefanKarpinski commented 11 years ago

Yeah, that's a whole different can of worms. You actually want to compute with Float128 or it's completely useless. For Float16, it's fine to just be able to store them.

rwgardner commented 11 years ago

@mattgallivan all sounds good. I would love to contribute and would have a lot of fun doing it, but my life is about as insane as it's ever been right now. Hopefully I can contribute in other ways in the future.

You may not want this (I'm sure it could be written more efficiently, etc., and you may want to do it in Fortran or C), but here's what I have. It also hasn't been heavily validated yet, but you might use it for validation by comparing it to your code. I haven't done any conversion back to Float16.

bitstype 16 MyFloat16

function convert(::Type{Float32}, val::MyFloat16)
    val = uint32(reinterpret(Uint16, val))
    sign = (val & 0x8000) >> 15
    exp  = (val & 0x7c00) >> 10
    sig  = (val & 0x3ff) >> 0
    ret::Uint32

    if exp == 0
        if sig == 0
            sign = sign << 31
            ret = sign | exp | sig
        else
            n_bit = 1
            bit = 0x0200
            while (bit & sig) == 0
                n_bit = n_bit + 1
                bit = bit >> 1
            end
            sign = sign << 31
            exp = (-14 - n_bit + 127) << 23
            sig = ((sig & (~bit)) << n_bit) << (23 - 10)
            ret = sign | exp | sig
        end
    elseif exp == 0x1f
        if sig == 0
            if sign == 0
                ret = 0x7f800000
            else
                ret = 0xff800000
            end
        else
            ret = 0xffffffff
        end
    else
        sign = sign << 31
        exp  = (exp - 15 + 127) << 23
        sig  = sig << (23 - 10)
        ret = sign | exp | sig
    end
    return reinterpret(Float32, ret)
end

function convert(::Type{Float64}, val::MyFloat16)
    val = uint64(reinterpret(Uint16, val))
    sign = (val & 0x8000) >> 15
    exp  = (val & 0x7c00) >> 10
    sig  = (val & 0x3ff) >> 0
    ret::Uint64

    if exp == 0
        if sig == 0
            sign = sign << 63
            ret = sign | exp | sig
        else
            n_bit = 1
            bit = 0x0200
            while (bit & sig) == 0
                n_bit = n_bit + 1
                bit = bit >> 1
            end
            sign = sign << 63
            exp = (-14 - n_bit + 1023) << 52
            sig = ((sig & (~bit)) << n_bit) << (52 - 10)
            ret = sign | exp | sig
        end
    elseif exp == 0x1f
        if sig == 0
            if sign == 0
                ret = 0x7ff0000000000000
            else
                ret = 0xfff0000000000000
            end
        else
            ret = 0xffffffffffffffff
        end
    else
        sign = sign << 63
        exp  = (exp - 15 + 1023) << 52
        sig  = sig << (52 - 10)
        ret = sign | exp | sig
    end

    return reinterpret(Float64, ret)
end

We could convert to only Float32 or Float64 and then use existing code to convert between those. It seems more efficient to convert to/from both directly in most cases, but it may not be on some architectures, partly depending on whether there is hardware support for converting between Float32 and Float64. (I don't know if that's something floating point units typically support or not.)
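
Since this code is offered partly as a validation aid, here is one way such a check could look. It is a sketch in current Julia syntax, which is an assumption: the syntax and the built-in Float16 both postdate this thread. The function names are hypothetical. The Float32 conversion above is ported as-is, except that every NaN maps to 0xffffffff as before and is therefore compared with isnan instead of bit for bit. All 65536 half-precision bit patterns are compared against the built-in conversion.

```julia
# Port of the Float16 -> Float32 bit manipulation, taking the raw 16 bits.
function half_bits_to_float32(h::UInt16)
    val  = UInt32(h)
    sign = (val & 0x8000) << 16       # sign bit moved to bit 31
    exp  = (val & 0x7c00) >> 10       # 5-bit biased exponent
    sig  = val & 0x03ff               # 10-bit significand

    if exp == 0
        if sig == 0
            bits = sign               # signed zero
        else
            # Subnormal: locate the leading one, then renormalize.
            n_bit = 1
            bit = UInt32(0x0200)
            while (bit & sig) == 0
                n_bit += 1
                bit >>= 1
            end
            e = UInt32(127 - 14 - n_bit) << 23
            s = ((sig & ~bit) << n_bit) << 13
            bits = sign | e | s
        end
    elseif exp == 0x1f
        # Infinity keeps its sign; every NaN collapses to one pattern.
        bits = sig == 0 ? (sign | 0x7f800000) : 0xffffffff
    else
        # Normal number: rebias the exponent from 15 to 127.
        bits = sign | ((exp + UInt32(112)) << 23) | (sig << 13)
    end
    return reinterpret(Float32, bits)
end

# Count bit patterns where the port disagrees with the built-in conversion.
function count_mismatches()
    bad = 0
    for i in 0:0xffff
        h = UInt16(i)
        mine = half_bits_to_float32(h)
        ref  = Float32(reinterpret(Float16, h))
        ok = isnan(ref) ? isnan(mine) :
             reinterpret(UInt32, mine) == reinterpret(UInt32, ref)
        bad += !ok
    end
    return bad
end
```

Because the input space is only 16 bits wide, an exhaustive comparison like this is cheap, and the same loop could be pointed at a Float64 version or at a future Float32 -> Float16 routine.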

ViralBShah commented 11 years ago

@StefanKarpinski Would it be good to start off with this as a pure julia implementation and get it in base to begin with?

ViralBShah commented 11 years ago

Until the LLVM bug is sorted out, it may be worthwhile to put @rwgardner's Julia implementation in Base. That way, at least the storage format can be used, and the conversions could potentially be faster once the LLVM issue is fixed.

@loladiro Does LLVM 3.3 fix the Float16 bugs?

StefanKarpinski commented 11 years ago

Even using @rwgardner's conversions, the following patch unfortunately still causes LLVM failures:

https://gist.github.com/StefanKarpinski/9092d04bc24c44493d08

julia> float16(1.5)
LLVM ERROR: Cannot select: 0x104151b10: ch = store 0x102070910, 0x10421df10, 0x104231d10, 0x10434d410<ST2[%14]> [ORD=77165] [ID=35]
  0x10421df10: f16,ch = load 0x10434dc10, 0x102070010, 0x10434d410<LD2[FixedStack0]> [ORD=77156] [ID=27]
    0x102070010: i64 = FrameIndex<0> [ORD=77155] [ID=4]
    0x10434d410: i64 = undef [ORD=77150] [ID=2]
  0x104231d10: i64 = add 0x104233910, 0x1041a7810 [ORD=77163] [ID=33]
    0x104233910: i64,ch,glue = CopyFromReg 0x104087a10, 0x104088010, 0x104087a10:1 [ORD=77157] [ID=32]
      0x104088010: i64 = Register %RAX [ORD=77157] [ID=10]
      0x104087a10: ch,glue = callseq_end 0x10434da10, 0x104264310, 0x104264310, 0x10434da10:1 [ORD=77157] [ID=31]
        0x104264310: i64 = TargetConstant<0> [ORD=77155] [ID=5]
        0x104264310: i64 = TargetConstant<0> [ORD=77155] [ID=5]
        0x10434da10: ch,glue = X86ISD::CALL 0x104279410, 0x104232910, 0x104085410, 0x10417a710, 0x104279410:1 [ORD=77157] [ID=30]
          0x104232910: i64 = X86ISD::Wrapper 0x104085310 [ID=16]

Keno commented 11 years ago

You'll still want to leave in the disable in the compiler, otherwise LLVM will generate bad code. LLVM 3.3 does not fix this.

JeffBezanson commented 11 years ago

Yes, with this implementation no compiler changes are needed; it's just a 16-bit bitstype.

StefanKarpinski commented 11 years ago

Ok, if someone wants to finish this, I'm away for the day.

ViralBShah commented 11 years ago

Bump.

Keno commented 11 years ago

@StefanKarpinski do you just want to apply your patch?

StefanKarpinski commented 11 years ago

I don't think just applying the patch works. There were a bunch of changes it needed in order to work.

ViralBShah commented 11 years ago

It would be good to have a nicer show() method for Float16. Asking here in case the current output is by design.

julia> float16(100.25)
Float16(0x5644)

StefanKarpinski commented 11 years ago

Printing 16-bit floats correctly and minimally is quite non-trivial. Our 32-bit and 64-bit float printing are handled by the double-conversion library which does not support 16-bit floats. It might be possible to figure out a hack that approximates correct minimal Float16 printing using the printing routines for Float32, but it's not obvious how.
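
One brute-force possibility, given purely as a sketch and not as what the printing code does: try decimal roundings with more and more significant digits and keep the first one that converts back to the same Float16. This is written in current Julia syntax, which is an assumption, and the function name is hypothetical. It relies on a wider float's printing routine for the digits. Five significant digits are enough to distinguish any two Float16 values, so the search is short.

```julia
# Search for a short decimal string that converts back to the same Float16.
function shortest_roundtrip(x::Float16)
    # Non-finite values and zeros need no digit search.
    (!isfinite(x) || iszero(x)) && return string(Float32(x))
    wide = Float64(x)
    for n in 1:5
        candidate = round(wide; sigdigits=n)
        # Accept the first rounding that maps back to exactly x.
        Float16(candidate) == x && return string(candidate)
    end
    # Fallback: the exact widened value always round-trips.
    return string(Float32(x))
end

printed = shortest_roundtrip(Float16(100.25))
```

This trades speed for simplicity, and it makes no attempt to match the formatting conventions of the existing Float32 and Float64 printers.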

ViralBShah commented 11 years ago

I wonder what is going on here:

julia> a = float16(rand(5,5))
5x5 Float16 Array:
 0.445801  0.154785  0.431641   0.384521  0.188354 
 0.4646    0.281006  0.766602   0.563965  0.0402222
 0.685059  0.92627   0.921875   0.933594  0.468994 
 0.841797  0.582031  0.0185242  0.481934  0.151367 
 0.348877  0.952637  0.672852   0.864746  0.166138

JeffBezanson commented 11 years ago

Float16 printing has several problems right now, e.g.

julia> print_shortest(STDOUT,NaN16)
NaN32

(Also, NaN16 does not work properly.) I'm about to commit some fixes.

showcompact has a fallback definition that is printing the Float16s in that array by converting them to Float64. The question is whether we should print the f0 suffix. For now I'll say that is specific to Float32, and leave it off.

JuliaLang / julia

Float16 type #3467