AndrasKovacs / flatparse

Fast parsing from bytestrings
MIT License
146 stars 12 forks source link

GHC 9.4 support #25

Closed raehik closed 1 year ago

raehik commented 2 years ago

This should be a case of tweaking the machine integer compatibility layer at FlatParse.Internal.UnboxedNumerics to handle the Word#, Word64# change. We might also check out fumieval's word-compat package and see if can be applied here, I never gave it a proper look.

raehik commented 2 years ago

I intend to have a look at this a bit later!

raehik commented 2 years ago

word-compat provides a compatibility layer for new GHC primitives, in terms of the old ones. I think I'd prefer a different approach, where we provide a compatibility package that gives you the new GHC primitives on the old compilers (implemented e.g. by converting all the WordX#s to Word#, and running the platform primitive), and export the regular ones on new compilers.

dminuoso commented 2 years ago

There is three options that I can see Given the example of data Word64 = W64# _

Flatparse currently uses a slight mixture (given it has primop wrappers that either expose sized on all versions, and it has wrappers that switch type depending on version) where word-compat uses (2). Effort-wise, (2) is by far the easiest, (3) exposes these artifacts to the user, while (1) would keep them hidden inside.

raehik commented 2 years ago

Ideally I'd like (1) or (3). I used a mix of them in Flatparse in order to retain simplicity where possible (very hacky).

The problem with (1) is that I can't see how to get a Word64# at all on GHC 9.0! GHC.Exts on base-4.15.1.0 (GHC 9.0) defines it, then never uses it.

I started on supporting GHC 9.4 at #29 that mostly goes with option (3).

dminuoso commented 2 years ago

First, the CPP switched type aliases like Word8'# suffer from PVP instability, namely that updating base can cause breakage To see why, imagine you are using anyWord16# where Word16'# ~ Word#, and after an GHC update from 9.0 to 9.4 it becomes Word16'# ~ Word16#. I do not think this can be addressed, unless any{Word,Int}{8,16,32}# are moved of public API (say an Internal or Unstable module). @AndrasKovacs How important is PVP stability to you?

Then, I've had some time to do research on GHC and talk with Andreas.

In GHC <= 9.0 the sized primtypes (Word8#, Int8#, etc...) are mostly unusable for users. There are mostly no primops available to consume them. I began noticing this while trying to implement shims for word{8,16,32,64}ToWord#. The only way to deal with them, you have to unsafeCoerce# them to something else - in principle this is safe however, as in GHC <= 9.0 sized primtypes are implemented as native primtypes, so you can just unsafeCoerce# between then. Proper Cmm conversions exist, so it wont trip cmm lint. So this would be usable, but this would require documention that the few users who do want to use withAny{Word,Int}{8,16,32} parsers and want to use GHC <= 9.0 must unsafeCoerce# their values into Word# for usability.

However, on newer GHC versions I am not sure whether that would be safe. If it is unsafe, this could have the extremely undesirable effect of inducing crashes or garbage parsers after updating GHC from 9.0 to say 9.4. I am not sure however this is actually the case, this needs more research.

AndrasKovacs commented 2 years ago

As I see the only place where CPP-switched primitive types appear in exports is the "machine integer continuation" functions. Could we just change the types of those to refer to ordinary WordN and IntN instead? I do not think that any unboxing is lost by doing that, because all CPS'd functions must be inlined anyway. If we do this, then there's no CPP-switching in exports anymore.

dminuoso commented 2 years ago

Well, they are not required to be inlined... one can also write withAnyWord8 fun from a user side with {-# NOINLINE fun #-} (or maybe fun is just too large for GHC to inline).

But that shouldn't be a concern of ours - in all reasonable code the unboxing inside the continuation will occur automatically.

I second providing boxed types in the continuation parsers, for us GHC will get rid of the intermediate boxes, and it provides a stable interface.

The one thing we should look at then, is how to handle Word64# internally. Ideally, we should use Word64# in older versions. While not visbile on haddock, Word64# and related primops are available on 32 bit systems. With that, we could even get 32-bit support back.

raehik commented 2 years ago

OK, thanks very for discussing unboxing -- I don't understand well how GHC unboxes things and was aiming for easily-provable performance. Boxing the machine int parsers fixes most issues. In order to fully remove CPP, some other areas require boxing too:

scan64# :: Word64 -> Parser e ()
scan64# (W64# w) = Parser \fp eob s ->
  case indexWord64OffAddr# s 0# of
    w' -> case eqWord64# w w' of
      1# -> OK# () (plusAddr# s 8#)
      _  -> Fail#

eqWord64# :: Word64# -> Word64# -> Int# doesn't exist prior to GHC 9.4 . Instead, we remain boxed:

scan64# w = Parser \fp eob s ->
    if   W64# (indexWord64OffAddr# s 0#) == w
    then OK# () (plusAddr# s 8#)
    else Fail#

Now it works on all GHCs, and I can see how it's trivial to unbox. Is this sort of change OK to make across the board? It seems we can get rid of my hacky UnboxedNumerics module this way.


Regarding Word size: @dminuoso do you mean write everything using 8-byte Word64s, and simply accept that we lose performance on 32-bit? I would be very happy if it's that simple.

AndrasKovacs commented 2 years ago

I'm fine with using CPP internally, I'd only like to remove them from exported types. I overall prefer CPP'd primops over class-based overloading across GHC versions. It leaves us with more control and we need CPP internally anyway.

raehik commented 1 year ago

Related Stackage issue: https://github.com/commercialhaskell/stackage/issues/6767

36 adds GHC 9.4 support along with lots of other changes. Would be happy to split the PR into GHC 9.4 and other refactorings if it simplifies the process.