AndrasKovacs / flatparse

Fast parsing from bytestrings
MIT License
146 stars 12 forks source link

ParserIO? #32

Closed dminuoso closed 1 year ago

dminuoso commented 1 year ago

Having IO in a parser might be a useful thing for some of the following reasons:

a) keeping non-trivial state via IORef, while keeping nice small unboxed return types b) logging during parsing c) to efficiently poke into mutable buffers while parsing

So I came up with this type

https://gist.github.com/dminuoso/6de513c3e98ba67588957c19ce4eacbd

This allows us to:

a) Run full blown parser over IO b) Unsafely run IOful parsers in a pure parser c) Reuse all the Parser primitives via liftP without having to rewrite them again. d) Throw exceptions to generate uncatchable errors (say assertions)

If we want this, I do not think a variant for Stateful is really necessary, given that ParserIO essentially subsumes it. You could use ReaderT Env (or implicit params) for a convenient reader environment, and IORefs give you limitless

If you're fine with this, I will move forward with a PR and some tests.

raehik commented 1 year ago

This seems very very cool to me. Would be keen to look at how it can be exploited. possible flatparse parser for aeson...? (edit: aeson seems to be stateless anyway)

AndrasKovacs commented 1 year ago

I thought about threading State# when first defining the Parser type but then skipped it because I didn't need it. I do agree that it can be useful. There should be roughly no disadvantage to always threading State#, because it's erased from generated code and it does not really add more sequentiality than what we already have. There could be some hypothetical blocking of GHC optimizations but I don't expect this to be significant.

If we want to thread State# then it should be just included in all basic Parser definitions, without any additional code duplication. I haven't yet thought deeply about the API for running parsers in IO and running IO in parsers. One suggestion:

Stateful should not be dropped because it is significantly more efficient than what can be achieved by mutable refs. The point of Stateful is to get the absolute best performance by threading a word in registers. Unfortunately I don't know any way to get rid of the Basic-Stateful code duplication in any reasonable way in current Haskell. We don't have sufficient representation polymorphism to do it robustly, and TH and Backpack have awful UX.

dminuoso commented 1 year ago

Apologies for the confusion. I did not mean to remove Stateful, merely that my conceived ParserIO newtype might not need an equivalent for Stateful.

I did explore your idea of threading State# universally through all parsers a bit

https://github.com/dminuoso/flatparse-state/blob/main/app/Main.hs

(The main juicy bits are towards the bottom)

Here's my thoughts so far:

One thing that becomes very viable, is adding cabal flag controlled logging or assertions now, since you can simply write

do w <- foo
#ifdef DEBUG_ENABLE
   liftIO (putStrLn ("foo is: " <> show w))
#endif
...
AndrasKovacs commented 1 year ago

That looks pretty good. I like that only the state type is being varied. For better error messages we can just rename the state types to whatever, so that people don't necessarily need to be exposed to the RealWorld,

dminuoso commented 1 year ago

State# RealWorld is a wired-in type. While I have not done an exhaustive search, it appears there's at least demand analysis for imprecise exceptions. If we switch out RealWorld for IO, this will lead to unsound code

https://gitlab.haskell.org/ghc/ghc/-/blob/451aeac3b07f171f148995717d0d9a1eefe08f0e/compiler/GHC/Core/Opt/DmdAnal.hs#L582-614 https://gitlab.haskell.org/ghc/ghc/-/blob/451aeac3b07f171f148995717d0d9a1eefe08f0e/compiler/GHC/Core/Opt/DmdAnal.hs#L772-807

Take particular note of https://gitlab.haskell.org/ghc/ghc/-/blob/451aeac3b07f171f148995717d0d9a1eefe08f0e/compiler/GHC/Core/Opt/DmdAnal.hs#L781-782

If we insist on better diagnostic, we have to newtype the parsers, which essentially requires embedding each individual Parser into ParserST or ParserIO via some lift functions.

I think it's not necessarily needed, this situation will only occur if you try an embed an ParserIO action into something with a ParserST or Parser type, a scenario that does not seem overly likely.

AndrasKovacs commented 1 year ago

What about type family-ing the state:

data Mode = PureMode | IOMode | STMode s

type family State (m :: Mode) where
  State PureMode   = Proxy# ()
  State IOMode     = State# RealWorld
  State (STMode s) = State# s

type Res# m e a =
  (#
    (# State m, a, Addr# #)
  | (# State m #)
  | (# State m, e #)
  #)

newtype Parser m e a = Parser {runParser# :: ForeinPtrContents -> Addr# -> Addr# -> State m -> Res# m e a}

runParser :: Parser PureMode e a -> Result e a
runParserIO :: Parser IOMode e a -> IO (Result e a)
runParserST :: (forall s. Parser (STMode s) e a) -> Result e a
dminuoso commented 1 year ago

With a bit of adaption I wrote this https://github.com/dminuoso/flatparse-state/commit/788d9d813cf6bd2acf9089c263582ce303ac18a2

Notes:

This now produces the following error:

app/Main.hs:174:7: error:
    • Couldn't match type: 'IOMode
                     with: 'PureMode
      Expected: Parser () Int
        Actual: ParserIO () ()
    • In the expression: thing
      In an equation for ‘err’: err = thing
    |
174 | err = thing
    |       ^^^^^

Which reads nice with both the unification error on the token type, as well as with the outer type alias.

dminuoso commented 1 year ago

After pondering some more, I decided to compare the diagnostic with the one with just RealWorld:


app/Main.hs:177:7: error:
    • Couldn't match type ‘RealWorld’ with ‘PureWorld’
      Expected: Parser () Int
        Actual: ParserIO () ()
    • In the expression: thing
      In an equation for ‘err’: err = thing
    |
177 | err = thing
    |       ^^^^

It seems maybe we did not really gain much at the cost of extra complexity, some extra extensions, additional uses of unsafeCoerce. This could even be problematic, we would need to look into GHC further if State# or Proxy# is specially treated, and whether coercions between them is safe or not. And even if we establish it is, a future GHC update could introduce subtle bugs here.

The diagnostic should always mention the type alias as well, perhaps I did not look closely enough at the time I mentioned it.

In my opinion we should not do this. RealWorld seems fine here.

AndrasKovacs commented 1 year ago

It seems maybe we did not really gain much at the cost of extra complexity, some extra extensions, additional uses of unsafeCoerce.

We should not need to coerce between state types in nominally safe code. There are some unnecessary coercions in your previous code:

Instead of type families, we can just use type synonyms for state types, that saves us language extensions, and maybe the synonyms will persist in error messages (should be tested).

Generally, having Proxy#, State# RealWorld and State# s as distinct state types gives us the greatest safety and precision w.r.t GHC compilation (if we want to have IO, ST and pure code at the same time).

I don't think it makes sense to have fixed State# RealWorld threading. First, we would still need to distinguish Parser and ParserIO in user-facing API, so why not just make the distinction in state types. Second, while this would be fine for pure and IO parsing (because I doubt that there's any penalty to RealWorld passing in pure parser code), for ST support we'd need to cast between State# RealWorld and State# s, which I don't really like.

dminuoso commented 1 year ago

https://github.com/dminuoso/flatparse-state/commit/bda57a95f8b06e83f9c3a6d0d38b4ea62431ef54

Okay, this change is actually simple to do, we can simply write

#if !MIN_VERSION_base(4,17,0)
type ZeroBitRep = 'TupleRep ('[] :: [RuntimeRep])
type ZeroBitType = TYPE ZeroBitRep
#endif

newtype ParserT (st :: ZeroBitType) e a = ParserT
  { runParserT# :: ForeignPtrContents
                 -> Addr#
                 -> Addr#
                 -> st
                 -> Res# st e a
  }

type IOMode = State# RealWorld
type PureMode = Proxy# ()
type STMode s = State# s

type Parser = ParserT PureMode
type ParserIO = ParserT IOMode
type ParserST s = ParserT (STMode s)

Of course the diagnostic is getting slightly worse now:

app/Main.hs:186:7: error:
    • Couldn't match type: State# RealWorld
                     with: Proxy# ()
      Expected: Parser () Int
        Actual: ParserIO () ()
    • In the expression: thing
      In an equation for ‘err’: err = thing
    |
186 | err = thing
    |       ^^^^^

The type aliases are sadly not kept.

I have removed the unnecessary uses of unsafeCoerce. We can also write runParserST directly in terms of runRW#, but that is definitely not necessarily, given that unsafePerformIO will do just that, plus some lazy which is actually missing from runST (will find the related GHC bug later today)

Semantically this looks great, at the cost of potentially exposing some magic hashes in rare diagnostics. Given that this seems to also produce Expected: Parser () Int Actual: ParserIO () () I'm personally fine with it.

AndrasKovacs commented 1 year ago

This should be proxy# passing: https://github.com/dminuoso/flatparse-state/blob/main/app/Main.hs#L131 Looks otherwise pretty good to me.