AndrasKovacs / flatparse

Fast parsing from bytestrings
MIT License
146 stars 12 forks source link

Parse bytestring builder (return "write bytes to given address" IO action) #39

Closed raehik closed 1 year ago

raehik commented 1 year ago

I have found a use case (i.e. it seems to be a fairly sensible type class instance somewhere) for parsing to a primitive bytestring builder. It would allow parse->serialize steps to be performed with zero intermediate copying. Perhaps you parse and edit some record, only editing select fields, and you figure the extra intermediate bytestrings (which would be copied from the parse input into their own bytestrings, then copied from again when serializing) are a waste of cycles.

I'm currently thinking of something like this (messy):

-- for a memcpy FFI, but could (should?) redefine
import qualified Data.ByteString.Internal as B

-- returns build action and length
-- can be wrapped into any bytestring builder
takeRestPrim :: ParserT st e (Addr# -> IO (), Int)
takeRestPrim = ParserT \fp eob s st ->
  let n# = minusAddr# eob s
  in  OK# st (\sTo -> memcpy sTo s n#, I# n#) eob
{-# inline takeRestPrim #-}

memcpy :: Addr# -> Addr# -> Int# -> IO ()
memcpy sFrom sTo len# = B.memcpy (Ptr sFrom) (Ptr sTo) (fromIntegral (I# len#))

-- ... and such for the other bytestring parsers like `take`

Might something like this be sensible to add?

AndrasKovacs commented 1 year ago

This takeRestPrim is not memory safe because it saves an Addr# in a closure whose underlying data can get garbage collected subsequently. If we save a ForeignPtr instead, that means that the builder closure saves exactly the same data that is returned by takeRest, namely a ByteString. I don't see why returning a closure would be preferred.

raehik commented 1 year ago

Thank you. I had a feeling I was going wrong somewhere. But as you say, the take family don't do any copying anyway, so this is fairly unhelpful as well as unsafe. I had assumed we copy the bytes out into a new buffer. Instead we directly construct a ByteString using the parser's ForeignPtrContents. Does Haskell somehow know that it may garbage collect parts of the input string, but not the parts that were taken? (Also, any chance you know of resources to learn about garbage collection and memory manipulation in GHC?)

Closing since not useful.

AndrasKovacs commented 1 year ago

A ByteString has a foreign malloc-d array which lives while its ForeignPtrContents lives. Only the whole array can be freed. This implies that any slice of a ByteString keeps the whole original ByteString alive. The touch# and keepAlive# primops from GHC.Prim can be used to ensure that an object lives at a certain point in sequential State#-passing execution.

We use touch# in bytestring to make sure that the backing data lives until its processing is finished. The unsafeUseAsCString in runParser also uses touch# or keepAlive# in the source.

raehik commented 1 year ago

Thanks a bunch for the explanation. Does that mean takeUnsafe#

takeUnsafe# :: Int# -> ParserT st e B.ByteString
takeUnsafe# n# = ParserT \fp eob s st ->
    case n# <=# minusAddr# eob s of
      1# -> OK#   st (B.PS (ForeignPtr s fp) 0 (I# n#)) (plusAddr# s n#)
      _  -> Fail# st

keeps the input bytestring alive while the parsed "slice" is alive? Is this a good idea? Should we also provide copying parsers? (I should note that I wrote these functions, and didn't understand the implication of using the ForeignPtrContents in this way at the time.)

AndrasKovacs commented 1 year ago

Data.ByteString.copy already exists for this purpose.

raehik commented 1 year ago

I see. Thanks for all this. Might I document that these functions will keep the input bytestring alive? I find it unintuitive (and I worry it could lead to unexpected memory usage for some cases?)

AndrasKovacs commented 1 year ago

Sure.