AndrasKovacs / flatparse

Fast parsing from bytestrings
MIT License
145 stars 11 forks source link

Provide buffer parser i.e. `runParserPtr :: ParserIO e a -> Ptr Word8 -> Int -> _` #57

Open raehik opened 6 months ago

raehik commented 6 months ago

Brief preamble. I'm using flatparse for some binary file parsing. I'm parsing a filetable, where I know the precise length. I'm operating on Ptrs because I want the core to be source-agnostic (whether working on bytestrings directly, or file handles etc.). I could copy the bytes into a bytestring and parse, but I figure, why don't I parse an address directly? To my (limited) knowledge, it would seem sensible, assuming the lifetime is handled externally (e.g. using withForeignPtr).

Assuming good intentions (no lying about pointers or their length), would a runParserPtr :: ParserIO e a -> Ptr Word8 -> Int -> _ be useful and safe? With lots of warnings on pitfalls of course.

I can provide some example code if useful.

raehik commented 6 months ago

Something I note is that since flatparse carries around a ForeignPtrContents in its internal state, we have to come up with one in such cases. FinalPtr :: ForeignPtrContents exists and seems to be intended for such cases (though I've not seen or used it before).

raehik commented 5 months ago

The problem I've come against here is that Result e a has OK a B.ByteString. runParserPtr can't return a ByteString because it doesn't have "memory rights" to the bytes it's parsing. We can only return an Int indicating the number of bytes left unparsed.

On a related note, what is the reason for ferrying a ForeignPtrContents along in the parser type? ~Flicking through code, I don't see it used in parser combinators.~ Can we not use something like withForeignPtr in the parser runner to keep the inner Ptr/Addr# in scope? Then I could imagine adapting Result to enable returning an Int in these such cases.

Edit: I see that the ForeignPtrContents is used in some combinators to enable non-copying bytestring parsing e.g. byteStringOf. So any lying to GHC (like attaching a FinalPtr empty finalizer to an arbitrary Ptr) would be very unsafe. I'll just copy my Ptr Word8 to a fresh ByteString before parsing.

I might try writing a buffer parser parallel to flatparse but without carrying around the ForeignPtrContents.

AndrasKovacs commented 5 months ago

IIRC I added ForeignPtrContents specifically to enable ByteString creation. I found it convenient and observed that it's reliably eliminated by GHC when it's not used.

raehik commented 5 months ago

As I understand it's convenient because all the ByteStrings we generate where we use the ForeignPtrContents are just references to parts of our input, rather than being copied. I think that's usually OK (though could be a source of memory leakage?). But in the case where we're parsing a Ptr Word8, where all we know is 1) it's in scope and 2) it's readable, we can't do that as we don't have any control over the input's lifecycle. (I'm new to low-level Haskell and I might be wrong about this.)

It's not a complaint with the library since we advertise it as a ByteString parser, not a buffer one. But it's an interesting little hidden detail in the library design.

AndrasKovacs commented 5 months ago

How can it happen that we have a Ptr and don't know anything about its provenance? I would say that any such situation is a bug. And if we know where the Ptr comes from we can create a finalizer for it.

raehik commented 5 months ago

What about parsing from a buffer provided by allocaBytes :: Int -> (Ptr a -> IO b) -> IO b? I would be happy to use FinalPtr for the finalizer. The problem is that some of our combinators use the finalizer to create "free" bytestrings (without copying). If I understand correctly, that wouldn't work in this case.

AndrasKovacs commented 5 months ago

It would work with as much safety guarantee as allocaBytes in general, where we shouldn't let the Ptr escape, and there's no runtime or compile time check to enforce this. I can imagine situations where it's convenient to pack FinalPtr in bytestrings and use them while we're still under the alloca. Users who are parsing from alloca-d Ptr already do manual memory management, so why not let them manually scope bytestrings.

That said, if we add runParserPtr, the docs for the bytestring operations should mention the lifetime issues and there should be also copying variants of bytestring creation for convenience.

raehik commented 5 months ago

I can imagine situations where it's convenient to pack FinalPtr in bytestrings and use them while we're still under the alloca.

You're right, now I understand that I'm very happy with the internal parser design. Thanks for your patience and explanations :) Could I start a PR adding runParserPtr? Should it take a ForeignPtrContents, or should we always fill in with FinalPtr?

AndrasKovacs commented 5 months ago

Btw, is there a standard function that turns Ptr into an "unsafe" bytestring with FinalPtr? If there's such thing I don't necessarily see need for runParserPtr.

raehik commented 5 months ago

I can't find any Ptr a -> Int -> ByteString on Hoogle. I realize now I could pack a fake bytestring without much work. But it feels like a convenient place to document lifetime issues for anyone trying to do similar things. And I like to avoid importing Data.ByteString.Internal in my code where possible.

AndrasKovacs commented 5 months ago

I guess we can just go with runParserPtr then.