daggaz / json-stream

Simple streaming JSON parser and encoder.
MIT License
122 stars 18 forks source link

added buffered reading to tokenizer #46

Open daggaz opened 1 year ago

daggaz commented 1 year ago

In response to #45

smheidrich commented 1 year ago

As you already mentioned in https://github.com/daggaz/json-stream/issues/45#issuecomment-1594575818, this partly intersects with the changes that would be needed to fix #30. To be more precise, not doing any buffering is the most simple (but also least performant, at least in Rust) way of fixing #30, as I mentioned here, and my suggested fix for #30 (https://github.com/smheidrich/py-json-stream-rs-tokenizer/pull/50) uses it when cursor positions in sync with the tokenization progress are requested but the underlying Python stream isn't seekable.

So for the Rust tokenizer I'm currently thinking about merging https://github.com/smheidrich/py-json-stream-rs-tokenizer/pull/50 first (except with the new constructor parameter correct_cursor introduced there as keyword-only) because it would facilitate both the introduction of user-defined buffering like here and the fix for #30. But it would also make the two tokenizers' code structures diverge a bit more, as buffering in https://github.com/smheidrich/py-json-stream-rs-tokenizer/pull/50 is handled in a separate struct instead of directly as part of the tokenizer. I guess some divergence has to be expected anyway, though, so not sure how bad this would be.

daggaz commented 1 year ago

@smheidrich So, I changed the github build process to run the whole test suite both with and without the rust tokenizer, and I discovered a hidden bug in the python tokenizer where it wasn't completing the state machine if the buffer was empty and the last state didn't advance the stream.

This was fixed in this commit.

If you've already ported this code, you will also have ported this bug!

daggaz commented 1 year ago

I have also, in response to your comment in the rust repo committed a proposed new interface for rust_tokenizer_or_raise()

smheidrich commented 1 year ago

All right so I've finally gotten around to writing the parallel PR to this for the Rust tokenizer: https://github.com/smheidrich/py-json-stream-rs-tokenizer/pull/87

I tested it locally with the test case you modified here and there are no errors so I guess it basically works? Although there are a lot of different cases depending on e.g. whether the underlying Python stream returns strings or bytes, whether it's seekable, etc., so I might write another test on my end for those.

UPDATE: Tests on my side are done now as well.