daggaz / json-stream

Simple streaming JSON parser and encoder.
MIT License

JSON streaming quite slow #24

Closed: RobinLanglois closed this 2 years ago

RobinLanglois commented 2 years ago

Hello, I tried this library yesterday and it was quite slow, so I want to make sure I didn't misuse it. My processing with the standard json library ran in approximately 2 minutes for a 2 GB file, but I couldn't fit a bigger file in memory. With json-stream, I processed an 8 GB file in approximately 3 hours.

My JSON file contains only one field, which is a JSON array. I iterated over it with json-stream, but I need to keep a certain amount of the results in memory (flushed periodically), so I found that I needed the "mixed" approach and to iterate over the array like this:

    import json_stream

    with open(filename) as json_file:
        data_json = json_stream.load(json_file)
        messages = data_json["messages"]
        for message in messages.persistent():
            ...  # accumulate each message, flush the accumulated results periodically

Am I doing this wrong?

daggaz commented 2 years ago

Hey,

I'd say this is a known issue, and it's really down to this being a pure-Python module.

The standard Python JSON parser is partially implemented in C (see the _json built-in module). This means it does the string processing in C, which is much faster.

However, json-stream does allow passing in a custom tokenizer to do the actual stream reading, and there is an extension for json-stream that implements a tokenizer in Rust called json-stream-rs-tokenizer.

See the docs here: https://github.com/daggaz/json-stream#custom-tokenizer
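If I remember the API correctly, wiring it in looks roughly like this (an untested sketch; the exact import name comes from the json-stream-rs-tokenizer README, so double-check there):

    import json_stream
    from json_stream_rs_tokenizer import RustTokenizer  # assumed export name, see that project's README

    with open(filename) as json_file:
        # pass the Rust tokenizer to json-stream's load() in place of the default pure-Python one
        data_json = json_stream.load(json_file, tokenizer=RustTokenizer)
        for message in data_json["messages"].persistent():
            ...  # same processing as before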

I'd be very interested to know if this helps your situation - perhaps I will integrate it into the core of the module - so please do report back your findings!

RobinLanglois commented 2 years ago

I tried it and got this error:

    ValueError: Error while parsing at index 600781272: PyErr { type: <class 'ValueError'>, value: ValueError('number too large to fit in target type'), traceback: None }

daggaz commented 2 years ago

Interesting! I'm guessing this is an issue with the Rust implementation of the parser.

"number too large to fit in target type" is a Rust error if I'm not mistaken.

Perhaps @smheidrich can help us out with this one?

You could also raise an issue on the bug tracker for that project.

smheidrich commented 2 years ago

Yes that is indeed in json-stream-rs-tokenizer territory. I've created a ticket (https://github.com/smheidrich/py-json-stream-rs-tokenizer/issues/13) and will look into it, but no guarantees on whether I'll be able to fix it or just document it as a limitation.

smheidrich commented 2 years ago

This has been fixed in json-stream-rs-tokenizer version 0.3.0. The only caveat is that it doesn't work with PyPy due to a limitation of PyO3, but as long as you use CPython it should be fine.

daggaz commented 2 years ago

Amazing @smheidrich!

@RobinLanglois can you verify this fixes your issue?

RobinLanglois commented 2 years ago

Yes, updating json-stream-rs-tokenizer from pip solved my issue! And for the general issue, I'm now processing an 8 GB JSON file in ~30 minutes, which is convenient. Thanks to both of you for helping me! I'm closing this. Regards :grin:

daggaz commented 2 years ago

Glad we could help!

If you would like to support either of us, donations are always welcome :)

daggaz commented 2 years ago

json-stream 2.x now includes the faster Rust tokenizer as standard, with graceful degradation to the slower pure-Python tokenizer on platforms not supported by json-stream-rs-tokenizer.
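In practice that means plain usage, roughly like the sketch below, should get the fast path automatically when json-stream-rs-tokenizer is installed (the filename is just a placeholder):

    import json_stream

    with open("data.json") as json_file:  # placeholder filename
        data = json_stream.load(json_file)  # uses the Rust tokenizer when available, pure Python otherwise
        for message in data["messages"].persistent():
            ...  # process each message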