Closed: RobinLanglois closed this issue 2 years ago
Hey,
I'd say this is a known issue, and it really comes down to this being a pure-Python module.
The standard library JSON parser is partially implemented in C (see the _json built-in module). This means it does the string processing in C, which is much faster.
However, json-stream does allow passing in a custom tokenizer to do the actual stream reading, and there is an extension for json-stream that implements a tokenizer in Rust, called json-stream-rs-tokenizer.
See the docs here: https://github.com/daggaz/json-stream#custom-tokenizer
I'd be very interested to know if this helps your situation - perhaps I will integrate it into the core of the module - so please do report back your findings!
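For reference, usage looks roughly like this (a minimal sketch: data.json and the results field are placeholders, and the drop-in load helper is the one the json-stream-rs-tokenizer README describes):

```python
# Sketch: json-stream-rs-tokenizer ships a drop-in replacement for
# json_stream.load() that does its tokenizing in Rust.
from json_stream_rs_tokenizer import load

with open("data.json") as f:  # placeholder file name
    data = load(f)
    for item in data["results"]:  # placeholder field name
        print(item)
```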
I tried it and I faced this error:
ValueError: Error while parsing at index 600781272: PyErr { type: <class 'ValueError'>, value: ValueError('number too large to fit in target type'), traceback: None }
Interesting! I'm guessing this is an issue with the Rust implementation of the parser.
"number too large to fit in target type" is a Rust error, if I'm not mistaken.
Perhaps @smheidrich can help us out with this one?
You could also raise an issue on the bug tracker for that project.
Yes, that is indeed in json-stream-rs-tokenizer territory. I've created a ticket (https://github.com/smheidrich/py-json-stream-rs-tokenizer/issues/13) and will look into it, but no guarantees on whether I'll be able to fix it or just document it as a limitation.
This has been fixed in json-stream-rs-tokenizer version 0.3.0. The only caveat is that it doesn't work with PyPy due to a limitation of PyO3, but as long as you use CPython it should be fine.
Amazing @smheidrich!
@RobinLanglois can you verify this fixes your issue?
Yes, updating json-stream-rs-tokenizer from pip solved my issue!
And for the general issue, I'm now processing an 8 GB JSON file in ~30 minutes, which is convenient.
Thanks to both of you for helping me! I'm closing this.
Regards :grin:
Glad we could help!
If you would like to support either of us, donations are always welcome :)
json-stream 2.x now includes the faster Rust tokenizer as standard, with graceful degradation to the slower pure-Python tokenizer on platforms not supported by json-stream-rs-tokenizer.
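In practice that means a plain call now gets the fast path with no code changes; roughly (a sketch, with a placeholder file name):

```python
import json_stream

# json-stream 2.x: this call automatically uses the Rust tokenizer when
# json-stream-rs-tokenizer is installed, and quietly falls back to the
# pure-Python tokenizer on platforms it doesn't support.
with open("data.json") as f:  # placeholder file name
    data = json_stream.load(f)
```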
Hello, I tried this library yesterday and it was quite slow, so I want to make sure I didn't misuse it. My processing with the json standard library ran in approximately 2 minutes for a 2 GB file, but I couldn't fit a bigger file in memory. With json-stream, I processed an 8 GB file in approximately 3 hours.

My JSON file contains only one field, which is a JSON array. I iterated over it with json-stream, but I need to store a certain amount of the results in memory (which are flushed periodically), so I found that I needed the "mixed" approach and iterated over the array (see the sketch below).

Am I doing this wrong?
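Something along these lines (a sketch of my loop: the items field name, the batch size, and the flush() helper are placeholders; persistent() is the mixed-mode call from the json-stream docs):

```python
import json_stream

def flush(batch):
    ...  # placeholder for the periodic processing/flush step

batch = []
with open("big.json") as f:  # placeholder file name
    data = json_stream.load(f)  # transient root object
    # Mixed mode: the array itself streams, but each element is made
    # persistent so it can be kept in memory after the stream moves on.
    for record in data["items"].persistent():
        batch.append(record)
        if len(batch) >= 10_000:  # placeholder flush threshold
            flush(batch)
            batch = []
if batch:
    flush(batch)
```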