TkTech / pysimdjson

Python bindings for the simdjson project.
https://pysimdjson.tkte.ch

This parser can't support a document that big #70

Closed dclong closed 3 years ago

dclong commented 3 years ago

I encountered the following error when parsing a huge (>10 GB) JSON file.

image

lemire commented 3 years ago

How big is report.json?

dclong commented 3 years ago

About 13 GB.

lemire commented 3 years ago

The underlying library will refuse to parse JSON documents larger than 4 GB. It will support large inputs, but only if they are made of a stream of JSON documents (e.g., ndjson).

Ingesting a single 13 GB document all at once in a DOM tree is a performance and interoperability anti-pattern. I recommend against it.

I cannot speak for pysimdjson, but I expect that it works as intended.
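For illustration, a minimal sketch of the stream-of-documents approach: if the data can be re-exported as newline-delimited JSON (one small document per line), each line can be parsed on its own and no single document comes near the 4 GB limit. The helper names `parse_line` and `iter_ndjson` are hypothetical, as is the assumption that the 13 GB report can be split this way. The sketch uses pysimdjson's `Parser.parse(..., recursive=True)` when the package is installed and falls back to the standard `json` module otherwise.

```python
import io
import json

try:
    import simdjson  # pysimdjson, if installed

    _parser = simdjson.Parser()

    def parse_line(line: bytes):
        # recursive=True converts the result to plain Python objects,
        # so the parser can be reused safely for the next line.
        return _parser.parse(line, recursive=True)

except ImportError:

    def parse_line(line: bytes):
        # Fallback so the sketch still runs without pysimdjson.
        return json.loads(line)


def iter_ndjson(stream):
    """Yield one parsed document per non-empty line of a binary stream."""
    for raw in stream:
        line = raw.strip()
        if line:
            yield parse_line(line)


# Usage with an in-memory stream; a real file would be open(path, "rb").
sample = io.BytesIO(b'{"id": 1}\n\n{"id": 2, "tags": ["a"]}\n')
for record in iter_ndjson(sample):
    print(record["id"])
```

Because only one line is held in memory at a time, peak memory depends on the largest single record, not on the total file size.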

TkTech commented 3 years ago

As Lemire said, this isn't supported in the underlying library. When/if simdjson gets streaming support, we'll definitely implement it.