Open scovich opened 1 month ago
Sorry this one managed to slip through. Adding num_buffered_rows and has_partial_record seems perfectly reasonable to me.
take
Hi @scovich @tustvold, I am currently looking into this.
Do we need to duplicate the changes implemented in https://github.com/delta-io/delta-kernel-rs/pull/373/files in arrow-rs, along with the num_buffered_rows and has_partial_record functions?
Good question. We're happy to tweak the delta-kernel-rs code to match a new arrow-rust API, as long as the new API covers the use case. I tried to factor that out in the "details" sections of this issue description.
If you refer to the parse_json_impl method in that PR, it corresponds to my comment in this issue's description:

It would be even nicer if the parse_json method could just become part of either arrow-json or arrow-compute, if parsing strings to JSON is deemed a general operation that deserves its own API call.
Seems like the low-level support can go in independently of a decision to expose a new public parse_json method in arrow-compute or arrow-json?
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I have a nullable StringArray column that contains JSON object literals. I need to JSON-parse the column into a StructArray of values following some schema, and NULL input values should become NULL output values.

This can almost be implemented using arrow_json::reader::ReaderBuilder::build_decoder and then feeding in the bytes of each string. But the decoder has no concept of record separators in the input stream, so invalid inputs such as blank strings (""), truncated records ("{\"a\":1"), or multiple objects ("{\"a\": 1} {\"a\": 2}") will confuse the decoding process. If we're lucky it will produce the wrong number of records, but an adversarial input could easily seem to produce the correct number of records even though no single input string represented a valid JSON object. Thus, if I want such safety, I'm forced to parse each string as its own RecordBatch (which can then be validated independently) and then concatenate them all. Ugly, error-prone, and inefficient:
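A minimal sketch of that per-row workaround, using only the current arrow-json Decoder API (the helper name, error messages, and the NULL handling are illustrative, not from the original issue, and the sketch assumes all fields in the target schema are nullable):

```rust
use arrow_array::{new_null_array, RecordBatch, StringArray};
use arrow_json::reader::ReaderBuilder;
use arrow_schema::{ArrowError, SchemaRef};
use arrow_select::concat::concat_batches;

/// Parse each JSON string into its own single-row RecordBatch, validate it
/// in isolation, then concatenate everything at the end.
fn parse_json_per_row(json: &StringArray, schema: SchemaRef) -> Result<RecordBatch, ArrowError> {
    let mut batches = Vec::with_capacity(json.len());
    for value in json.iter() {
        let batch = match value {
            Some(s) => {
                // Fresh decoder per string, so each string is parsed independently.
                let mut decoder = ReaderBuilder::new(schema.clone()).build_decoder()?;
                decoder.decode(s.as_bytes())?;
                let batch = decoder
                    .flush()?
                    .ok_or_else(|| ArrowError::JsonError("empty JSON string".into()))?;
                // Each string must have produced exactly one record.
                if batch.num_rows() != 1 {
                    return Err(ArrowError::JsonError(format!(
                        "expected a single JSON object, got {} records",
                        batch.num_rows()
                    )));
                }
                batch
            }
            // NULL input: emit one row of all-NULL columns (struct-level
            // nullability is glossed over in this sketch).
            None => RecordBatch::try_new(
                schema.clone(),
                schema
                    .fields()
                    .iter()
                    .map(|f| new_null_array(f.data_type(), 1))
                    .collect(),
            )?,
        };
        batches.push(batch);
    }
    concat_batches(&schema, &batches)
}
```

Every input string pays for its own decoder, its own tape, and its own single-row RecordBatch before the final concat, which is where the inefficiency comes from.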
Describe the solution you'd like

Ideally, the JSON Decoder could define public methods that say how many buffered rows the decoder has, and whether the decoder is currently at a record boundary or not. This is essentially a side-effect-free version of the same check that Tape::finish already performs when Decoder::flush is called:
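The additions being asked for might look roughly like this (a sketch of possible signatures only; the method bodies assume hypothetical internal accessors on the tape decoder and are not the real implementation):

```rust
// Inside arrow-json, on the existing Decoder type.
impl Decoder {
    /// Proposed: number of complete records currently buffered and ready to flush.
    pub fn num_buffered_rows(&self) -> usize {
        self.tape_decoder.num_buffered_rows() // hypothetical internal accessor
    }

    /// Proposed: true if the bytes decoded so far end in the middle of a record,
    /// i.e. the decoder is not currently sitting on a record boundary.
    pub fn has_partial_record(&self) -> bool {
        self.tape_decoder.has_partial_record() // hypothetical internal accessor
    }
}
```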
That way, the above implementation becomes a bit simpler and a lot more efficient:
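With methods like those, the whole column can be fed into a single Decoder and validated string by string, flushing only once at the end. A sketch under the same assumptions (the proposed method names, the "{}" placeholder for NULL inputs, and the batch-size handling are all illustrative):

```rust
use arrow_array::{RecordBatch, StringArray};
use arrow_json::reader::ReaderBuilder;
use arrow_schema::{ArrowError, SchemaRef};

/// One shared decoder, one flush; each input string must advance the decoder
/// by exactly one complete record.
fn parse_json(json: &StringArray, schema: SchemaRef) -> Result<RecordBatch, ArrowError> {
    let mut decoder = ReaderBuilder::new(schema.clone())
        .with_batch_size(json.len())
        .build_decoder()?;
    for (i, value) in json.iter().enumerate() {
        // NULL inputs are decoded as "{}" here just to keep row counts aligned;
        // a real implementation would track them and null out those output rows.
        decoder.decode(value.unwrap_or("{}").as_bytes())?;
        // The proposed methods make per-string validation cheap: no flush, no
        // throwaway RecordBatch, just a row count and a boundary check.
        if decoder.num_buffered_rows() != i + 1 || decoder.has_partial_record() {
            return Err(ArrowError::JsonError(format!(
                "input at row {i} is not a single complete JSON object"
            )));
        }
    }
    decoder
        .flush()?
        .ok_or_else(|| ArrowError::JsonError("no rows decoded".into()))
}
```

The per-string decoders, the single-row batches, and the final concat from the workaround above all disappear.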
It would be even nicer if the parse_json method could just become part of either arrow-json or arrow-compute, if parsing strings to JSON is deemed a general operation that deserves its own API call.
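If it were exposed publicly, the entry point might be no more than something like the following (purely illustrative; neither the name nor the signature is settled anywhere in this issue):

```rust
use arrow_array::{ArrayRef, StringArray};
use arrow_schema::{ArrowError, SchemaRef};

/// Hypothetical public kernel: parse a string column of JSON object literals
/// into a (nullable) StructArray-backed ArrayRef following `schema`.
pub fn parse_json(json: &StringArray, schema: SchemaRef) -> Result<ArrayRef, ArrowError> {
    unimplemented!("illustrative signature only")
}
```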
Describe alternatives you've considered

Tried shoving each string manually into a Decoder to produce a single RecordBatch, but the above-mentioned safety issues made it very brittle (wrong row counts, incorrect values, etc). Currently using the ugly/slow solution mentioned earlier, which creates and validates one RecordBatch per row before concatenating them all into a single RecordBatch.