Deduplicate and standardize deserialization logic for streams

alihan-synnada commented 1 week ago

Which issue does this PR close?

None

Rationale for this change

Part of #13411

This PR implements a common Decoder trait, the BatchDeserializer trait and the DecoderDeserializer struct as described in the issue, along with CsvDecoder and JsonDecoder as arrow-csv and arrow-json Decoders are readily available.

What changes are included in this PR?

Note: There are about 290 lines of new tests, so it is about 250 lines of actual code.

Add BatchDeserializer as a common interface.
- digest consumes the input in chunks
- next attempts to deserialize the digested data and returns a DeserializerOutput which is either a RecordBatch, RequiresMoreData and InputExhausted
- finish signals the end of the input stream
Add Decoder trait
- Mimics arrow-json and arrow-csv's Decoders
Implement Decoder for CsvDecoder and JsonDecoder by forwarding methods
Add DecoderDeserializer and implement BatchDeserializer for formats that have a Decoder implementation.
Add deserialize_stream function to deduplicate the deserialization logic

Are these changes tested?

Yes, the changes are covered by new tests added to the CSV and JSON modules.

Are there any user-facing changes?

No

ozankabak commented 1 week ago

We discussed this with @alihan-synnada and it looks good to me, but it'd be great to get community review. cc @alamb

ozankabak commented 1 week ago

One thing I noticed is that https://github.com/apache/datafusion/issues/13411 talks about the arrow and avro as well. Do you plan to update them in a follow on PR?

Yes, indeed. Not an immediate priority but we would like to tidy up the read side.

Finally, the ticket also mentions parquet -- I think it will be hard to update the parquet reader (or any columnar file format) to use the DecodeTrait. The parquet reader itself drives what IO to do (aka what byte ranges and when) rather than the more row oriented format.

I agree -- Parquet will probably stay separate for the time being.

apache / datafusion