apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.37k stars 1.2k forks source link

Deduplicate and standardize deserialization logic for streams #13412

Closed alihan-synnada closed 1 week ago

alihan-synnada commented 1 week ago

Which issue does this PR close?

None

Rationale for this change

Part of #13411

This PR implements a common Decoder trait, the BatchDeserializer trait and the DecoderDeserializer struct as described in the issue, along with CsvDecoder and JsonDecoder as arrow-csv and arrow-json Decoders are readily available.

What changes are included in this PR?

Note: There are about 290 lines of new tests, so it is about 250 lines of actual code.

Are these changes tested?

Yes, the changes are covered by new tests added to the CSV and JSON modules.

Are there any user-facing changes?

No

ozankabak commented 1 week ago

We discussed this with @alihan-synnada and it looks good to me, but it'd be great to get community review. cc @alamb

ozankabak commented 1 week ago

One thing I noticed is that https://github.com/apache/datafusion/issues/13411 talks about the arrow and avro as well. Do you plan to update them in a follow on PR?

Yes, indeed. Not an immediate priority but we would like to tidy up the read side.

Finally, the ticket also mentions parquet -- I think it will be hard to update the parquet reader (or any columnar file format) to use the DecodeTrait. The parquet reader itself drives what IO to do (aka what byte ranges and when) rather than the more row oriented format.

I agree -- Parquet will probably stay separate for the time being.