Closed alihan-synnada closed 1 week ago
We discussed this with @alihan-synnada and it looks good to me, but it'd be great to get community review. cc @alamb
One thing I noticed is that https://github.com/apache/datafusion/issues/13411 also covers the arrow and avro readers. Do you plan to update them in a follow-on PR?
Yes, indeed. Not an immediate priority but we would like to tidy up the read side.
Finally, the ticket also mentions parquet -- I think it will be hard to update the parquet reader (or any columnar file format) to use the DecodeTrait. The parquet reader itself decides what IO to do (which byte ranges to read, and when), unlike the more row-oriented formats.
I agree -- Parquet will probably stay separate for the time being.
Which issue does this PR close?
None
Rationale for this change
Part of #13411
This PR implements a common `Decoder` trait, the `BatchDeserializer` trait and the `DecoderDeserializer` struct as described in the issue, along with `CsvDecoder` and `JsonDecoder`, as `arrow-csv` and `arrow-json` `Decoder`s are readily available.
What changes are included in this PR?
Note: There are about 290 lines of new tests, so it is about 250 lines of actual code.
- Add `BatchDeserializer` as a common interface.
  - `digest` consumes the input in chunks
  - `next` attempts to deserialize the digested data and returns a `DeserializerOutput`, which is one of `RecordBatch`, `RequiresMoreData` or `InputExhausted`
  - `finish` signals the end of the input stream
- Add a `Decoder` trait
  - Provides a common interface for the existing `Decoder`s
  - Implement `Decoder` for `CsvDecoder` and `JsonDecoder` by forwarding methods
- Add `DecoderDeserializer` and implement `BatchDeserializer` for formats that have a `Decoder` implementation.
- Add a `deserialize_stream` function to deduplicate the deserialization logic
Are these changes tested?
Yes, the changes are covered by new tests added to the CSV and JSON modules.
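For readers who want a feel for the `digest` / `next` / `finish` contract that such tests exercise, here is a minimal, dependency-free sketch. It is not the code from this PR: the method signatures are assumed, `Batch` is a stand-in for arrow's `RecordBatch`, and `LineDeserializer` and `deserialize_chunks` are hypothetical names invented for the illustration.

```rust
/// Stand-in for arrow's `RecordBatch` (assumption): just a list of decoded lines.
type Batch = Vec<String>;

/// The three outcomes of a `next` call named in the PR description.
#[derive(Debug, PartialEq)]
enum DeserializerOutput {
    RecordBatch(Batch),
    RequiresMoreData,
    InputExhausted,
}

/// Assumed shape of the interface; the real signatures may differ.
trait BatchDeserializer {
    /// Buffers a chunk of input and returns the number of bytes consumed.
    fn digest(&mut self, chunk: &[u8]) -> usize;
    /// Tries to produce a batch from the data digested so far.
    fn next(&mut self) -> DeserializerOutput;
    /// Signals that no more input will arrive.
    fn finish(&mut self);
}

/// Hypothetical toy deserializer: emits batches of up to `batch_size`
/// newline-terminated lines. `batch_size` must be at least 1.
struct LineDeserializer {
    buffer: Vec<u8>,
    batch_size: usize,
    finished: bool,
}

impl BatchDeserializer for LineDeserializer {
    fn digest(&mut self, chunk: &[u8]) -> usize {
        self.buffer.extend_from_slice(chunk);
        chunk.len()
    }

    fn next(&mut self) -> DeserializerOutput {
        // Offsets just past each newline, for at most `batch_size` lines.
        let ends: Vec<usize> = self
            .buffer
            .iter()
            .enumerate()
            .filter(|(_, b)| **b == b'\n')
            .map(|(i, _)| i + 1)
            .take(self.batch_size)
            .collect();
        let cut = if ends.len() == self.batch_size {
            ends[self.batch_size - 1] // a full batch is buffered
        } else if self.finished && !self.buffer.is_empty() {
            self.buffer.len() // end of input: flush the remainder
        } else if self.finished {
            return DeserializerOutput::InputExhausted;
        } else {
            return DeserializerOutput::RequiresMoreData;
        };
        let taken: Vec<u8> = self.buffer.drain(..cut).collect();
        let text = String::from_utf8_lossy(&taken);
        DeserializerOutput::RecordBatch(text.lines().map(str::to_owned).collect())
    }

    fn finish(&mut self) {
        self.finished = true;
    }
}

/// Hypothetical driver: feeds each chunk, drains every batch that becomes
/// available, then signals end of input and drains what is left.
fn deserialize_chunks<D: BatchDeserializer>(de: &mut D, chunks: &[&[u8]]) -> Vec<Batch> {
    let mut out = Vec::new();
    for chunk in chunks {
        de.digest(chunk);
        while let DeserializerOutput::RecordBatch(batch) = de.next() {
            out.push(batch);
        }
    }
    de.finish();
    while let DeserializerOutput::RecordBatch(batch) = de.next() {
        out.push(batch);
    }
    out
}
```

With `batch_size: 2`, feeding the chunks `"a,1\nb,"`, `"2\nc,3\n"` and `"d,4"` yields two batches of two lines each. The first chunk alone only produces `RequiresMoreData`, because a row is split across the chunk boundary, and the unterminated last row is emitted only after `finish` is called.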
Are there any user-facing changes?
No