Open Asura7969 opened 1 year ago
I think there have been some improvements in the avro support recently (in datafusion 33 I think)
https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+avro+is%3Aclosed
For example, https://github.com/apache/arrow-datafusion/pull/7663 looks pretty similar to his request. There was also some talk about adding avro support upstream in arrow: https://github.com/apache/arrow-rs/issues/4886
Perhaps @sarutak has some wisdom to add here
Thank you for letting me know! I'm interested in this topic. For now, I don't have enough time to look into but will do it this weekend.
@Asura7969 Sorry for the late confirmation. I understand the problem you are talking about is DataFusion doesn't support nullable top-level records.
@alamb I'd like to discuss this problem. Avro allows top-level records nullable. So, the following records can be allowed in Avro.
{"x": "abc", "y": 100}
{"x": "def", "y": 200}
null
{"x": "ghi", "y": 300}
Notice that the third record is not {"x": null, "y": null}
but the record itself is null
.
I think we would have the following options to treat the nullable top-level records.
BTW, Apache Spark has a similar feature that creates a table from an Avro records but it doesn't currently support nullable top-level nullable.
Hi @sarutak -- @tustvold mentioned something similar today as part of his work on https://github.com/apache/arrow-rs/issues/4886. I think Arrow also supports top level nulls in StructArray
so in my opinion another option might be:
4.Allow top level null records and represent them as a Null
entry in StructArray.
This is in some ways a special case of the more general issue of supporting avro schema that don't contain a Record as the root. I recently did something similar for arrow_json in https://github.com/apache/arrow-rs/pull/4911 which is likely the approach I am going to take in https://github.com/apache/arrow-rs/issues/4886
Is your feature request related to a problem or challenge?
I am now integrating incubator-paimon(it is a streaming data lake platform), when reading the avro file, an exception message will appear: expected avro schema to be a record, because
AvroArrowArrayReader
only supportsAvroSchema::Record
, but the avro file format of paimon is Union type(code here)Describe the solution you'd like
It would be better if the parsing format could be implemented by the user. The default implementation is still the current way, no problem.
Describe alternatives you've considered
My current solution: here and this
Additional context
No response