apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Reading Avro files supports other types #7828

Open Asura7969 opened 1 year ago

Asura7969 commented 1 year ago

Is your feature request related to a problem or challenge?

I am integrating incubator-paimon (a streaming data lake platform). When reading an Avro file, an exception appears: expected avro schema to be a record. This happens because AvroArrowArrayReader only supports AvroSchema::Record, but Paimon's Avro files use a Union type as the top-level schema (code here).

Describe the solution you'd like

It would be better if users could supply their own parsing implementation. The current behavior could remain the default implementation.

Describe alternatives you've considered

My current solution: here and this

Additional context

No response

alamb commented 1 year ago

I think there have been some recent improvements in the Avro support (in DataFusion 33, I believe)

https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+avro+is%3Aclosed

For example, https://github.com/apache/arrow-datafusion/pull/7663 looks pretty similar to this request. There was also some talk about adding Avro support upstream in arrow: https://github.com/apache/arrow-rs/issues/4886

Perhaps @sarutak has some wisdom to add here

sarutak commented 1 year ago

Thank you for letting me know! I'm interested in this topic. I don't have enough time to look into it right now, but I will do so this weekend.

sarutak commented 1 year ago

@Asura7969 Sorry for the late confirmation. I understand the problem you are describing is that DataFusion doesn't support nullable top-level records.

@alamb I'd like to discuss this problem. Avro allows top-level records to be nullable, so the following sequence of records is valid in Avro.

{"x": "abc", "y": 100}
{"x": "def", "y": 200}
null
{"x": "ghi", "y": 300}

Notice that the third record is not {"x": null, "y": null}; the record itself is null. I think we have the following options for handling nullable top-level records.

  1. Allow nullable top-level records but skip null records
  2. Disallow nullable top-level records. Nullable top-level records should be converted to non-nullable beforehand.
  3. Introduce a configuration option that controls whether nullable top-level records are allowed.
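
To make the three options concrete, here is a minimal sketch in plain Rust (standard library only). It does not use the DataFusion or Avro APIs: `Row`, `NullRecordPolicy`, `apply_policy`, and `sample_records` are hypothetical names invented for illustration, and a null top-level record is modeled as `None`. Choosing the policy at runtime corresponds to option 3.

```rust
/// Hypothetical stand-in for one decoded Avro record with fields `x` and `y`.
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub x: String,
    pub y: i64,
}

/// Hypothetical policy type; picking a variant at runtime models option 3.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum NullRecordPolicy {
    /// Option 1: accept nullable top-level records and drop the null ones.
    Skip,
    /// Option 2: treat any null top-level record as an error.
    Reject,
}

/// Applies the chosen policy to a sequence of possibly-null records.
pub fn apply_policy(
    records: Vec<Option<Row>>,
    policy: NullRecordPolicy,
) -> Result<Vec<Row>, String> {
    match policy {
        // `flatten` keeps only the `Some` entries.
        NullRecordPolicy::Skip => Ok(records.into_iter().flatten().collect()),
        // Collecting into `Result` stops at the first null record.
        NullRecordPolicy::Reject => records
            .into_iter()
            .enumerate()
            .map(|(i, r)| r.ok_or_else(|| format!("null top-level record at index {}", i)))
            .collect(),
    }
}

/// The four example records from this comment; the third one is null.
pub fn sample_records() -> Vec<Option<Row>> {
    vec![
        Some(Row { x: "abc".to_string(), y: 100 }),
        Some(Row { x: "def".to_string(), y: 200 }),
        None,
        Some(Row { x: "ghi".to_string(), y: 300 }),
    ]
}
```

With `NullRecordPolicy::Skip` the sample yields three rows; with `NullRecordPolicy::Reject` it returns an error that names the position of the null record.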

BTW, Apache Spark has a similar feature that creates a table from Avro records, but it doesn't currently support nullable top-level records.

alamb commented 1 year ago

Hi @sarutak -- @tustvold mentioned something similar today as part of his work on https://github.com/apache/arrow-rs/issues/4886. I think Arrow also supports top level nulls in StructArray so in my opinion another option might be:

  4. Allow top-level null records and represent them as a null entry in StructArray.
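
As a rough illustration of this option, the sketch below models the idea in plain Rust rather than with the arrow crate: `StructColumns`, `build_columns`, and `null_count` are hypothetical names, and the `validity` vector stands in for the struct-level validity bitmap. A null record keeps its row slot (so row counts and offsets are preserved), the child columns get placeholder values at that position, and only the validity entry marks the row as null.

```rust
/// Simplified stand-in for a struct array with child columns `x` and `y`.
/// This is an illustration only, not the arrow crate's StructArray type.
#[derive(Debug, Default, PartialEq)]
pub struct StructColumns {
    pub x: Vec<String>,
    pub y: Vec<i64>,
    /// `validity[i] == false` means the whole record at row `i` is null.
    pub validity: Vec<bool>,
}

/// Builds columnar data from possibly-null records, keeping one slot per row.
pub fn build_columns(records: &[Option<(&str, i64)>]) -> StructColumns {
    let mut out = StructColumns::default();
    for record in records {
        match record {
            Some((x, y)) => {
                out.x.push(x.to_string());
                out.y.push(*y);
                out.validity.push(true);
            }
            None => {
                // Placeholder child values; readers must consult `validity`.
                out.x.push(String::new());
                out.y.push(0);
                out.validity.push(false);
            }
        }
    }
    out
}

/// Counts rows whose top-level record is null.
pub fn null_count(columns: &StructColumns) -> usize {
    columns.validity.iter().filter(|valid| !**valid).count()
}
```

Unlike skipping null records, this keeps the output row count equal to the number of records in the file, which matters if row positions are meaningful to the caller.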

tustvold commented 1 year ago

This is in some ways a special case of the more general issue of supporting Avro schemas that don't contain a Record as the root. I recently did something similar for arrow_json in https://github.com/apache/arrow-rs/pull/4911, which is likely the approach I am going to take in https://github.com/apache/arrow-rs/issues/4886