apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.29k stars 973 forks source link

Remove `file_type()` from `FileFormat` #10499

Closed Jefffrey closed 2 weeks ago

Jefffrey commented 2 weeks ago

Which issue does this PR close?

Closes #8657

Rationale for this change

Was implementing rudimentary ORC integration for DataFusion:

https://github.com/datafusion-contrib/datafusion-orc/blob/main/src/datafusion/file_format.rs

Noticed that to implement FileFormat it requires a file_type(), even though this is not really used anywhere. Removing this can make it clearer for those wanting to extend with more custom file formats.

What changes are included in this PR?

Remove FileFormat::file_type() which was unused.

Are these changes tested?

Are there any user-facing changes?

Yes, function removed from trait

Jefffrey commented 2 weeks ago

There are some related issues that I think might be good to close now? #8345 and #8637

I was able to implement basic read support for ORC files via ListingTable provider:

https://github.com/datafusion-contrib/datafusion-orc/blob/main/src/datafusion/mod.rs

Which might be enough to demonstrate those issues are done?

alamb commented 2 weeks ago

Makes sense to me -- thank you @Jefffrey

As you go through implementing ORC support, if you hit anything else that woudl make it easier to add new format support to the core and/or listing table that would be great.

alamb commented 2 weeks ago

strictly speaking this is an API change but I think it makes it easier to implement new formats 🚀

alamb commented 2 weeks ago

There are some related issues that I think might be good to close now? #8345 and #8637

I was able to implement basic read support for ORC files via ListingTable provider:

https://github.com/datafusion-contrib/datafusion-orc/blob/main/src/datafusion/mod.rs

Which might be enough to demonstrate those issues are done?

Indeed -- it sounds reasonable

In my mind, the ideal example is to make an example in datafusion-examples showing how to create a custom file format

I wonder if we could create a custom CSV reader that ignores comments 🤔 -- for example https://github.com/apache/datafusion/issues/10262 that @bbannier has been working on

Jefffrey commented 2 weeks ago

As you go through implementing ORC support, if you hit anything else that woudl make it easier to add new format support to the core and/or listing table that would be great.

Will definitely keep this in mind :+1:

alamb commented 2 weeks ago

Thanks again @Jefffrey