apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Spatial data support #7859

Open Folyd opened 10 months ago

Folyd commented 10 months ago

Is your feature request related to a problem or challenge?

Currently, DataFusion does not support spatial data. Is there any plan to support it?

Describe the solution you'd like

Similar to duckdb: https://duckdb.org/docs/extensions/spatial.html

Describe alternatives you've considered

Duckdb

Additional context

https://cloud.google.com/bigquery/docs/geospatial-data

alamb commented 10 months ago

There are no plans that I know of yet.

In theory, we should be able to create an extension package, much like the duckdb model, rather than extending the core DataFusion engine.

I suspect there would be certain things that are not yet feasible (like adding a GEOMETRY type / alias, for example), but otherwise the existing extension points for DataFusion should be sufficient (ScalarUDFs, AggregateUDFs, etc.)
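
To make the UDF route concrete, here is a minimal sketch of the kind of per-row kernel a spatial ScalarUDF might wrap, assuming geometries arrive as WKB-encoded binary values. It uses only the Rust standard library and does not call any DataFusion or arrow-rs API; the names `st_x`, `st_x_column`, and `wkb_point_le` are hypothetical and chosen for illustration only.

```rust
use std::convert::TryInto;

/// Hypothetical kernel: extract the X coordinate from a WKB-encoded 2D Point.
/// Returns None for null input, non-point geometries, or malformed bytes.
fn st_x(wkb: Option<&[u8]>) -> Option<f64> {
    let bytes = wkb?;
    // Layout: 1-byte order flag, 4-byte geometry type, then two 8-byte doubles.
    if bytes.len() < 21 {
        return None;
    }
    let little_endian = match bytes[0] {
        0 => false, // big endian
        1 => true,  // little endian
        _ => return None,
    };
    let type_bytes: [u8; 4] = bytes[1..5].try_into().ok()?;
    let geom_type = if little_endian {
        u32::from_le_bytes(type_bytes)
    } else {
        u32::from_be_bytes(type_bytes)
    };
    if geom_type != 1 {
        return None; // 1 is the WKB code for a 2D Point
    }
    let x_bytes: [u8; 8] = bytes[5..13].try_into().ok()?;
    Some(if little_endian {
        f64::from_le_bytes(x_bytes)
    } else {
        f64::from_be_bytes(x_bytes)
    })
}

/// Apply the kernel to a whole column: one output row per input row,
/// which is the contract a scalar function has to satisfy.
fn st_x_column(column: &[Option<Vec<u8>>]) -> Vec<Option<f64>> {
    column.iter().map(|v| st_x(v.as_deref())).collect()
}

/// Helper for building sample input: encode a little-endian WKB point.
fn wkb_point_le(x: f64, y: f64) -> Vec<u8> {
    let mut out = vec![1u8];
    out.extend_from_slice(&1u32.to_le_bytes());
    out.extend_from_slice(&x.to_le_bytes());
    out.extend_from_slice(&y.to_le_bytes());
    out
}
```

Calling `st_x_column` on a column holding one valid point, one null, and one malformed value yields the point's X coordinate followed by two nulls. A real extension would read and write Arrow arrays rather than `Vec`s, and would register the function through DataFusion's UDF machinery; that wiring is left out here.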

Perhaps we can do something similar for the JSON/BSON support we are discussing in #7845

Folyd commented 10 months ago

Thanks @alamb.

Here is my example of handling geometry Parquet data: https://github.com/apache/arrow-rs/issues/4945. There is also the GeoParquet format in the community: https://geoparquet.org/. Also see: https://getindata.com/blog/introducing-geoparquet-data-format/

wjones127 commented 10 months ago

Since it hasn't been mentioned yet, I'd add there is already a project for Arrow extension types for geospatial data: https://github.com/geoarrow/geoarrow/blob/main/extension-types.md

This is related to the GeoParquet project.

alamb commented 10 months ago

Thanks @wjones127 -- I had forgotten about extension types.

Maybe we could add support for extension types in DataFusion's core and use that extension point to implement a geospatial package on top of DataFusion 🤔

Having a good first use case (Geospatial and possibly JSON) to drive the requirements seems like a good idea.

If you agree, I can try and write up a larger project description

yukkit commented 10 months ago

@alamb I have the same requirement as well and hope to initiate it as soon as possible. If possible, I can also contribute code for this.

alamb commented 10 months ago

> @alamb I have the same requirement as well and hope to initiate it as soon as possible. If possible, I can also contribute code for this.

That is great news @yukkit -- I don't think I have the bandwidth to try and organize an effort to add Geospatial support to DataFusion in the near term. I wonder if anyone has the bandwidth to help organize an effort to add extension type support? I don't know enough about how this works to really do so without additional research, and sadly I don't have the time at the moment to devote there

yukkit commented 10 months ago

@alamb OK. If possible, I plan to support UDTs (user-defined types) in DataFusion. I will post my ideas in the next few days for anyone to discuss.

alamb commented 10 months ago

I would love to see a design proposal for user defined types. ❤️ -- thank you!

yukkit commented 10 months ago

> I would love to see a design proposal for user defined types. ❤️ -- thank you!

Of course, it's absolutely essential!

kylebarron commented 10 months ago

My goal is to enable spatial support in projects such as datafusion via https://github.com/geoarrow/geoarrow-rs

kylebarron commented 1 month ago

I'd argue that spatial data support is pretty much blocked until datafusion has support for user-defined types, since it's pretty crucial to pass along GeoArrow metadata, so it's really exciting to see https://github.com/apache/datafusion/issues/11513 / https://github.com/apache/datafusion/pull/11160 !
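
As a rough illustration of why passing metadata along matters: Arrow records an extension type in field-level metadata, under the keys `ARROW:extension:name` and `ARROW:extension:metadata`, and GeoArrow uses names such as `geoarrow.point`. The sketch below models that with a stand-in `Field` struct built on the Rust standard library. It is not the arrow-rs `Field` type, and the helper names are hypothetical; it only shows how a step that rebuilds a field without its metadata silently turns a geometry column back into plain storage.

```rust
use std::collections::HashMap;

/// Metadata key under which Arrow stores an extension type's name.
const EXTENSION_NAME_KEY: &str = "ARROW:extension:name";

/// Stand-in for a schema field (hypothetical, not the arrow-rs type).
#[derive(Clone, Debug)]
struct Field {
    name: String,
    metadata: HashMap<String, String>,
}

/// Return the GeoArrow extension name if the field carries one.
fn geoarrow_type(field: &Field) -> Option<&str> {
    field
        .metadata
        .get(EXTENSION_NAME_KEY)
        .map(|s| s.as_str())
        .filter(|name| name.starts_with("geoarrow."))
}

/// A rename that rebuilds the field from scratch and loses the metadata.
fn rename_dropping_metadata(_field: &Field, new_name: &str) -> Field {
    Field {
        name: new_name.to_string(),
        metadata: HashMap::new(),
    }
}

/// A rename that carries the metadata through unchanged.
fn rename_keeping_metadata(field: &Field, new_name: &str) -> Field {
    Field {
        name: new_name.to_string(),
        metadata: field.metadata.clone(),
    }
}
```

With a field tagged `geoarrow.point`, `geoarrow_type` still finds the tag after `rename_keeping_metadata`, but returns `None` after `rename_dropping_metadata`, at which point downstream code can no longer tell the column apart from ordinary binary or struct data.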