dlt-hub / dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
https://dlthub.com/docs
Apache License 2.0

Support spatial / geo types #696

Open quassy opened 11 months ago

quassy commented 11 months ago

Feature description

dlt should support geo types such as shapes, geometries & geographies, and different CRS, as well as possible, to allow loading/transfer of such data into spatial databases like PostGIS (Postgres extension), SpatiaLite (SQLite extension), BigQuery (2D geographies only), H2, Oracle Spatial...

Are you a dlt user?

I'd consider using dlt, but it's lacking a feature I need.

Use case

Natively, Postgres supports basic geometries, and PostGIS adds support for georeferenced geometries/geographies with different CRS (coordinate reference systems) & even 3D. BigQuery supports 2D geographies only, and only in CRS WGS84/EPSG:4326. As different database systems (and coordinate reference systems) fit different usages, data is often transferred and transformed between systems for certain workloads. Loading spatial data is quite cumbersome because of poor support in common libraries, differing coordinate systems, and expensive operations on geometries.

Proposed solution

Support spatial data. In some directions, data has to be converted to a certain CRS or to GeoJSON/WKT because the target system might not support it otherwise. These conversions can be lossy, for example through small changes caused by transformation and rounding of coordinates. Also, different DBMSs enforce validation rules ((counter-)clockwise polygons, self-intersections, touching points) differently, so not all geometries may be transferable.
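To make the lossiness concrete, here is a minimal pure-Python sketch (not from the thread; the function name and the 111,320 m/degree equator approximation are illustrative assumptions) showing how much ground error rounding a longitude introduces:

```python
def rounding_error_m(lon_deg: float, decimals: int) -> float:
    """Approximate ground error (in meters) introduced by rounding a
    longitude at the equator, where 1 degree is roughly 111,320 m."""
    rounded = round(lon_deg, decimals)
    return abs(lon_deg - rounded) * 111_320

# A full-precision coordinate vs. rounded copies:
lon = 13.4050123456
print(rounding_error_m(lon, 5))  # sub-meter error, but nonzero
print(rounding_error_m(lon, 2))  # error of several hundred meters
```

Errors of this size are invisible in a string diff but can flip validity checks (e.g. turning touching polygons into slightly overlapping ones), which is why round-tripping geometries between systems is delicate.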

Related issues

No response

rudolfix commented 11 months ago

@quassy it is quite easy to add a new data type that will load data into a GEOGRAPHY column if the destination supports it. I'm not sure it will really solve your problem, though. From your description it looks like the real problem is on the Python side, where there's no good library to represent geospatial data, convert across different coordinate systems, etc. What we could do on our side is add a geography data type that accepts a string or base64-encoded binary representation (if there's any standard representation of those as Python objects, we can also recognize those) and tries to load it into the destination. The whole task of formulating the data in the right format for a given destination would be on the user's side (we can of course add helpers). Would you start with BigQuery or Postgres? Which is more popular? What are typical use cases? Is this data coming from automated systems that need to be loaded? Or is it rather the result of a computation (i.e. in a notebook) that needs to be loaded later?
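The string / base64-encoded binary representations mentioned above can be sketched in pure Python. This is not dlt code, just an illustration of what a user-supplied geometry value could look like: WKT as a plain string, and WKB packed per the OGC layout for a 2D point (byte-order flag, geometry type, two IEEE-754 doubles), then base64-encoded:

```python
import base64
import struct

def point_wkt(x: float, y: float) -> str:
    """WKT string representation of a 2D point."""
    return f"POINT ({x:g} {y:g})"

def point_wkb_b64(x: float, y: float) -> str:
    """Base64-encoded WKB for a 2D point: little-endian flag (1),
    geometry type 1 (Point), then the two coordinates as doubles."""
    wkb = struct.pack("<BIdd", 1, 1, x, y)
    return base64.b64encode(wkb).decode("ascii")

# A row shaped like what a pipeline could hand to a geography column:
row = {"id": 1, "geom": point_wkb_b64(8.54, 47.37)}
```

Both forms are destination-agnostic; PostGIS and BigQuery can each parse WKT/WKB on their side, keeping the coordinate math out of the pipeline.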

quassy commented 11 months ago

Python has libraries for that, like geopandas, osgeo (GDAL), pyproj, shapely, geoparquet... They work for small datasets, but for EL pipelines they are often not performant enough. I'm not sure how dlt works on the inside, but GeoParquet, binary (WKB), or GeoJSON might be the best ways to represent spatial data.

I would start with PostGIS as it's FLOSS, generally more popular, and supports more geo types. You can self-host it, so it's also much easier to start developing and running tests against. (BigQuery just scales much better.)

Use cases are wide, and I can only speak from experience in a small area. An example would be working with openly available geodata like the EU INSPIRE datasets, such as conservation areas in Ireland. Using the above libraries, the data can be loaded into PostGIS. But then there are tasks where OLAP databases (like BigQuery) are better suited, others where OLTP systems (like PostGIS) are better, and others still where flat file storage (as GeoJSON, shp, WKT...) works best, so you end up transferring data back and forth between systems.

Pipboyguy commented 9 months ago

I suggest we push as many operations as possible to the target warehouse. Libraries like GDAL require library headers to be installed before a pip install will work, and the other libraries are also very bloated.

A simple in-memory representation like GeoJSON (or GeoArrow) should in theory suffice: support all sources, and load to the destination as a JSON type as well. In the spirit of not dealing with the T in ELT, I don't think it's appropriate for dlt to do complex conversions, CRS projections, etc. that require extensive specialist knowledge and depend too much on the individual use case.
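The "geometry as a JSON value" idea above can be sketched without any geo library. This is a hypothetical helper (the name and row shape are illustrative, not dlt API): it flattens GeoJSON Features into rows, keeping the geometry verbatim as a JSON string for a plain JSON/text column:

```python
import json

def feature_rows(features):
    """Flatten GeoJSON Features into rows: properties become columns,
    the geometry is kept verbatim as a JSON string (no CRS math here)."""
    for feat in features:
        row = dict(feat.get("properties", {}))
        row["geometry"] = json.dumps(feat["geometry"], sort_keys=True)
        yield row

features = [{
    "type": "Feature",
    "properties": {"name": "site A"},
    "geometry": {"type": "Point", "coordinates": [-6.26, 53.35]},
}]
rows = list(feature_rows(features))
```

Because the geometry passes through untouched, any projection or validation stays where it belongs: in the source before extraction, or in the warehouse after loading.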

adrianbr commented 6 months ago

Re duplicate: https://github.com/dlt-hub/dlt/issues/1101

from slack:

Sepp 3 hours ago Re use case: Our data is mostly georeferenced. Take, for instance, a "project". A project for us always represents an area in the real world (it could be a district or any arbitrary polygon shape). Inside a project are "addresses", which also represent our clients, as we build infrastructure to our clients' addresses. So we have different infrastructure tables ("layers"), addresses, projects, streets, etc. All the tables store at least one geometry column in a PostGIS database. And because we do several spatial transformations and checks with this data (e.g. "Does the address point lie inside its project's polygon?"), using dlt without support for spatial data would not be sufficient for us. How would you describe "general support"?
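A check like "does the address point lie inside its project's polygon?" would normally be pushed to PostGIS (e.g. via `ST_Contains`), but the underlying geometry test is simple enough to sketch in pure Python as a ray-casting point-in-polygon routine (an illustrative implementation, not anything from dlt or PostGIS):

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test: cast a horizontal ray from (x, y) and count how
    many polygon edges it crosses; an odd count means the point is inside.
    `ring` is a list of (x, y) vertices of a simple, non-self-intersecting ring."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            # x-coordinate where this edge crosses the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

project = [(0, 0), (10, 0), (10, 10), (0, 10)]  # project area as a polygon ring
print(point_in_polygon(5, 5, project))   # → True (address inside)
print(point_in_polygon(15, 5, project))  # → False (address outside)
```

In the "push work to the warehouse" spirit of this thread, dlt would only need to deliver the geometries intact; the containment check itself runs as SQL on the destination.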

Sepp 3 hours ago Sometimes we will also load GeoJSON, shapefiles, or other spatial data from a filesystem into our data warehouse, for the case when our client doesn't use a database we can connect to directly

Sepp 3 hours ago And as pipboyguy mentions in the last comment of the issue you sent me, just having a GeoJSON that is passed to the destination is sufficient. It just needs to be ensured that things like the coordinate system and data accuracy are not lost in the process.

Sepp 2 hours ago For the existing Postgres source / destination, it would be good to have geoalchemy2 connected, as it was made exactly for the purpose of reading from and writing to PostGIS databases

SeppBerkner commented 2 months ago

Just to raise awareness for this topic again:

gregorywaynepower commented 1 month ago

Our use case would be exporting spatial data from MS SQL Server, Postgres with the PostGIS extension, shapefiles, or geoparquet/geoarrow types. The challenge is that some data formats, especially the old-school ones, don't scale well and require batching.

rudolfix commented 6 days ago

we'd like to start with the simplest possible case:

gregorywaynepower commented 6 days ago

I can understand wanting to only support PostGIS--it's the easiest to pick apart and there can be some overlap with DuckDB's spatial extension if you go that route.