exyi / pg2parquet

Export PostgreSQL table or query into Parquet file
Apache License 2.0
56 stars 10 forks source link

Handle geo types #11

Open mhkeller opened 8 months ago

mhkeller commented 8 months ago

I'd like to use this to query a PostGIS database that has geometry types. I see there is a GeoParquet format although I haven't looked too deeply into how well these types match one another. Alternatively, serializing geometry types as strings would also be fine for now.

mhkeller commented 8 months ago

Revising this to just Handle instead of Support since I think the two most useful options for my use case don't actually require converting into geotypes.

mhkeller commented 8 months ago

I can close this for now since I found a good workaround. Instead of using select * from my_table, I'm naming each column and for geometry fields, changing the query selection to SUBSTRING(ST_AsText(${column_name})::text, 1, 1000) as ${column_name}, which has the advantage in my use case of not returning the entire geom string.

exyi commented 8 months ago

Sorry, I'll reopen the issue :)

I like the proposal to add options for ignoring unknown types, or converting them to text. It has the problem of backwards compatibility - when I would add support for a new type, your existing workflow can break since type of some column have changed. I guess it's not a big deal if we don't make it the default, and add warning to the documentation.

I think it is a good idea to support the GeoParquet extension, although it looks quite complex, so no promises it will happen anytime soon. I already wanted to add support for points, paths and polygons and serialize them as parquet structs, but postgres only tells me the column type is geometry, not geometry(point) :/ GeoParquet can apparently serialize all of these objects into one BYTE_ARRAY field, so it should be possible to implement.

mahmut-spark commented 2 months ago

I just came here to ask for the same :) Thank you for your efforts. It would be nice to have a flag to cast unknown types as string.

exyi commented 2 months ago

@mahmut-spark Great! Are you looking specifically for postgis geography types or other currently unsupported data types? I have recently found a way to extract the exact column type (i.e. geography(point), not just geography), which opens the possibility to serialize geo types as Parquet structs (point as {x: float, y: float} struct, polygons as lists, ...).

What kind of output schema would you prefer? The possibilities I'm aware of are:

Unfortunately, GeoParquet (currently) only supports geography types at the root of the schema. PostGIS definitely allows you to have them in custom types and arrays, meaning that we might have to produce an invalid GeoParquet if the database uses this.

mhkeller commented 2 months ago

I’m what curious what you found to get the specific geometry type…

mahmut-spark commented 2 months ago

My use case is Postgis -> Parquet -> DuckDB. I think it will be fine even if the geometry values are cast to text. That being said, I gathered these. Hopefully, it will helpful:

PostGIS definitely allows you to have them in custom types and arrays

I couldn't find any information on custom geometry types. But arrays yes, and I confirmed from pgadmin as well:

image

And geoparquet seems to expect the format as: https://geoparquet.org/releases/v1.0.0/

Nesting: Geometry columns MUST be at the root of the schema. A geometry MUST NOT be a group field or nested in a group. In practice, this means that when writing to GeoParquet from another format, geometries cannot be contained in complex or nested types such as structs, lists, arrays, or map types.

Given these circumstances, I suppose it makes sense to (if possible) use GeoParquet:

  1. Write compatible geometry types as WKB.
  2. Write incompatible geometry types cast as TEXT.

What do you think?

exyi commented 2 months ago

@mhkeller I’m what curious what you found to get the specific geometry type…

The information is available in the system catalog pg_attribute. This query will list all attributes with the full types: select a.attname, pg_catalog.format_type(a.atttypid, a.atttypmod) FROM pg_catalog.pg_attribute a, you can obviously add filters for specific relations, it also has information about nullability. I believe this is the way to go for other things than geo types — knowing numeric scale and precision is also great, and nullability is nice to have. Currently, all fields are optional in the generated Parquet files; sometimes it doesn't matter, sometimes it's annoying to work with (hello Julia DataFrames).

However, it's bit more complicated to get work with arbitrary queries, but the basic info should be extractable from EXPLAIN (VERBOSE, FORMAT json) {userQuery}.

Since it isn't really needed for WKB/WKT output, I this we could have geo support first. I guess the other main blocker is being able to reliably test against postgres with extensions :sweat_smile: , I'll have a look at setting up proper CI

@mahmut-spark ...

Thank you for the detailed info! I'm still not sure about the correct defaults, but I find it important to support both WKB, WKT and maybe the structs in future, so I'll probably just add an option and make it required for now.

basaran commented 2 months ago

Hi, this is my personal git. If it will be helpful I can put together a docker compose for you to extend postgis. Please let me know.