Deltares / hydromt

HydroMT: Automated and reproducible model building and analysis
https://deltares.github.io/hydromt/
MIT License

explore cloud-ready vector formats and readers #66

Closed DirkEilander closed 1 year ago

DirkEilander commented 2 years ago

From a first trial: fgb (FlatGeobuf) works out of the box with geopandas.read_file() and reads spatial subextents faster, even faster than gpkg.
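A minimal sketch of what such a subextent read could look like. The helper name `read_subextent` and the file path in the usage note are hypothetical; the only geopandas call used is `read_file()` with its `bbox` argument.

```python
def read_subextent(path, bbox):
    """Read only the features that intersect bbox = (minx, miny, maxx, maxy).

    Hypothetical helper for illustration; not part of HydroMT.
    """
    minx, miny, maxx, maxy = bbox
    if minx > maxx or miny > maxy:
        raise ValueError("bbox must be (minx, miny, maxx, maxy)")

    # Imported lazily so the bbox check works without geopandas installed.
    import geopandas as gpd

    # The bbox is passed to the reader, which filters spatially while reading.
    return gpd.read_file(path, bbox=bbox)
```

Usage would be something like `read_subextent("rivers.fgb", (4.0, 51.5, 5.0, 52.5))`, where the file name and coordinates are made up.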

Tjalling-dejong commented 1 year ago

I did some testing with geopandas and the fgb and parquet formats.

File structure: It is possible to open vector parquet files with geopandas.read_parquet(), but this method cannot read spatial subextents. Parquet is a columnar data format, so it does allow filtering by column, but I don't think that is very relevant for HydroMT. FlatGeobuf does allow filtering on a subextent (bounding box or geometry).

I/O speed: Parquet has an advantage over fgb when it comes to writing speed. In my test, writing the data to parquet was 18 times faster than writing it to fgb. Reading the data from fgb, however, was about twice as fast as reading it from parquet.
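For anyone who wants to repeat a comparison like this, here is a rough sketch of a timing harness. It is not the script used for the numbers above; `time_call` and `compare_formats` are hypothetical names, and `gdf` and `workdir` stand for any GeoDataFrame and any writable directory.

```python
import os
import time


def time_call(fn, repeat=1):
    """Return the best wall-clock time in seconds over `repeat` calls of fn()."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def compare_formats(gdf, workdir, repeat=3):
    """Time writing and reading `gdf` as FlatGeobuf and as Parquet."""
    import geopandas as gpd  # lazy import; only needed when actually benchmarking

    fgb = os.path.join(workdir, "test.fgb")
    pq = os.path.join(workdir, "test.parquet")
    # Dict entries are evaluated in order, so both files exist before the reads.
    return {
        "write_fgb": time_call(lambda: gdf.to_file(fgb, driver="FlatGeobuf")),
        "write_parquet": time_call(lambda: gdf.to_parquet(pq)),
        "read_fgb": time_call(lambda: gpd.read_file(fgb), repeat),
        "read_parquet": time_call(lambda: gpd.read_parquet(pq), repeat),
    }
```

Results will depend heavily on the dataset, the disk, and the installed geopandas I/O engine, so the ratios should be treated as indicative only.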

Integration: Since you can use geopandas.read_file() for reading fgb files, integration in HydroMT will be quite straightforward; the fgb file format is inferred from the extension. With parquet you would need to use geopandas.read_parquet(), which does not accept the same kwargs as read_file().
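To illustrate the integration difference, a sketch of an extension-based dispatch is shown below. This is not HydroMT's actual reader; `pick_reader`, `open_vector`, and the extension table are assumptions made for the example. For parquet it falls back to reading the whole file and clipping afterwards with the GeoDataFrame `.cx` indexer, reflecting the limitation described in this thread (newer geopandas releases may behave differently).

```python
from pathlib import Path

# Extension-to-reader mapping; assumed for illustration only.
_READERS = {".fgb": "read_file", ".gpkg": "read_file", ".parquet": "read_parquet"}


def pick_reader(path):
    """Return the name of the geopandas reader to use for this file extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in _READERS:
        raise ValueError(f"unsupported vector format: {suffix!r}")
    return _READERS[suffix]


def open_vector(path, bbox=None):
    """Open a vector file, optionally limited to bbox = (minx, miny, maxx, maxy)."""
    import geopandas as gpd  # lazy import so pick_reader works without geopandas

    if pick_reader(path) == "read_file":
        # fgb and gpkg: the spatial filter is applied while reading.
        return gpd.read_file(path, bbox=bbox)

    # parquet: read everything, then clip in memory.
    gdf = gpd.read_parquet(path)
    if bbox is not None:
        minx, miny, maxx, maxy = bbox
        gdf = gdf.cx[minx:maxx, miny:maxy]
    return gdf
```

The parquet branch gives the same kind of result but loads the full dataset first, which is the main reason the fgb path is simpler to integrate.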

A nice article that gives an overview of the current cloud-native vector formats: https://cholmes.medium.com/an-overview-of-cloud-native-vector-c223845638e0

The author is of the opinion that FlatGeobuf is the best cloud-ready vector format at the moment. He does indicate that other formats are in development and that GeoParquet has much potential.

DirkEilander commented 1 year ago

Thanks for looking into this @Tjalling-dejong. We are already using this format and it indeed seems to work fine, so your research is a confirmation of that. It is also mentioned here as the preferred option for vector data in the deltares_data catalog: https://github.com/Deltares/hydromt/tree/main/data/catalogs. I don't think this issue requires further action.