feat(output): export all outputs in parquet files

vincent-leblond commented 4 months ago

Adds Parquet export for all outputs. All existing exports are kept; the Parquet files are written in addition to them. Tabular outputs use the .parquet extension and geospatial outputs use the .geoparquet extension. The pyarrow package is added as a dependency.

Parquet is much faster to read than GeoPackage and CSV files. For example, for the trips data of a 10% population sample for département 14, reading takes 25 seconds from GeoPackage and only 0.17 seconds from Parquet.
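For anyone who wants to reproduce this kind of comparison, here is a minimal timing sketch. The helper name `time_read` and the file names in the usage comment are hypothetical and only for illustration; the sketch assumes geopandas is installed for the actual reads, and timings will vary with hardware and disk caching.

```python
import time


def time_read(reader, path):
    """Call reader(path) once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = reader(path)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Usage (hypothetical file names, requires geopandas):
#
#   import geopandas as gpd
#   _, t_gpkg = time_read(gpd.read_file, "trips.gpkg")
#   _, t_parquet = time_read(gpd.read_parquet, "trips.geoparquet")
#   print("gpkg: %.2fs, geoparquet: %.2fs" % (t_gpkg, t_parquet))
```

A single call per format is only a rough indication; repeating the read a few times and keeping the best time gives a more stable number.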

sebhoerl commented 4 months ago

Hi Vincent, thanks for the PR. It looks like the dependencies are not consistent in the environment.yml (check failed unit test). I can find some time for it, but if you could figure it out that would be great:

1. Create a fresh environment using the develop environment.yml
2. Install pyarrow in whatever version is proposed by conda, this may also upgrade some other dependencies
3. Use a tool of your choice or simply conda env export --no-builds and check the version of pyarrow and all other dependencies that are in environment.yml and note their (potentially updated) versions in the new environment.yml

Second point, I think it would be great if this was configurable. I imagine something like:

config:
  [...]
  output_formats: ["csv", "parquet", "gpkg", "geoparquet"]

It would be set to ["csv", "gpkg"] by default for now, in order to (1) stay backwards compatible and (2) allow users to switch completely to parquet if they want, without writing duplicate outputs. So basically, this would just mean having an if around every output command depending on what is in the list. Would be great if you can take a look at this, otherwise I can also find some time.
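A rough sketch of what the "if around every output command" could look like is given below. The function names `resolve_output_formats` and `write_outputs` and the `prefix` argument are hypothetical and not taken from the pipeline; the sketch only assumes that `df` is a pandas DataFrame and `gdf` a geopandas GeoDataFrame, whose `to_csv`, `to_parquet` and `to_file` methods do the actual writing.

```python
DEFAULT_OUTPUT_FORMATS = ["csv", "gpkg"]
KNOWN_OUTPUT_FORMATS = {"csv", "parquet", "gpkg", "geoparquet"}


def resolve_output_formats(config):
    """Read output_formats from the config, falling back to the default."""
    formats = list(config.get("output_formats", DEFAULT_OUTPUT_FORMATS))
    unknown = sorted(set(formats) - KNOWN_OUTPUT_FORMATS)
    if unknown:
        raise ValueError("Unknown output formats: %s" % ", ".join(unknown))
    return formats


def write_outputs(df, gdf, prefix, formats):
    """Write df (tabular) and gdf (geospatial) in the requested formats only.

    Returns the list of paths that were written.
    """
    written = []

    if "csv" in formats:
        path = "%s.csv" % prefix
        df.to_csv(path, index=False)
        written.append(path)

    if "parquet" in formats:
        path = "%s.parquet" % prefix
        df.to_parquet(path)
        written.append(path)

    if "gpkg" in formats:
        path = "%s.gpkg" % prefix
        gdf.to_file(path, driver="GPKG")
        written.append(path)

    if "geoparquet" in formats:
        path = "%s.geoparquet" % prefix
        gdf.to_parquet(path)
        written.append(path)

    return written
```

With an empty config this writes only the CSV and GeoPackage files, so existing setups keep their current outputs; setting `output_formats: ["parquet", "geoparquet"]` would write only the Parquet variants.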

vincent-leblond commented 4 months ago

Hi Sebastian, I will follow your advice and then come back to you.

vincent-leblond commented 4 months ago

Changing pyarrow version to an older one seems to be enough.

sebhoerl commented 4 months ago

Looks good, thanks a lot :)

eqasim-org / ile-de-france

feat(output): export all outputs in parquet files #238