vincent-leblond closed 4 months ago
Hi Vincent, thanks for the PR. It looks like the dependencies in the environment.yml are not consistent (see the failed unit test). I can find some time for it, but if you could figure it out that would be great:
- Start from develop and its environment.yml
- Install pyarrow in whatever version is proposed by conda; this may also upgrade some other dependencies
- Run conda env export --no-builds and check the version of pyarrow and all other dependencies that are in environment.yml
- Note their (potentially updated) versions in the new environment.yml
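The comparison step above could be done by eye, but a small script makes it less error-prone. The following is only a sketch: the function names `parse_pins` and `updated_pins` are hypothetical, and it assumes both files list dependencies as `- name=version` lines, which is what `conda env export --no-builds` normally produces.

```python
import re


def parse_pins(yaml_text):
    """Map package name -> pinned version for '- name=version' dependency lines."""
    pins = {}
    for line in yaml_text.splitlines():
        # Accepts conda-style 'name=version' and pip-style 'name==version';
        # a trailing '=build' suffix, if present, is ignored.
        match = re.match(r"^\s*-\s*([A-Za-z0-9_.\-]+)=+([^=\s]+)", line)
        if match:
            pins[match.group(1).lower()] = match.group(2)
    return pins


def updated_pins(current_yaml, exported_yaml):
    """Return {name: (old, new)} for packages pinned in the current
    environment.yml whose version differs in the exported environment."""
    current = parse_pins(current_yaml)
    exported = parse_pins(exported_yaml)
    return {
        name: (old, exported[name])
        for name, old in current.items()
        if name in exported and exported[name] != old
    }
```

Feeding it the text of the existing environment.yml and the output of the export lists only the pins that conda changed while installing pyarrow, which are the ones to update by hand.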
Second point, I think it would be great if this was configurable. I imagine something like:
config:
  [...]
  output_formats: ["csv", "parquet", "gpkg", "geoparquet"]
It would be set to ["csv", "gpkg"] by default for now, to (1) stay backwards compatible and (2) allow users to switch completely to parquet if they want, without writing duplicate outputs. So basically, this would just mean having an if around every output command, depending on what is in the list. Would be great if you can take a look at this, otherwise I can also find some time.
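To make the suggestion concrete, here is a minimal sketch of how the option could be resolved and used to decide which files get written. The helper names (`resolve_output_formats`, `planned_outputs`) are hypothetical and not part of the pipeline, and the split into tabular formats (csv, parquet) and spatial formats (gpkg, geoparquet) is an assumption based on the extensions described in this PR.

```python
# Hypothetical helpers; the pipeline's real config access and writers may differ.
DEFAULT_OUTPUT_FORMATS = ["csv", "gpkg"]  # backwards-compatible default
SUPPORTED_OUTPUT_FORMATS = {"csv", "parquet", "gpkg", "geoparquet"}


def resolve_output_formats(config):
    """Return the requested formats, falling back to the default list."""
    formats = config.get("output_formats", DEFAULT_OUTPUT_FORMATS)
    unknown = sorted(set(formats) - SUPPORTED_OUTPUT_FORMATS)
    if unknown:
        raise ValueError("Unknown output formats: %s" % ", ".join(unknown))
    return list(formats)


def planned_outputs(name, is_spatial, formats):
    """List the files that would be written for one output table."""
    # Assumption: spatial tables go to gpkg/geoparquet, all others to csv/parquet.
    if is_spatial:
        candidates = [("gpkg", ".gpkg"), ("geoparquet", ".geoparquet")]
    else:
        candidates = [("csv", ".csv"), ("parquet", ".parquet")]
    # This membership check is the "if around every output command".
    return [name + extension for fmt, extension in candidates if fmt in formats]
```

With an empty config the result is the current behaviour (csv and gpkg only), while `output_formats: ["parquet", "geoparquet"]` would switch entirely to Parquet without writing duplicate outputs.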
Hi Sebastian, I will follow your advice and then come back to you.
Changing pyarrow version to an older one seems to be enough.
Looks good, thanks a lot :)
Add Parquet export for all outputs. All existing exports are kept, and new Parquet files are written alongside them. Non-spatial files use the .parquet extension and geospatial files use the .geoparquet extension. The pyarrow package is added as a dependency.
Parquet is much faster to read than GeoPackage and CSV files. For example, for the trips data of a 10% population sample for departement 14, reading takes 25 seconds from GeoPackage and only 0.17 seconds from Parquet.