eqasim-org / ile-de-france

An open synthetic population of Île-de-France for agent-based transport simulation
GNU General Public License v2.0

feat(output): export all outputs in parquet files #238

Closed: vincent-leblond closed this 4 months ago

vincent-leblond commented 4 months ago

Add Parquet export for all outputs. All existing exports are kept, and the new Parquet files are written alongside them. Tabular files use the .parquet extension and geospatial files use the .geoparquet extension. The pyarrow package is added as a dependency.

Parquet is much faster to read than GeoPackage and CSV files. For example, reading the trips data for a 10% population of département 14 takes 25 seconds from GeoPackage and only 0.17 seconds from Parquet.
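A read-time comparison like this one could be reproduced with a small timing helper. This is only a sketch: `time_read` is a hypothetical helper, and the file names in the usage comment are assumptions, not the pipeline's actual output names.

```python
import time


def time_read(reader, path, repeats=3):
    """Return the best wall-clock time (in seconds) of reader(path) over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        reader(path)  # the result is discarded; only the duration matters
        best = min(best, time.perf_counter() - start)
    return best


# Usage, assuming geopandas and pyarrow are installed and both files exist
# (hypothetical file names):
#   import geopandas as gpd
#   t_gpkg = time_read(gpd.read_file, "trips.gpkg")
#   t_parquet = time_read(gpd.read_parquet, "trips.geoparquet")
#   print(f"GeoPackage: {t_gpkg:.2f} s, GeoParquet: {t_parquet:.2f} s")
```

Taking the best of a few runs reduces the influence of disk caching and other noise, although the first read of a cold file may still be noticeably slower.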

sebhoerl commented 4 months ago

Hi Vincent, thanks for the PR. It looks like the dependencies in environment.yml are not consistent (see the failed unit test). I can find some time for it, but if you could figure it out, that would be great:

Second point, I think it would be great if this was configurable. I imagine something like:

config:
  [...]
  output_formats: ["csv", "parquet", "gpkg", "geoparquet"]

It would be set to ["csv", "gpkg"] by default for now, to (1) stay backwards compatible and (2) allow users to switch completely to parquet if they want, without writing duplicate outputs. Basically, this would just mean having an if around every output command, depending on what is in the list. It would be great if you could take a look at this; otherwise I can also find some time.
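A minimal sketch of that gating, assuming the config is available as a plain dict. All names here (`resolve_output_formats`, `export`, the stand-in writers) are hypothetical and are not the pipeline's actual code.

```python
# Hypothetical names throughout; this is an illustration, not the pipeline's code.
DEFAULT_OUTPUT_FORMATS = ["csv", "gpkg"]  # backwards-compatible default
KNOWN_FORMATS = {"csv", "gpkg", "parquet", "geoparquet"}


def resolve_output_formats(config):
    """Return the requested formats, falling back to the default list."""
    formats = list(config.get("output_formats", DEFAULT_OUTPUT_FORMATS))
    unknown = sorted(set(formats) - KNOWN_FORMATS)
    if unknown:
        raise ValueError(f"Unknown output formats: {unknown}")
    return formats


def export(name, writers, formats):
    """Call only the writers whose format is enabled; return the paths written."""
    written = []
    for fmt, write in writers.items():
        if fmt in formats:  # the "if around every output command"
            path = f"{name}.{fmt}"
            write(path)
            written.append(path)
    return written


# Stand-in writers that only record the path; real ones would call, for example,
# DataFrame.to_csv or DataFrame.to_parquet.
log = []
writers = {"csv": log.append, "parquet": log.append}

print(export("persons", writers, resolve_output_formats({})))
# ['persons.csv']
print(export("persons", writers, resolve_output_formats({"output_formats": ["parquet"]})))
# ['persons.parquet']
```

Because each format name doubles as the file extension, a user who lists only "parquet" and "geoparquet" gets no CSV or GeoPackage files at all, which avoids the duplicate outputs.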

vincent-leblond commented 4 months ago

Hi Sebastian, I will follow your advice and then come back to you.

vincent-leblond commented 4 months ago

Changing the pyarrow version to an older one seems to be enough.

sebhoerl commented 4 months ago

Looks good, thanks a lot :)