esciencecenter-digital-skills / geospatial-python

Introduction to Geospatial Raster and Vector Data with Python
https://esciencecenter-digital-skills.github.io/geospatial-python/

Loading the crop fields vector data is too slow #32

Closed rogerkuou closed 1 year ago

fnattino commented 1 year ago

geopandas.read_file accepts the arguments `bbox`, `mask`, and `rows` to load only a portion of the data into memory. I'm not sure whether any of these options actually speeds up loading the dataset (one might still need to loop over all features before filtering), but we could check whether at least one of them helps. At the very least they should reduce the memory requirements.

fnattino commented 1 year ago

read_file docs

rogerkuou commented 1 year ago

We could ask the participants to download the data beforehand and load only part of it, e.g. with `bbox`.

rogerkuou commented 1 year ago

I made a test to profile the memory usage and performance of three scenarios:

  1. load all from local. clock time: 1m24.2s. memory increment: 1686.69 MiB

    cf_boundary = gpd.read_file("../data/brpgewaspercelen_definitief_2020.gpkg")
  2. load with bbox from local. clock time: 1.4s. memory increment: 20.68 MiB

    cf_boundary = gpd.read_file("../data/brpgewaspercelen_definitief_2020.gpkg", bbox=bbox)
  3. load with bbox from remote. clock time: 42.9s. memory increment: 1016.53 MiB

    cf_boundary = gpd.read_file("../data/brpgewaspercelen_definitief_2020.gpkg", bbox=bbox)

Clock time was measured in the Jupyter Notebook; memory profiling was done with the `%memit` magic command (from `memory_profiler`).

In conclusion, the second approach seems optimal. We can then ask the participants to download the data first and read it with the `bbox` argument.