GNS-Science / toshi-hazard-post

Hazard post-processing pipeline as serverless AWS infrastructure.
MIT License

Pre-load realization data (pyarrow) #32

Closed chrisdicaprio closed 3 months ago

chrisdicaprio commented 4 months ago

We want to pre-load data or a data scanner before distributing jobs, to eliminate redundant searches of the data and reduce memory use. Investigate how to filter pyarrow in stages and share the results across multiprocessing processes.

Currently the data is partitioned on 1.0 deg tiles; without this, access via S3 is quite slow. The pre-loading and distribution of the data/scanner will have to obey the partitioning used.

Is this sequence sensible/possible? NB: we can filter on the partition, so it doesn't have to be explicitly defined when creating the dataset object (https://arrow.apache.org/docs/python/dataset.html#reading-partitioned-data)

chrisdicaprio commented 3 months ago

We found that pre-loading and sharing a dataset is not possible with multiprocessing and pyarrow. All loading and filtering is done by each process separately.