GNS-Science / toshi-hazard-post

Hazard post-processing pipeline as serverless AWS infrastructure.
MIT License

Pre-load realization data (pyarrow) #32

Closed chrisdicaprio closed 3 months ago

chrisdicaprio commented 4 months ago

We want to pre-load data or a data scanner before distributing jobs, to eliminate redundant searches of the data and reduce memory use. Investigate how to filter pyarrow in stages and share the results across multiprocessing processes.

Currently the data is partitioned on 1.0 deg tiles; without this, access via S3 is quite slow. The pre-loading and distribution of the data/scanner will have to obey the partitioning used.

Is this sequence sensible/possible? NB: we can filter on the partition, so it doesn't have to be explicitly defined when creating the dataset object (https://arrow.apache.org/docs/python/dataset.html#reading-partitioned-data)

chrisdicaprio commented 3 months ago

We found that pre-loading and sharing a dataset is not possible with multiprocessing and pyarrow. All loading and filtering is done by each process separately.