aiddata / geo-datasets

Scripts for preparing datasets in GeoQuery
http://geoquery.org
MIT License

Use local scr for initial downloads / heavy writes #148

Open sgoodm opened 1 year ago

sgoodm commented 1 year ago

As a best practice, given the potential scale of many of our pipelines (i.e., potentially running dozens of tasks or more across nodes), I'd like to minimize continuous I/O on our main file system. This should be relatively simple: just use the node's local disk for downloads/processing before copying to the main file system. Even in scenarios where operations are quick and not I/O-heavy, this shouldn't slow jobs down much and is worth the extra peace of mind.

(Currently, heavy IO is seemingly causing extra issues on our aging file system, but we are in the process of moving everything to a brand new file system.)

sgoodm commented 1 year ago

@jacobwhall at some point we can update completed pipelines to adhere to this practice, but most of the recently completed ones should be pretty lightweight (and have already been fully run, so near-future I/O isn't an issue).

jacobwhall commented 1 year ago

Many of the dataset scripts currently follow a model like this:

  graph LR;
      id1(data source)-- download tasks -->raw_dir;
      raw_dir-- process tasks -->output_dir;
      output_dir-- ingest system -->GeoQuery;

Setting raw_dir to be in a local scratch folder (e.g. ~/lscr/TMPDIR on W&M HPC) would greatly reduce the I/O on shared filesystems, and we can make this the default location for raw_dir moving forward. Doing so will require changes to how the Dataset class currently assigns tasks to workers: since download tasks and process tasks (as in the graph above) are assigned separately, the worker that processes a file will likely be different from the one that downloaded it, and won't have that file available in its local scratch directory. We'll need to ensure that the same worker does both of these tasks.
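One way to guarantee that could look like the sketch below: fuse the download and process steps into a single task, so whichever worker picks it up runs both steps against its own local scratch and only the final product is copied to the shared filesystem. This is a hypothetical illustration, not existing code — `download_file`, `process_file`, and `download_and_process` are placeholder names, not part of the Dataset class.

```python
import shutil
import tempfile
from pathlib import Path

def download_file(url: str, dest: Path) -> Path:
    # placeholder for the real download step
    dest.write_text(f"data from {url}")
    return dest

def process_file(src: Path, dest: Path) -> Path:
    # placeholder for the real processing step
    dest.write_text(src.read_text().upper())
    return dest

def download_and_process(url: str, output_dir: Path) -> Path:
    """Run both steps as one task on one worker, so the intermediate
    raw file lives only on that node's local scratch disk."""
    with tempfile.TemporaryDirectory() as scratch:
        raw = download_file(url, Path(scratch) / "raw.dat")
        processed = process_file(raw, Path(scratch) / "processed.dat")
        # only the finished product touches the shared filesystem
        output_dir.mkdir(parents=True, exist_ok=True)
        final = output_dir / processed.name
        shutil.copy2(processed, final)
    return final
```

Submitting `download_and_process` as a single unit to the task runner sidesteps the co-location problem entirely, at the cost of coarser-grained scheduling.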

jacobwhall commented 1 year ago

From our conversation today, it sounds like raw_dir and output_dir both need to be permanent archives, so we can't have either of them point to a local scratch directory. Instead, we should specify a tmp_dir for tasks to write files into, then at the end of each task copy each file to its final destination. Perhaps a Dataset class function could manage the tmp_dir for us.
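A minimal sketch of what such a helper could look like, assuming a context-manager style: the task writes into node-local scratch, and the file is moved to its permanent location only once the task body finishes cleanly. The method name `tmp_to_dst` and its signature are assumptions for illustration, not an existing API.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path

class Dataset:
    @contextmanager
    def tmp_to_dst(self, final_dst: Path):
        """Yield a temporary path for a task to write to; on clean
        exit, move the file to final_dst on the shared filesystem."""
        final_dst = Path(final_dst)
        with tempfile.TemporaryDirectory() as tmp_dir:
            tmp_path = Path(tmp_dir) / final_dst.name
            yield tmp_path
            # reached only if the task body raised no exception
            final_dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(tmp_path), str(final_dst))
```

Usage inside a task might look like:

```python
ds = Dataset()
with ds.tmp_to_dst(Path("raw_dir/file.tif")) as p:
    p.write_text("downloaded bytes")  # heavy writes hit local disk only
```

A nice side effect: if the task fails midway, the partial file is discarded with the temp directory instead of leaving a corrupt artifact in raw_dir or output_dir.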