azavea / noaa-hydro-data

NOAA Phase 2 Hydrological Data Processing

Investigate generating Flood Inundation Maps (FIM) directly from s3 #108

Closed · vlulla closed this issue 1 year ago

vlulla commented 2 years ago

Perusing the shell scripts in the inundation-mapping repo to see how FIMs are generated, we see that the Python scripts expect the data to be located in the /data/ folder and the repo source to be located in the /foss_fim/src/ folder. Getting the Python repo source into the k8s cluster appears to be straightforward (it can be done with a custom Dockerfile), but how do we get the data into the k8s cluster? The data we need is hosted on the ESIP s3 bucket (s3://noaa-nws-owp-fim/). Additionally, we also need BLE forecast files for generating FIMs.

For our trial run of generating a FIM we were provided the BLE forecast files. From what I understand, these BLE forecast files are created by a proprietary procedure that is unavailable to us. So, in addition to using the cloud-hosted hydrofabric (from ESIP), we will also have to store these BLE forecast CSVs, possibly in another s3 bucket, and modify the Python (and/or bash) scripts so that they can use the hydrofabric and these BLE CSVs together to generate the FIM.

Therefore, we have to figure out how to restructure the Python scripts so that they can read data directly from s3 buckets (this was item 5 in Fernando's email from 2022.08.30!). Once we have the modified Python scripts, I believe that running them on our k8s cluster ought to be straightforward.

So, as I understand it, the question is basically: how do we get this

$ /foss_fim/tools/inundation.py -r /data/outputs/3dep_test_1202_10m_FR/12020001/rem_zeroed_masked.tif \
      -c /data/outputs/3dep_test_1202_10m_FR/12020001/gw_catchments_reaches_filtered_addedAttributes.tif \
      -b /data/outputs/3dep_test_1202_10m_FR/12020001/gw_catchments_reaches_filtered_addedAttributes_crosswalked.gpkg \
      -t /data/outputs/3dep_test_1202_10m_FR/12020001/hydroTable.csv \
      -f /data/test_cases/ble_test_cases/validation_data_ble/12020001/500yr/ble_huc_12020001_flows_500yr.csv \
      -i /data/temp/testing_inundation_fr_500yr_12020001.tif

to work without using data stored locally?
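
To make that concrete, here is a rough sketch (not the repo's actual code) of what reading these inputs directly from s3 could look like: rasterio and geopandas can open rasters and GeoPackages through GDAL's /vsis3/ virtual filesystem, and pandas can read s3:// URLs when s3fs is installed. The bucket prefixes below are placeholders, not the actual layout of the ESIP or azavea buckets.

# Hypothetical sketch: open inundation.py's inputs straight from S3 instead of
# the local /data/ tree. Bucket/prefix names are placeholders.
import pandas as pd
import geopandas as gpd
import rasterio

HYDROFABRIC = "noaa-nws-owp-fim/some/output/prefix"                    # placeholder prefix
FORECASTS = "azavea-noaa-hydro-data/FIM-example-data-from-Fernando"    # BLE CSVs (placeholder layout)

# rasterio forwards s3:// URLs to GDAL's /vsis3/ handler, so GeoTIFFs can be
# opened without downloading them first (AWS credentials must be available).
rem = rasterio.open(f"s3://{HYDROFABRIC}/rem_zeroed_masked.tif")
catchments = rasterio.open(f"s3://{HYDROFABRIC}/gw_catchments_reaches_filtered_addedAttributes.tif")

# GDAL can also read the GeoPackage over /vsis3/.
catchment_polys = gpd.read_file(
    f"/vsis3/{HYDROFABRIC}/gw_catchments_reaches_filtered_addedAttributes_crosswalked.gpkg"
)

# pandas reads s3:// URLs via fsspec/s3fs if those packages are installed.
hydro_table = pd.read_csv(f"s3://{HYDROFABRIC}/hydroTable.csv")
flows = pd.read_csv(f"s3://{FORECASTS}/ble_huc_12020001_flows_500yr.csv")

If inundation.py's -r/-c/-b/-t/-f arguments were opened this way, they could accept either local paths or s3:// URLs, and no volume mount would be needed at all.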

jpolchlo commented 2 years ago

It may be possible to mount S3 buckets as a volume and pass them into the pod that will run the FIM code. It's not exactly the recommended best practice, and if we're planning on doing additional development to the FIM code, we should just fix the file access to use the AWS SDK; but if the intent here is just to get a "proof of life" w.r.t. the provided code, then this might be an OK approach. Ultimately, this could be run through an Argo workflow with the required volume mounts.

jpolchlo commented 1 year ago

After talking through this problem, it seems likely that the best way to make these data available to the execution environment will be to create a persistent volume claim (PVC) that can be mounted into whichever pod runs the FIM workflow via Argo. To start this process, there are two primary steps:

  1. The docker image that provides the necessary environment needs to be built and uploaded to ECR (@vlulla will handle this)
  2. The PVC has to be created and pre-loaded with the contents of s3://azavea-noaa-hydro-data/FIM-example-data-from-Fernando/ (I'll take care of this; a rough sketch of the pre-load follows this list)
    • Note: I'm going to place this PVC in the argo namespace, since it doesn't seem like Dask is required to execute this job.
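
For item 2, a minimal sketch of how the pre-load could be done is below, assuming the PVC is mounted at /data inside a one-off loader pod with AWS credentials available; the mount path and the script itself are illustrations, not the actual loader.

# Hypothetical one-off loader: copy the S3 prefix into the mounted PVC so the
# FIM scripts see the /data layout they expect. Paths are assumptions.
import os
import boto3

BUCKET = "azavea-noaa-hydro-data"
PREFIX = "FIM-example-data-from-Fernando/"
DEST = "/data"  # where the PVC is mounted in the loader pod (assumed)

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith("/"):
            continue  # skip "directory" placeholder keys
        # Recreate the key's path relative to the prefix under the PVC mount.
        local_path = os.path.join(DEST, os.path.relpath(key, PREFIX))
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3.download_file(BUCKET, key, local_path)
        print(f"downloaded s3://{BUCKET}/{key} -> {local_path}")

The same thing could be done with aws s3 sync in an init container; the point is only that the PVC ends up holding the directory tree the scripts expect.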

Once the PVC exists, it can be mounted by the Argo workflow that will execute the job. We may have to figure out how to ensure that the worker pod is placed in the same AZ as the EBS volume. We can work together to write the workflow so that the volume is placed where it needs to be and there are enough compute resources available to the pod.

vlulla commented 1 year ago

This sounds great! We don't need both the FR and GMS 7z archives on the EBS volume while we are figuring this out. I believe it might be easier to begin with the FR dataset (the smaller of the two 7z archives) to figure out how to run this as an Argo workflow, and then we can try the bigger 7z (for GMS) later. And we'll definitely need the BLE (Base Level/Flood Elevation) CSVs to be included on the EBS volume regardless of whether we use the FR or GMS dataset.

jpolchlo commented 1 year ago

This appears to be settled. I've used the EFS volume strategy in #124 and we can access the files we need. Not elegant, but it is a solution. Efforts on the more elegant approach via FSx for Lustre (see azavea/kubernetes#40) did not go to plan. Closing, but feel free to reopen.