[EPIC] Fixes, Enhancements and Improvements for CatFIM

EmilyDeardorff commented 4 months ago

This is an EPIC card. As items are taken off this list, they will get moved into their own cards, referenced back to this as a total list of all known items, mostly are brand new so prioritization is required

Currently, the CatFIM process takes a very long time to run (~25 hrs for a full BED run) and lacks efficiency both for full runs and for partial/testing runs. Update the CatFIM scripts to incorporate more efficient multiprocessing and segmented run capabilities (i.e. running only on a list of HUCs and/or a flag that allows CatFIM to re-run without downloading all of the metadata again).

These are above and beyond finishing the AK integration

Obviously, we won't get all in, but this is a list of all possible fixes. We will extract out some for the next CatFIM release when the AK integration is done. However, some in this list are very important.

Potential CatFIM Fixes and Upgrades:

1) (Lower priority, should be fixed by AWS shift) Improve efficiency for calling the WRDS API. In our "get_thresholds" in generate_categorical_fim_flows.py, it calls WRDS to get some nws_lid but only passes in lid data and not huc data. BUT... our process_generate_flow function, it calls that get_threshhold for every huc and every ahps site. This is a massive amount of duplication. We can call WRDS for each ahps or even all ahps via the specific WRDS api call, save it to our disk and use it. We can likely even load it in certain parts of the code and pass that dataframe around to other functions including into a Multi-Proc. This would be a huge performance gain.

2) (Lower priority, should be fixed by AWS shift) Facilitate a pre-download option for the threshold files (similar to what we've done with the metadata files). We now have the ability to save metadata file from WRDS in generate_categorical_fim_flows.py but can we make a similar system for the threshold files so we can run catfim tools in a non OWP server enviro? Assuming that the fix has not yet been made to let AWS all the WRDS api. Can we just get a new DB or something we can put on a file server instead of making 5,000+ Web API calls which is a huge amount of overhead?

3) Review and maybe upgrade the HUC_messages system. The system creates a file per HUC / lid which says if if the lid was loaded or if it failed and why. It later reloads all of the files in the "huc_messages" folder, maps them to each record for the output csv with some saying why they failed. There are huge opportunities for improvement here, but it does work so it may not be worth the effort to fix. The current setup is "multi-proc" safe. Attached is one example of one ahps site that was not fully able to load catfim and the message why. The image is in two parts, limitations of screen caps.

Examples of how huc_messages turn into status and end up in the UI. Catfim_Stage_example_failed_load_in_UI_part_1

4) (Priority) Add in an argument to remove intermediate extent files. In the mapping/{huc}/{lid} are a bump of inundated images changed extent files. It is done to each branch, then merged to a single "extent" file for that {huc}/{lid}/{lid}_{magnitude}_extent.tif. We obviously need to keep the final extent files per magnitude and not all of the branch intermediary files. Note: that system is now in but the code commented for version or two for tracing reasons. Maybe uncomment in the next HV 2.1.8 release (in a few months) to clean up all of those intermediate tif files as there is A LOT. Update: We have commented out code in there to delete them, just make it driven by an arg to delete or not (default delete). Can helps with debugging during dev. Add an argument in for deleting or not, default to delete. Nice to have when code/debugging and most of the code is there, easy to finish. (1286)

5) Fix duplication of processing paths for stage and flow based as it calls functions in the code.

6) (Should be fixed by AWS shift) Adjust and review the multi-proc job number arguments. We have three job numbers, but we have some parts that are MP, inside MP, inside MP. ie process_generate_categorical_fim -> (MP) iterate_through_huc_stage_based -> (MP) produce_stage_based_catfim_tifs -> (MP) produce_inundation_map_with_stage_and_feature_ids. This may not be the most efficient use of jobs or even MP but needs to be reviewed. There are others such as post_process_cat_fim_for_viz -> (MP) post_process_huc_level -> (MP) reformat_inundation_maps

7) Test and re-eval the "past_major_interval_cap" system. (argument of -mc / --past_major_interval_caps). Does it still work? Do we want to keep it and maintain that code? -> Yes! (as of 9/11/24) 1283

8) Test and eval if we want the search system (argument of -s / --search)). Does it still work? Do we want to keep it and maintain that code? -> Yes! (as of 9/11/24) 1283

9) ~Test and fix OR remove the lid choice system (arguments of -l / --lid_to_run). It has been broken since winter 2024. Could be handy but might not be worth the effort to fix it.~ -> 09/11/24 - Decided to remove this feature in lieu of finishing adding the new list of HUCs system, # 10 below. 1287

10) (Priority) Fix Emily's list of HUCs to run argument (argument of -lh / --lst_hucs). This is very handy, new to CatFIM 2.0 but we temp disabled it, so it should be quick to re-enable. Lower priority for v2.1, but nice to get in there. Prioritized over #9 as of 9/11/24

11) Finish the new "step" system, which allows for us to skip flows if all of the files are already in place to go straight to inundation. Maybe consider adding more "step" points?

12) Review test / fix using the generate_categorical_fim_mapping.py and generate_catfim_flow.py scripts to see if they work from command line. I rebuilt the generate_categorical_fim_mapping.py part way through the rebuild but did not come back to finish it. The generate_catfim_flow.py script is definitely broken. More of a rewrite task, lower priority as of 9/11/24.

13) (Priority) Clean up doc string notes below most functions. Most are very, very out of date. For now, either finish a doc string section to be correct or remove if not. Important! But doesn't need to be its own issue. Good to be fixing as we go.

14) (Priority) Review and cleanup the lid sites rule such as special cases for lids based on crs, datum_adjust, etc. It is now it's own issue 1272 Big deal! Out-of-date hard coding could be causing a lot of the issues in the subsequent bullet points. A note from Carson: Would be good to add "and" statements to make sure that the known issue is true before applying a hard-coded fix. Pseudocode example: if LID == fhjs5 AND datum == NULL, set datum = ‘NAD83’

Note from Derek: It would be good to add the AHPS restricted sites list to the catfim folder of our online inundation-mapping repo so that the field can check and provide feedback on whether certain sites can be removed from the list. Would be good to sort the LIDs by RFC and then alphabetical order. new service / feature someday? xls list? other?

15) (Priority) Review Errors from the 4.5.2.11 flow based and stage based full runs. 1273.

16) (Priority) Review / Fix logic for Stage Based Sea Level sites. 1274.

17) (Priority) Ensure that the stage_based_catfim_sites.csv file has the 'nws_lid' column renamed to 'ahps_lid'.

update. Now fixed in new dev-catfim-v2-1 branch.

18) (Priority) FIM 4.5.2.11 - CatFIM stage site issues - jrsu1 and okfi2 1275

19) (Added Sep 4, 2024) CatFIM - issues with new docker image. 1277

update. Now fixed in new dev-catfim-v2-1 branch.

20) (Added Sep 4, 2024) Review sites that were dropped since 4.4.0.0 to 4.5.2.11. They might be related # 1274, or 1277above but might not. Derek wants us to compile and review all. Rob will put a card in for it. Related: HydroVIS has to parts of there system which can drop ahps sites as well, so we will reconcile that with the HV team at the same time. In HV, they compare our sites coming in against a DB of their own named something like "ahps_restricted_sites" and drops more. We need to reconcile that as part of this task, plus 1274 and 1275.

update: One of the five places we drop (or change mapped = no) for sites is the HV test against the "aphs_restricted_sites". Derek said it is fine for now but will double check later to see how up to date it is. While we won't worry about the status of the data in the aphs restricted sites db, we will pull that data and logic over to FIM catFIM processing code. 1288. All part of site reconciliation.
Update: Sep 23: For the "sites" list, it was discovered that if the t does appear that the sites that don't have all 4 thresholds were dropped in this new version. This is likely incorrect. But watch for 4.4.0.0 records where their might be missing threshold stages and all categories failed to map. We will need to confirm rules of when a record shoudl be dropped in relation to partial threshholds. This logic appears to have not changed sine 4.4.0.0, but may be wrong in the first place.
Update: Sep 23: Some HUC8s appear to have changed on some records. Research required.
Update: Sep 23: review actual status msg's. ie)
- 4.4.0.0 was: "large discrepancy in elevation estimates from gage and HAND and all categories failed to map". 4.5.2.11 now shows "large discrepancy in elevation estimates from gage and HAND"
  - 4.4.0.0 was "no stage values available and all categories failed to map", now "no stage values available"
  - 4.5.2.11: There are a bunch saying status of "undermined". This was a backup system and any record showing that value, means their is a code logic error.
  - 4.4.0.0 was: "likely unacceptable gage datum error or accuracy code(s); please see acceptance criteria and all categories failed to map". 4.5.2.11 is "usgs_elev_table missing, likely unacceptable gage datum error -- more details to come in future release"
  - Update Sep 23: How can there be any miss matches on AHPS id's. both 4.4.0.0 and 4.5.2.1, have some that do not matching ahps_id record.

21) *(Added Sep 6, 2024): For the all four .csv's that we create, drop the geometry column and create four more files with the name "_metadata" in them. This will make for smaller versions for purely analyzing non geo columns, like status, is mapped flag, and other values in it. It is purely a debugging tool but only takes 2 line of code to be added and would help quite a bit.

22) (Added 9/11/24): Make a CatFIM folder to store the CatFIM scripts!1282

23) (Added 9/9/24): Hawaii points disappeared from stage-based CatFIM in release 4.5.2.11 1279. Keep an eye on other tasks relating to lid reconciliation as they might/might not be related.

24) (Add 09/23/24): Ensure we compare mapped = Yes from out site list, and ensure there is at least one rec in library file. Emily will look into this.

RobHanna-NOAA commented 4 weeks ago

Some of these tidbits were fixed in PR 1165 which is the rebuild of CatFIM. However, this Issue will remain open as there are more things to cover, each a different importance levels.

EmilyDeardorff commented 2 weeks ago

Notes from 9/11/24 meeting

1 & 2: These issues will probably go away when we move CatFIM to AWS.

4: Should be quick.

5: A simple reorganization task.

6: A larger reorganization task, but will probably not be needed when we move CatFIM to AWS.

7&8: Need too make an issue, we have decided to maintain these functions. Would be good to clarify their function in docs and add comments in the code.

9&10: Choose one because they’re very similar. I think we will choose #10 because it’s closer to finished.

11: Do it!

12: More of a rewrite task, lower priority.

13: Important! But doesn’t need to be its own issue. Rather, this documentation is something that we can and should be fixing as we go.

14: Big! Out-of-date hard coding could be causing a lot of the issues noted in the subsequent bullet points. A note from Carson: Would be good to add “and” statements to make sure that the known issue is true before applying a hard-coded fix. Pseudocode example: if LID == fhjs5 AND datum == NULL, set datum = ‘NAD83’

It would be good to add the AHPS restricted sites list to the catfim folder of our online inundation-mapping repo so that the field can check and provide feedback on whether certain sites can be removed from the list. Would be good to sort the LIDs by RFC and then alphabetical order.

15: New issue: Make a CatFIM folder to store the CatFIM

RobHanna-NOAA commented 2 weeks ago

For Test and fix OR remove the lid choice system (arguments of -l / --lid_to_run). It has been broken since winter 2024. Could be handy but might not be worth the effort to fix it. -> Removed 9/11/24. I think we decided to drop that system in favor of finishing the filter by huc system (# 10?). I think it is not wise to leave in a feature that we don't know if it works especially if we are replacing it with # 10 hucs. I will put in a card to drop that feature.

CarsonPruitt-NOAA commented 2 weeks ago

This might be something we tackle after addressing all of this technical debt, but I wanted to track it here since it's a CatFIM task...

It's been suggested that we map Stage-Based CatFIM for sites that are not forecast sites. There are some AHPS points that are observation-only but still have flood thresholds that we could map. Even more sites for Stage-based (yay!).

NOAA-OWP / inundation-mapping

[EPIC] Fixes, Enhancements and Improvements for CatFIM #1182

This is an EPIC card. As items are taken off this list, they will get moved into their own cards, referenced back to this as a total list of all known items, mostly are brand new so prioritization is required

Notes from 9/11/24 meeting