NOAA-OWP / inundation-mapping

Flood inundation mapping and evaluation software configured to work with U.S. National Water Model.

[13pt] Enable FIM 3 to run on Prod #8

Closed fernando-aristizabal closed 3 years ago

fernando-aristizabal commented 4 years ago

Enable MaaS framework to work with FIM 3+

fernando-aristizabal commented 4 years ago

Changing the name of this task to reflect changes in strategy. DMOD contributions are put on hold indefinitely. Priorities are (in order):

1. Manually run a regional-scale test with all available input data on fim_share at HUC6 & HUC8 job sizes.
   - Provide computational results (CPU time, concurrent jobs, peak RAM per job, storage use, etc.) for both sets and a summary of exit code statuses.
2. Manually run the CONUS & PR domain.
   - Provide the same information as above.
3. Script production runs and regression tests so that these use cases can be called by the dev team when required. This will be tracked as a separate issue once 1 and 2 are complete.

Thanks!

@nickchadwick-noaa @frsalas-noaa @ZacharyWills

frsalas-noaa commented 4 years ago

@nickchadwick-noaa what would be your best estimate on ETA for demonstrating (1)? Same question for demonstrating (2)? Note, for (2) we'll need to download the data over the entire U.S.

fernando-aristizabal commented 4 years ago

@nickchadwick-noaa @frsalas-noaa We have a meeting Friday to review results on task 1. Assuming that goes fine, we will download the remaining CONUS/PR data over the weekend onto fim_share and scp/rsync it over on Monday.

ZacharyWills commented 4 years ago

Can we get an outline of the metrics we want to collect? I just want to be sure we capture all of them on the longer runs.

frsalas-noaa commented 4 years ago

Cool. When downloading, please also download the NHDPlus data over Hawaii. We'll need that later.

fernando-aristizabal commented 4 years ago

@ZacharyWills The computational metrics for fim_run.sh are written per job within each job's log. @nickchadwick-noaa knows what's up here. They are all in the full-format GNU time output. The relevant ones are wall-clock time and maximum resident set size (RAM). If we happen to swap, we would need to capture the major page faults line too. Knowing the number of jobs we ran, we can compile descriptive statistics like median, mean, and standard deviation. We also want to document the value passed for concurrent jobs (the -j argument in fim_run.sh). We don't have an elegant way of collecting these metrics across jobs right now, but some simple recursive greps work fine for me.
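
For illustration only, a minimal sketch of compiling those statistics across job logs (the log directory, file pattern, and exact field strings are assumptions based on full-format `/usr/bin/time -v` output):

```python
import re
from pathlib import Path
from statistics import mean, median, pstdev

# Hypothetical location of the per-job logs; adjust to the actual layout.
LOG_DIR = Path("/data/outputs/logs")

RAM_RE = re.compile(r"Maximum resident set size \(kbytes\): (\d+)")
WALL_RE = re.compile(r"Elapsed \(wall clock\) time .*: ([\d:.]+)")
FAULT_RE = re.compile(r"Major \(requiring I/O\) page faults: (\d+)")

def to_minutes(elapsed: str) -> float:
    """Convert GNU time's h:mm:ss or m:ss.ss elapsed string to minutes."""
    seconds = 0.0
    for part in elapsed.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds / 60

rams, walls, major_faults = [], [], []
for log in LOG_DIR.rglob("*.log"):  # recursive walk, like a recursive grep
    text = log.read_text(errors="ignore")
    if m := RAM_RE.search(text):
        rams.append(int(m.group(1)) / 1024**2)  # kbytes -> GB
    if m := WALL_RE.search(text):
        walls.append(to_minutes(m.group(1)))
    if m := FAULT_RE.search(text):
        major_faults.append(int(m.group(1)))

for label, values in [("Peak RAM (GB)", rams), ("Wall clock (min)", walls)]:
    if values:
        print(f"{label}: n={len(values)} mean={mean(values):.2f} "
              f"median={median(values):.2f} std={pstdev(values):.2f}")
print(f"Jobs with major page faults: {sum(f > 0 for f in major_faults)}")
```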

ZacharyWills commented 4 years ago

Nice! We probably want to take a day or two and add this kind of info to the README.md at the root of the repo.

fernando-aristizabal commented 4 years ago

Yeah, ideally these computational metrics would be parsed by the regression testing functionality, then logged and reported.

nickchadwick-noaa commented 4 years ago

I ran the first successful bulk parallel run on the Production machine. Here are some of the metrics I came up with.

The run consisted of 22 jobs with 1 HUC8 per job.

No memory swapping happened, as was asked above.

The next steps will be to download CONUS data to the Production machine, further refine the production Docker image, add scheduling to the bulk parallel run process, and capture progress and other metrics while everything is running.

The plan for scheduling as of right now is to define a hard memory limit that a job can have, and continue to schedule jobs until all of CONUS is done. The number of jobs running at any given time will probably be something like `allowed_running_jobs = (system_memory - reserved_memory) / job_memory_limit`.

Another thing to consider in determining the number of jobs that can run at one time is CPU usage. Each job seems to use one CPU core, meaning that the number of concurrent jobs will not be able to exceed the number of cores on the Production machine (I'm not sure if I'm allowed to say here how many cores the machine has, which is why I'm being vague).
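
For illustration, a minimal sketch of that calculation, combining the memory formula above with the one-core-per-job constraint (the memory figures below are placeholders, not the Production machine's actual specs):

```python
import os

def allowed_running_jobs(system_memory_gb: float,
                         reserved_memory_gb: float,
                         job_memory_limit_gb: float,
                         cpu_cores: int) -> int:
    """Concurrent jobs bounded by the memory formula above and by core count."""
    memory_bound = int((system_memory_gb - reserved_memory_gb) // job_memory_limit_gb)
    return max(1, min(memory_bound, cpu_cores))

# Example with made-up memory figures; one core per job, as observed above.
print(allowed_running_jobs(system_memory_gb=256.0,
                           reserved_memory_gb=32.0,
                           job_memory_limit_gb=8.0,
                           cpu_cores=os.cpu_count() or 1))
```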

@fernando-aristizabal is planning the download of the CONUS data for next week and then soon after we can do a larger scale test to determine the proper memory constraints to put on a single job.

frsalas-noaa commented 4 years ago

Thanks for the update. Were we able to mount the shared directory to the MaaS machines or do we need to transfer data from one location to the other?

ZacharyWills commented 4 years ago

I think a discussion needs to be had on versioning the data along with the software. The team has thought about this a bit, but having some sort of paired test version tag, e.g. [test version [software v1.2][data v2.2]], might be helpful, with a checksum of each piece so we can ensure, when we run either tests or prod, that the version we're using is the right one, and that it's a reproducible improvement when we make a change. Let me know what you all think.
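
For illustration only, a minimal sketch of that kind of paired software/data manifest with per-file checksums (paths, filenames, and version strings are hypothetical):

```python
import hashlib
import json
from pathlib import Path

def sha256sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large rasters don't load into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(data_dir: Path, software_version: str, data_version: str) -> dict:
    """Pair a software version with a data version plus a checksum per file."""
    return {
        "software_version": software_version,  # e.g. "v1.2" (hypothetical)
        "data_version": data_version,          # e.g. "v2.2" (hypothetical)
        "files": {str(p.relative_to(data_dir)): sha256sum(p)
                  for p in sorted(data_dir.rglob("*")) if p.is_file()},
    }

if __name__ == "__main__":
    manifest = build_manifest(Path("/data/inputs"), "v1.2", "v2.2")
    Path("manifest.json").write_text(json.dumps(manifest, indent=2))
```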

fernando-aristizabal commented 4 years ago

Yes we do. dvc.org is the plan.

fernando-aristizabal commented 4 years ago

Totals: 230 HUC4, 359 HUC6, 2201 HUC8 in NWM Domain

Moving updated NWM inputs that include HI to the dev drive. Also downloading the entire NWM domain input data from USGS onto the dev drive. We can rsync over to prod early next week for a first attempt at an NWM domain run.

fernando-aristizabal commented 4 years ago

2-3 hours of expected run time at the HUC8 job level, so we should be in good shape to find the non-zero exit codes and start debugging.

ZacharyWills commented 4 years ago

Does the 2-3 hours time mentioned reflect a test run or CONUS expectation?

fernando-aristizabal commented 4 years ago

All non-zero exit codes have been addressed across all HUC8s in the entire FIM domain, which includes CONUS, Hawaii, Puerto Rico, Canada, and Mexico. The FIM domain is the intersection of the NWM domain with the NHD+HR domain.

The only non-zero exit codes thrown are for three HUC8s that have partial NHD+HR data (they do not have NHD+HR burn or flow lines): 04240001 (Manitoulin Island, Ontario, Canada), 09040003 (Alberta, Canada), and 10170104 (South Dakota/Nebraska border). This will be tracked in a separate issue, where the assigned dev should either do some sort of geometric check or attribute query for full input data availability (most correct method) or hard-code these out of the WBDHU8 layer of WBD_National.gpkg (less programmer time).
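
For illustration only, a minimal sketch of the hard-coding option (assumes geopandas; the HUC attribute column name and output filename are assumptions to verify against the actual WBDHU8 layer):

```python
import geopandas as gpd

# HUC8s with partial NHD+HR data, per the comment above.
PARTIAL_HUCS = {"04240001", "09040003", "10170104"}

wbd = gpd.read_file("WBD_National.gpkg", layer="WBDHU8")
wbd = wbd[~wbd["HUC8"].isin(PARTIAL_HUCS)]  # column name is an assumption
wbd.to_file("WBD_National_filtered.gpkg", layer="WBDHU8", driver="GPKG")
```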

fernando-aristizabal commented 4 years ago

#39 has been created to address these three remaining non-zero exit codes.

Since prod is more than one node, we should probably aim to pre-split the entire FIM domain queue into 1 to X queues based on a balanced square-km criterion to pseudo load balance for now. Once this is demonstrated, it should complete this issue.
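
For illustration, a minimal sketch of a greedy split into X queues balanced on square km (the areas below are made up; the real values would come from the WBD HUC8 layer):

```python
from __future__ import annotations

def split_queue(huc_areas: dict[str, float], n_queues: int) -> list[list[str]]:
    """Greedy longest-processing-time split: assign each HUC8, largest area
    first, to the queue with the smallest running total of square km."""
    queues: list[list[str]] = [[] for _ in range(n_queues)]
    totals = [0.0] * n_queues
    for huc, area in sorted(huc_areas.items(), key=lambda kv: kv[1], reverse=True):
        i = totals.index(min(totals))
        queues[i].append(huc)
        totals[i] += area
    return queues

# Example with made-up areas in square km.
print(split_queue({"01010001": 3100.0, "01010002": 2500.0,
                   "01010003": 2900.0, "01010004": 1800.0}, n_queues=2))
```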

nickchadwick-noaa commented 4 years ago

Hello all,

Here are the latest numbers on a CONUS run of the FIM 3 software. These estimates should be very accurate, as this sample was already about 99% of CONUS (minus a handful of non-zero exits and a few missing HUCs).

- Average Memory Usage Per Container: 4.64 GB
- Average Runtime Per Container: 23.91 Minutes
- Total Output Size: 4552.98 GB
- Average Output Size Per Container: 2.11 GB
- Estimated Runtime For CONUS (using 1 machine): 10.31 Hours
- Estimated Runtime For CONUS (using 2 machines): 5.15 Hours
- Estimated Runtime For CONUS (using 3 machines): 3.43 Hours
- Estimated Runtime For CONUS (using 4 machines): 2.57 Hours
- Estimated Output Size Per Machine For CONUS (using 1 machine): 4.53 TB
- Estimated Output Size Per Machine For CONUS (using 2 machines): 2.26 TB
- Estimated Output Size Per Machine For CONUS (using 3 machines): 1.51 TB
- Estimated Output Size Per Machine For CONUS (using 4 machines): 1.13 TB

The final step for this issue will be to utilize both of the production machines, which will require a script of sorts that splits the inputs and starts the process on each production machine.

In addition to that, we will need to figure out the best way to concatenate and move the outputs to their final destination.

nickchadwick-noaa commented 3 years ago

I have created an additional issue (#60) that deals with running FIM 3 across multiple production machines, as this issue was only about demonstrating a full FIM domain run on the production hardware, which has now been done successfully.

Additionally, the production Dockerfile is being tracked in issue #5, so there is no pull request for this issue specifically.