fernando-aristizabal closed this issue 3 years ago
Changing the name of this task to reflect changes in strategy. DMOD contributions are on hold indefinitely. Priorities, in order:

1. Manually run a regional-scale test with all available input data on fim_share at HUC6 and HUC8 job sizes.
   a. Provide computational results (CPU time, concurrent jobs, peak RAM per job, storage use, etc.) for both sets and a summary of exit-code statuses.
2. Manually run the CONUS & PR domain.
   a. Provide the same information as above.
3. Script production runs and regression tests so that these use cases can be called by the dev team when required. This will be tracked as a separate issue once 1 and 2 are complete.
Thanks!
@nickchadwick-noaa @frsalas-noaa @ZacharyWills
@nickchadwick-noaa what would be your best ETA for demonstrating (1)? Same question for (2)? Note that for (2) we'll need to download data covering the entire U.S.
@nickchadwick-noaa @frsalas-noaa We have a meeting Friday to review results on task 1. Assuming that goes fine, we will download the remaining CONUS/PR data onto fim_share over the weekend and scp/rsync it over on Monday.
Can we get an outline of the metrics we want to collect? I just want to be sure we capture all of them on the longer runs.
Cool. When downloading, please also download the NHDPlus data over Hawaii. We'll need that later.
@ZacharyWills The computational metrics for fim_run.sh are written per job within each job's log. @nickchadwick-noaa knows what's up here. They are all within the verbose GNU time output. The relevant ones are wall-clock time and "Maximum resident set size" (peak RAM). If we happen to swap, we would need to capture the "Major page faults" line too. Knowing the number of jobs we ran, we can compile descriptive statistics like median, mean, and standard deviation. We also want to document the value passed for concurrent jobs (the -j argument to fim_run.sh). We don't have an elegant way of collecting these metrics across jobs right now, but some simple recursive greps work fine for me.
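The recursive-grep approach above could look something like the sketch below. The `logs/` layout and the memory values are made up for illustration; the line labels match GNU time's verbose (`-v`) output.

```shell
# Fake logs standing in for per-job GNU `time -v` reports (illustrative values).
mkdir -p logs/job1 logs/job2
printf '\tMaximum resident set size (kbytes): 4700000\n' > logs/job1/fim.log
printf '\tMaximum resident set size (kbytes): 5100000\n' > logs/job2/fim.log

# Peak RAM across jobs, with simple descriptive stats (kbytes).
grep -rh "Maximum resident set size" logs/ \
  | awk '{sum+=$NF; if($NF>max) max=$NF; n++}
         END {printf "jobs=%d mean=%.0f max=%d\n", n, sum/n, max}'
# → jobs=2 mean=4900000 max=5100000

# Swap check: any nonzero "Major (requiring I/O) page faults" lines would flag swapping.
grep -r "Major (requiring I/O) page faults" logs/ || true
```

The same pattern works for the "Elapsed (wall clock) time" lines, though those need an extra parsing step since GNU time formats them as h:mm:ss.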
Nice! We probably want to take a day or two and add this kind of info to the README.md at the root of the repo.
Yeah, ideally these computational metrics would be parsed by the regression-testing functionality, then logged and reported.
I ran the first successful bulk parallel run on the Production machine. Here are some of the metrics I came up with.
The run consisted of 22 jobs with 1 HUC8 per job.
No memory swapping occurred, per the check requested above.
The next steps will be to download CONUS data to the Production machine, further refine the production Docker image, add scheduling to the bulk parallel run process, and capture progress and other metrics while everything is running.
The plan for scheduling as of right now is to define a hard memory limit that a job can have, and continue to schedule jobs until all of CONUS is done. The number of jobs running at any given time will probably be something like `allowed_running_jobs = (system_memory - reserved_memory) / job_memory_limit`.
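As a minimal sketch of that formula, with illustrative memory values (not measurements from the Production machine):

```shell
# All values in GB; these are assumptions for illustration only.
system_memory_gb=256
reserved_memory_gb=32    # headroom for the OS and other services
job_memory_limit_gb=8    # hard per-job memory cap
allowed_running_jobs=$(( (system_memory_gb - reserved_memory_gb) / job_memory_limit_gb ))
echo "$allowed_running_jobs"   # → 28
```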
Another thing to consider in determining the number of concurrent jobs is CPU usage. Each job appears to saturate one CPU core, so the number of concurrent jobs will not be able to usefully exceed the number of cores on the Production machine (I'm not sure if I'm allowed to say here how many cores the machine has, which is why I'm being vague).
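Combining the two constraints, the job count could be capped at the machine's core count (via `nproc` from GNU coreutils); `mem_jobs` here is an illustrative value standing in for the memory-formula result:

```shell
mem_jobs=28            # illustrative result of the memory formula
cores=$(nproc)         # number of cores on this machine
jobs=$(( mem_jobs < cores ? mem_jobs : cores ))
echo "$jobs"
```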
@fernando-aristizabal is planning the download of the CONUS data for next week and then soon after we can do a larger scale test to determine the proper memory constraints to put on a single job.
Thanks for the update. Were we able to mount the shared directory to the MaaS machines or do we need to transfer data from one location to the other?
I think a discussion needs to be had on versioning the data with the software. The team has thought about this a bit, but having some sort of paired tag, e.g. [test version [software v1.2][data v2.2]], might be helpful, with a checksum of each piece so we can assure that when we run either tests or prod, the version we're using is the right one, and that it's a reproducible improvement when we make a change. Let me know what you all think.
Yes, we do. DVC (dvc.org) is the plan.
Totals: 230 HUC4, 359 HUC6, 2201 HUC8 in NWM Domain
Moving updated NWM inputs that include HI to the dev drive. Also downloading input data for the entire NWM domain from USGS onto the dev drive. We can rsync over to prod early next week for a first attempt at an NWM-domain run.
2-3 hours of expected run time at the HUC8 job level, so that should be enough to find the non-zero exit codes and start debugging.
Does the 2-3 hour figure reflect a test run or the full CONUS expectation?
All non-zero exit codes have been addressed for all HUC8s in the entire FIM domain, which includes CONUS, Hawaii, Puerto Rico, Canada, and Mexico. The FIM domain is the intersection of the NWM domain with the NHD+HR domain.
The only non-zero exit codes still thrown come from three HUC8s that have partial NHD+HR data (they lack NHD+HR burn lines or flow lines): 04240001 (Manitoulin Island, Ontario, Canada), 09040003 (Alberta, Canada), and 10170104 (South Dakota & Nebraska border). This will be tracked in a separate issue where the assigned dev should either do some sort of geometric check or attribute query for full input-data availability (the most correct method) or hard-code these out of the WBD_National.gpkg WBDHU8 layer (less programmer time).
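The "hard code these out" interim option could be as simple as filtering those three HUC8s from the run queue. `huc8_queue.txt` below is a hypothetical one-HUC-per-line file; only the three excluded HUC8s come from the comment above, the rest is sample data:

```shell
# Hypothetical run queue; 12090301 stands in for the ~2,200 other HUC8s.
cat > huc8_queue.txt <<'EOF'
04240001
09040003
10170104
12090301
EOF
# Drop the three partial-NHD+HR HUC8s before dispatching jobs.
grep -v -e '^04240001$' -e '^09040003$' -e '^10170104$' \
  huc8_queue.txt > huc8_queue_filtered.txt
cat huc8_queue_filtered.txt   # → 12090301
```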
Since prod is more than one node, we should probably aim to pre-split the entire FIM-domain queue into 1 to X queues balanced by square kilometers, as a pseudo load balancer for now. Once this is demonstrated, this issue can be closed.
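One simple way to do the square-km split is a greedy assignment: sort HUCs by area descending and hand each to the currently lightest queue. This is a hypothetical sketch; the `hucs.txt` areas are made-up sample data, not real HUC8 areas:

```shell
# hucs.txt: "HUC8 area_sq_km" pairs (illustrative values).
N=2
cat > hucs.txt <<'EOF'
10170104 9800
04240001 7600
09040003 6100
12090301 5400
01010002 4200
EOF
# Largest-first greedy bin packing into N queue files.
sort -k2,2nr hucs.txt | awk -v n="$N" '{
  best = 1
  for (i = 2; i <= n; i++) if (load[i] < load[best]) best = i
  load[best] += $2
  print $1 > ("queue_" best ".txt")
} END { for (i = 1; i <= n; i++) printf "queue %d: %d sq km\n", i, load[i] }'
# Prints the per-queue totals:
#   queue 1: 15200 sq km
#   queue 2: 17900 sq km
```

Each `queue_<i>.txt` can then be dispatched to its own production node.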
Hello all,
Here are the latest numbers on a CONUS run of the FIM 3 software. These estimates should be very accurate, since this sample already covered about 99% of CONUS (minus a handful of non-zero exits and a few missing HUCs).
- Average memory usage per container: 4.64 GB
- Average runtime per container: 23.91 minutes
- Total output size: 4552.98 GB
- Average output size per container: 2.11 GB
Estimated runtime for CONUS:

- 1 machine: 10.31 hours
- 2 machines: 5.15 hours
- 3 machines: 3.43 hours
- 4 machines: 2.57 hours

Estimated output size per machine for CONUS:

- 1 machine: 4.53 TB
- 2 machines: 2.26 TB
- 3 machines: 1.51 TB
- 4 machines: 1.13 TB
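The per-machine estimates above scale as the single-machine totals divided evenly across machines (which assumes a perfectly balanced split with no coordination overhead). A quick sketch of that derivation:

```shell
# Single-machine totals taken from the figures above.
total_runtime_h=10.31
total_output_tb=4.53
for m in 1 2 3 4; do
  awk -v t="$total_runtime_h" -v s="$total_output_tb" -v m="$m" \
    'BEGIN { printf "machines=%d runtime=%.2f h output=%.2f TB\n", m, t/m, s/m }'
done
```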
The final step for this issue will be to utilize both production machines, which will require a script that splits the inputs and starts the process on each machine.
In addition to that, we will need to figure out the best way to concatenate and move the outputs to their final destination.
I have created an additional issue (#60) for running FIM 3 across multiple production machines, as this issue was only about demonstrating a full FIM-domain run on the production hardware, which has now been done successfully.
Additionally, the production Dockerfile is being tracked in issue #5, so there is no pull request for this issue specifically.
Enable MaaS framework to work with FIM 3+