DOI-USGS / lake-temperature-process-models

Creative Commons Zero v1.0 Universal

Pipeline hanging on builds? #9

Closed hcorson-dosch-usgs closed 2 years ago

hcorson-dosch-usgs commented 2 years ago

I noticed today while building the pipeline repeatedly that it was hanging on builds of p2_glm_uncalibrated_runs and p2_glm_uncalibrated_output_feathers -- i.e., it would start to build targets but then hang without printing the typical build statements. In most cases, it would eventually print out a bunch in a row and successfully complete the build, but a few times I had to stop the build and call tar_make() again to get it to finish.

I was pushing the pipeline hard today - tweaking code and triggering lots of rebuilds repeatedly, so I'm not sure if this is actually an issue or not. @lindsayplatt and @jread-usgs, if you build while reviewing #6, maybe just comment here if you notice the pipeline hanging on either of those two p2 targets?

jordansread commented 2 years ago

My builds seemed to work fine on #6, but I was routinely failing on one of the runs: [image]

I was assuming that was related to missing a meteo file for that lake or something.

hcorson-dosch-usgs commented 2 years ago

That means that one of the glm runs for that lake-gcm combo failed. When a run fails, meteo_fl and meteo_fl_hash are both set to NA in p2_glm_uncalibrated_runs, which means the p2_glm_uncalibrated_output_feathers branch for that lake-gcm combo can't be built. Have you checked the p2_glm_uncalibrated_runs tibble to see if there was a failed run noted in there?
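For anyone checking the same thing, here is a minimal sketch of how failed runs could be pulled out of the runs table by looking for the NA markers. It uses a toy data frame in place of the real tibble; the site_id and gcm columns and all of the values are hypothetical, and only meteo_fl, meteo_fl_hash, and glm_code are names taken from this thread.

```r
# Toy stand-in for the p2_glm_uncalibrated_runs tibble.
# site_id, gcm, and every value below are hypothetical.
runs <- data.frame(
  site_id = c("lake_a", "lake_a", "lake_b"),
  gcm = c("gcm_1", "gcm_2", "gcm_1"),
  meteo_fl = c("a_1.csv", NA, "b_1.csv"),
  meteo_fl_hash = c("h1", NA, "h3"),
  glm_code = c(0, 0, 0),
  stringsAsFactors = FALSE
)

# A run is treated as failed when either meteo column is NA.
failed_runs <- runs[is.na(runs$meteo_fl) | is.na(runs$meteo_fl_hash), ]
print(failed_runs)
```

In the pipeline itself, the real table could presumably be loaded with targets::tar_read(p2_glm_uncalibrated_runs) and filtered the same way.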

jordansread commented 2 years ago

Yes, I seem to have a failed run

p2_glm_uncalibrated_runs %>% print(n = 100)

[image]

hcorson-dosch-usgs commented 2 years ago

ohh weird, but the glm_code is 0 which should only be the case if it's successful 🤔. You may have turned up a bug.

jordansread commented 2 years ago

the code isn't totally reliable as a measure of whether the models fully ran. I think you can get a 0 if the model doesn't make it all the way through the simulation period, for example. I used to check this by verifying that the end of the time period in the .nc file lined up with the final date in the nml (stop). Could be that issue(?)
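A rough sketch of that kind of check, assuming the last output time and the nml stop date have already been read into R. The helper name run_reached_stop, its arguments, and the one-day tolerance are all hypothetical; reading the values out of the .nc and nml files is left out.

```r
# Hypothetical helper: TRUE when the last simulated time step falls within
# `tolerance_days` of the stop date configured in the nml.
run_reached_stop <- function(last_output_time, nml_stop, tolerance_days = 1) {
  last_output_time <- as.Date(last_output_time)
  nml_stop <- as.Date(nml_stop)
  # Subtracting two Date objects gives a difference in days.
  as.numeric(nml_stop - last_output_time) <= tolerance_days
}

# Made-up dates: one run that finished and one that stopped early.
full_run <- run_reached_stop("2000-12-31", "2000-12-31")
short_run <- run_reached_stop("2000-06-15", "2000-12-31")
```

A check like this could sit alongside the exit code, so that a run only counts as successful when the code is 0 and the output reaches the stop date.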

hcorson-dosch-usgs commented 2 years ago

Ah, that's really good to know. Yes, sounds like we might need a more robust check of whether the run was successful. But what's odd here is that the export_fl name should, I think, only have been set to NA (i.e. reached the error function of the tryCatch) if the code was not 0, so I'll need to think through how that might have happened when I get the chance later.
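To make the pattern being described concrete, here is a minimal sketch of a tryCatch that only returns the export file name when the exit code is 0. This is not the repo's actual code: run_and_record and run_model are hypothetical names, with run_model standing in for whatever function runs GLM and returns its exit code.

```r
# Hypothetical wrapper, not the pipeline's real function.
run_and_record <- function(run_model, export_fl) {
  tryCatch({
    glm_code <- run_model()
    # A non-zero exit code is turned into an R error.
    if (glm_code != 0) stop("GLM exited with code ", glm_code)
    list(glm_code = glm_code, export_fl = export_fl)
  }, error = function(e) {
    # The error function: file name is recorded as NA.
    list(glm_code = NA, export_fl = NA_character_)
  })
}

ok <- run_and_record(function() 0, "out.feather")
bad <- run_and_record(function() 1, "out.feather")
```

Note that in a structure like this, any error raised inside the tryCatch expression reaches the same error function, whatever the exit code was.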

hcorson-dosch-usgs commented 2 years ago

Closing this issue b/c it doesn't seem to have persisted and we've added additional checks to confirm that the run was successful in #10