DOI-USGS / lake-temperature-process-models

Creative Commons Zero v1.0 Universal

Update shifter image on Denali to latest version of GLM, after initial testing #5

Open hcorson-dosch-usgs opened 2 years ago

hcorson-dosch-usgs commented 2 years ago

Once we have a functioning workflow on Denali, we'll want to update the shifter image that David and Alison created for the reservoir modeling to the latest version of GLM (it's now version 3.1), or just make a new image, if that is easiest.

hcorson-dosch-usgs commented 2 years ago

For the most recent GCM projections work Jordan was using version 3.2.0a3. However, there are no tags or releases on the AquaticEcoDynamics github beyond 3.1. When Jordan is back from leave we'll need to confirm with him how to access that version of GLM.

hcorson-dosch-usgs commented 2 years ago

@jread-usgs - pinging you here now that you're back. Our current shifter image that @jesse-ross created is using the latest AquaticEcoDynamics release. Did you want us to use a more recent version, and if so, how do we access it?

jordansread commented 2 years ago

This is pretty hacky, but because they aren't reliably doing releases, I've been using the change history of this file to find the commit where the version was updated, and then pulling the repo at that commit.
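
A minimal sketch of that lookup, assuming the version string lives in a tracked file. The function name and both arguments are hypothetical; the thread does not say which file records the GLM version. It relies on git's `-S` (pickaxe) search, which lists commits that add or remove a string, and takes the oldest match as the commit that introduced the version.

```shell
# Print the hash of the oldest commit that added or removed a version string
# in one file, i.e. the commit that first introduced that version.
# Both arguments are placeholders chosen by the caller.
first_commit_with_version() {
  local version="$1" file="$2"
  # --reverse lists oldest first; head keeps only the introducing commit.
  git log --reverse --format=%H -S"$version" -- "$file" | head -n 1
}

# Example, run inside a clone (file path is hypothetical):
#   git checkout "$(first_commit_with_version 3.2.0a3 path/to/version_file)"
```

Taking the oldest match matters because the later commit that bumps the version again also removes the old string, so it matches the same search.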

hcorson-dosch-usgs commented 2 years ago

Ok so looks like the latest there is version 3.2.0a6. I believe you tried that version for the initial GCM projections work, but were running into issues, so reverted to 3.2.0a3. Which version would you like us to use for the MN set of projections?

jordansread commented 2 years ago

given what we know now, 3.2.0a3 is best to use. But it would be great if our container recipe gave us the flexibility to move to a different version (or commit/tag) for GLM in the future. Unless #15 turns up an issue with this version not recognizing the param...
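
One way to get that flexibility, sketched under assumptions: pass the GLM ref to the build as a Docker build argument. `GLM_REF` is a hypothetical argument name; the Dockerfile would need a matching `ARG GLM_REF` and a checkout of that ref after cloning. The function below only prints the command (a dry run) and reuses the `jrossusgs/glm3r` tag pattern seen in this thread.

```shell
# Dry run: print the docker build command for a given GLM git ref.
#   $1 = commit hash or tag to build (passed as the hypothetical GLM_REF arg)
#   $2 = human-readable GLM version used in the image tag
#   $3 = image recipe version (defaults to v0.6, as in this thread)
glm_build_cmd() {
  local glm_ref="$1" glm_label="$2" recipe_version="${3:-v0.6}"
  printf 'docker build --build-arg GLM_REF=%s -t jrossusgs/glm3r:%s_GLM_%s .\n' \
    "$glm_ref" "$recipe_version" "$glm_label"
}

# abc1234 is a placeholder, not a real GLM commit.
glm_build_cmd abc1234 3.2.0a3
```

Keeping the ref and the label separate lets the tag stay readable even when the build is pinned to a bare commit hash.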

hcorson-dosch-usgs commented 2 years ago

I should note that for #15 I'm running GLM locally, and using version 3.1.0a4

jordansread commented 2 years ago

Good to know. The disable_evap param has been exposed since v3.0, so it should be working the same(?) for any version at or above that. But this is a funky result and I can add some more thoughts on the other issue specific to the evap question.

hcorson-dosch-usgs commented 2 years ago

whoops wrong issue

jesse-ross commented 2 years ago

Rebuilding with 3.2.0a3 is easy to do in theory but I'm running into some snags. That version won't compile because it's trying to use a function from AquaticEcoDynamics/libplot which doesn't yet exist in the version I'm pinned to. The thing is, I don't know why I'm pinned there. I'm blindly following Alison's build script which is based on something of yours, @jread-usgs . So I stopped pinning to the old libplot and built it from the most current sources. Does this seem OK?

In any case, the test script doesn't work with v3.2.0a3, but not because of libplot:

> run_glm(sim_folder)
Cannot open default display
Unknown flag --no-gui

    -------------------------------------------------------
    |  General Lake Model (GLM)   Version 3.2.0a3         |
    -------------------------------------------------------

The --no-gui flag to the glm command is not present in 3.2.0a3. I'm not sure whether this is important, or just a problem with the test script, which is running a function from GLEON/GLM3r. In any case, I've pushed this to docker hub as jrossusgs/glm3r:v0.6_GLM_3.2.0a3 so it can be pulled into shifter on denali with shifterimg if you want to give it a try @hcorson-dosch ?

That no-gui flag is present again in 3.2.0a8, the most recent version, and the test script runs at this version. So I built an image off of that version too, jrossusgs/glm3r:v0.6_GLM_3.2.0a8.

Let me know if you want 3.2.0a6, happy to build that too, should only take a sec. The --no-gui flag which was missing from 3.2.0a3 is back by then, so the test from GLEON/GLM3r should probably work there, but I didn't build it because you had been having trouble with it.

I'll soon be committing the container recipe to this repository along with some instructions on doing the build. It's pretty easy. I am holding off on doing that for a few days, because I think we may be on the verge of a breakthrough in how we can manage containers and I may want to change the process a bit. But I'm happy to push instructions up as-is if it would be useful, i.e. if you want to build the container yourselves and don't mind the risk that the process will change.

jordansread commented 2 years ago

Thanks Jesse - my build script is similar to that, but I wasn't pinning the version of libplot. I agree that building from current sources is probably our best option at this point since there isn't a pattern of tagging in those repos.

hcorson-dosch-usgs commented 2 years ago

Great - thank you for all this work, Jesse. I'll test out those 3.2.0a3 and 3.2.0a8 shifter images on Denali with our workflow.

I think it's fine to hold off on committing the container recipe for a few days while you're reviewing best practices for managing our containers.

hcorson-dosch-usgs commented 2 years ago

Alright - so far, the model runs on Denali are failing with the 3.2.0a3 shifter image but running fine with the 3.2.0a8 image. (Jordan, the glm_code is 0, so the error function of the tryCatch() is being triggered by the max_output_date parameter, which is returning NA and so can't be extracted.) I'll try to dig more into why the 3.2.0a3 runs are failing. One note - I did have to bring over my temporary fix to the rain/snow units in order to get the 3.2.0a8 runs to succeed, which wasn't the case with version 3.1.

hcorson-dosch-usgs commented 2 years ago

Okay if I remove the line to delete the simulation directories and then manually try to run a model for one of the simulation directories, I get this error (looks like what Jesse was getting):

> GLM3r::run_glm('2_run/tmp/nhdhr_77358110_MRI_2080_2099', verbose = TRUE)
Cannot open default display
Unknown flag --no-gui

    -------------------------------------------------------
    |  General Lake Model (GLM)   Version 3.2.0a3         |
    -------------------------------------------------------

     glm built using gcc version 9.3.0
--help  : show this blurb
--nml <nmlfile> : get parameters from nmlfile
--xdisp : display temp/salt and selected others in x-window
--xdisp <plotsfile> : like --xdisp, but use <plotsfile> instead of plots.nml
--saveall : save plots to png files
--save-all-in-one : save all plots to png file
--save-all-in-one <destfile> : save all plots to png file <destfile>
--quiet   : less messages
--quiet <level> : set quiet level (1-10)
[1] 0
Warning message:
In glm.systemcall(sim_folder, glm_path, verbose, system.args) :
  Custom path to GLM executable set via 'GLM_PATH' environment variable as: /usr/local/bin/GLM/glm

jordansread commented 2 years ago

I think the model can still run without that flag, but I could be wrong. The PATH warning though makes me wonder if you are working off of a version of GLM3r prior to this PR: https://github.com/GLEON/GLM3r/pull/20. If you have pkg version 3.1.18 for GLM3r, then you are current and would include that update.

hcorson-dosch-usgs commented 2 years ago

I think the version of GLM3r must be >= 3.1.18 because I was able to run the command GLM3r::glm_version(as_char = TRUE) in both the 3.2.0a3 and 3.2.0a8 shifter images. I've been waiting more than 1.5 hours for an allocation to check, so will confirm when I get that allocation.

jesse-ross commented 2 years ago

Yes, it's GLM3r 3.1.18 for both of those images (if you have docker installed, this can be tested locally without needing to wait for HPC resources).

jross@IGSARMEWLTJROS:~$ docker run -it jrossusgs/glm3r:v0.6_GLM_3.2.0a3 Rscript -e 'packageVersion("GLM3r")'
[1] ‘3.1.18’
jross@IGSARMEWLTJROS:~$ docker run -it jrossusgs/glm3r:v0.6_GLM_3.2.0a8 Rscript -e 'packageVersion("GLM3r")'
[1] ‘3.1.18’

jordansread commented 2 years ago

ahh, yes. I see Jesse's dockerfile is building from the current/canonical GLM3r repo while Alison's was installing from a fork. All good!

hcorson-dosch-usgs commented 2 years ago

Quick update here. @jesse-ross just pushed what is hopefully a fixed version of GLM 3.2.0a3 to docker - it is called jrossusgs/glm3r:v0.6_GLM_3.2.0a3_bugfix. Thanks, Jesse! I'll give it a try when I next get an allocation on Denali.

jesse-ross commented 2 years ago

For the record, that bugfix build is defined here. For each of the libraries, I used the most recent commit which was prior to the bugfix commits to GLM and libplot. If we start using the container in production then I think we ought to move the container definitions into the main repo, but since we're still testing things it seems OK for it to stay where it is.

hcorson-dosch-usgs commented 2 years ago

I tried pulling the new image to Denali on 1/27, and got an error: [image]

At the time I was also getting an error trying to pull the docker image for 3.2.0a8, which I had pulled previously: [image]

I tried again today, and still got an error, but was again able to pull the older images:

hcorson-dosch@nid00622:/caldera/projects/usgs/water/iidd/datasci/lake-temp/lake-temperature-process-models> module load shifter
hcorson-dosch@nid00622:/caldera/projects/usgs/water/iidd/datasci/lake-temp/lake-temperature-process-models> shifterimg pull docker:jrossusgs/glm3r:v0.6_GLM_3.2.0a3_bugfix
2022-02-01T11:48:21 Pulling Image: docker:jrossusgs/glm3r:v0.6_GLM_3.2.0a3_bugfix, stat
2022-02-01T11:48:21 Pulling Image: docker:jrossusgs/glm3r:v0.6_GLM_3.2.0a3_bugfix, stat
2022-02-01T11:48:22 Pulling Image: docker:jrossusgs/glm3r:v0.6_GLM_3.2.0a3_bugfix, stat
2022-02-01T11:48:22 Pulling Image: docker:jrossusgs/glm3r:v0.6_GLM_3.2.0a3_bugfix, stat
2022-02-01T11:48:23 Pulling Image: docker:jrossusgs/glm3r:v0.6_GLM_3.2.0a3_bugfix, stat
2022-02-01T11:48:25 Pulling Image: docker:jrossusgs/glm3r:v0.6_GLM_3.2.0a3_bugfix, stat
2022-02-01T11:48:26 Pulling Image: docker:jrossusgs/glm3r:v0.6_GLM_3.2.0a3_bugfix, stat
2022-02-01T11:48:27 Pulling Image: docker:jrossusgs/glm3r:v0.6_GLM_3.2.0a3_bugfix, stat
2022-02-01T11:48:27 Pulling Image: docker:jrossusgs/glm3r:v0.6_GLM_3.2.0a3_bugfix, stat
2022-02-01T11:48:28 Pulling Image: docker:jrossusgs/glm3r:v0.6_GLM_3.2.0a3_bugfix, stat
2022-02-01T11:48:28 Pulling Image: docker:jrossusgs/glm3r:v0.6_GLM_3.2.0a3_bugfix, stat
2022-02-01T11:48:29 Pulling Image: docker:jrossusgs/glm3r:v0.6_GLM_3.2.0a3_bugfix, stat
2022-02-01T11:48:30 Pulling Image: docker:jrossusgs/glm3r:v0.6_GLM_3.2.0a3_bugfix, stat
2022-02-01T11:48:30 Pulling Image: docker:jrossusgs/glm3r:v0.6_GLM_3.2.0a3_bugfix, stat
2022-02-01T11:48:32 Pulling Image: docker:jrossusgs/glm3r:v0.6_GLM_3.2.0a3_bugfix, stat
2022-02-01T11:48:32 Pulling Image: docker:jrossusgs/glm3r:v0.6_GLM_3.2.0a3_bugfix, stat
2022-02-01T11:48:33 Pulling Image: docker:jrossusgs/glm3r:v0.6_GLM_3.2.0a3_bugfix, stat
2022-02-01T11:48:33 Pulling Image: docker:jrossusgs/glm3r:v0.6_GLM_3.2.0a3_bugfix, status: PULLINGerr 28
hcorson-dosch@nid00622:/caldera/projects/usgs/water/iidd/datasci/lake-temp/lake-temperature-process-models> shifterimg pull docker:jrossusgs/glm3r:v0.6_GLM_3.2.0a8
2022-02-01T11:48:59 Pulling Image: docker:jrossusgs/glm3r:v0.6_GLM_3.2.0a8, status: PUL
2022-02-01T11:48:59 Pulling Image: docker:jrossusgs/glm3r:v0.6_GLM_3.2.0a8, status: READY
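
Since the same pull succeeded on a later attempt, a small retry wrapper may save some manual re-running. This is a generic sketch, not something used in this thread: `retry` is a hypothetical helper, and it only helps if `shifterimg pull` exits non-zero when a pull fails, which has not been verified here.

```shell
# Run a command until it succeeds, up to max_attempts times,
# sleeping delay_seconds between attempts.
#   usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example (not run here; assumes a failed pull returns a non-zero status):
#   retry 5 60 shifterimg pull docker:jrossusgs/glm3r:v0.6_GLM_3.2.0a3_bugfix
```

The wrapper reports failure on stderr and returns 1, so a calling script can stop rather than continue with a missing image.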

hcorson-dosch-usgs commented 2 years ago

Update - was just able to pull the new image (jrossusgs/glm3r:v0.6_GLM_3.2.0a3_bugfix)!

hcorson-dosch-usgs commented 2 years ago

Hmm I'm confused. If I try to build the p2_glm_uncalibrated_runs target with 3.2.0a3, the targets all seem to error (targets error, not an error caught by our tryCatch() statements):

hcorson-dosch@nid00393:/caldera/projects/usgs/water/iidd/datasci/lake-temp/lake-temperature-process-models> shifterimg pull docker:jrossusgs/glm3r:v0.6_GLM_3.2.0a3_bugfix

2022-02-07T09:49:41 Pulling Image: docker:jrossusgs/glm3r:v0.6_GLM_3.2.0a3_bugfi
2022-02-07T09:49:42 Pulling Image: docker:jrossusgs/glm3r:v0.6_GLM_3.2.0a3_bugfix, status: READY
hcorson-dosch@nid00393:/caldera/projects/usgs/water/iidd/datasci/lake-temp/lake-temperature-process-models>
hcorson-dosch@nid00393:/caldera/projects/usgs/water/iidd/datasci/lake-temp/lake-temperature-process-models> shifter --image=docker:jrossusgs/glm3r:v0.6_GLM_3.2.0a3_bugfix /bin/bash
groups: cannot find name for group ID 1004
groups: cannot find name for group ID 1005
groups: cannot find name for group ID 1098
groups: cannot find name for group ID 5126
bash: /opt/cray/pe/modules/3.2.11.4/bin/modulecmd: No such file or directory
I have no name!@nid00393:/caldera/projects/usgs/water/iidd/datasci/lake-temp/lake-temperature-process-models$ R

R version 4.1.2 (2021-11-01) -- "Bird Hippie"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(targets)
> GLM3r::glm_version(as_char = TRUE)
Error in GLM3r::glm_version(as_char = TRUE) :
  "Version" not found in the expected message from GLM, try `as_char = FALSE`
In addition: Warning messages:
1: In glm.systemcall(sim_folder, glm_path, verbose, system.args) :
  Custom path to GLM executable set via 'GLM_PATH' environment variable as: /usr/local/bin/GLM/glm
2: In system2(glm_path, wait = TRUE, stdout = TRUE, stderr = NULL,  :
  running command ''/usr/local/bin/GLM/glm' --help 2>/dev/null' had status 139
> GLM3r::glm_version()
Segmentation fault
[1] 139
Warning message:
In glm.systemcall(sim_folder, glm_path, verbose, system.args) :
  Custom path to GLM executable set via 'GLM_PATH' environment variable as: /usr/local/bin/GLM/glm
> Sys.time()
[1] "2022-02-07 16:22:44 UTC"
>
> tar_make_clustermq(p2_glm_uncalibrated_runs, reporter='summary', workers=79)
queue | skip | start | built | error | warn | cancel |        time
    1 | 4873 |     0 |     0 |   959 |  959 |      0 | 16:27 03.36             Master: [242.1s 103.5% CPU]; Worker: [avg 14.4% CPU, max 922.5 Mb]
    0 | 4873 |     0 |     0 |   959 |  959 |      0 | 16:27 03.69

But then I tested a few failed runs directly, and got mixed results. We aren't seeing the Unknown flag --no-gui error we were seeing previously, which is good. But it's odd that both models ran to completion, yet one returned a segmentation fault (status 139) while the other returned a successful code 0:

> GLM3r::run_glm('2_run/tmp/simulations/nhdhr_86443989_MIROC5_2080_2099', verbose=TRUE)
Cannot open default display

    -------------------------------------------------------
    |  General Lake Model (GLM)   Version 3.2.0a3         |
    -------------------------------------------------------

     glm built using gcc version 9.3.0
     build date 20220127-2249UTC

     Reading configuration from glm3.nml

     nDays= 150; timestep= 3600.000000 (s)
     NOTE: values for crest_elev not provided, assuming max elevation, H[bsn]
     Maximum lake depth is 5.000000
     Depth where flow will occur over the crest is 5.000000
     VolAtCrest= 3046924.55089; MaxVol= 3046924.55089 (m3)
     No 'sediment' section, turning off sediment heating
     WARNING: Initial profiles problem - expected 0 wd_init_vals entries but got 12

     Wall clock start time :  Mon Feb  7 16:37:15 2022

     Simulation begins...
     Running day  2488257, 100.00% of days complete

     Wall clock finish time : Mon Feb  7 16:37:19 2022
     Wall clock runtime was 4 seconds : 00:00:04 [hh:mm:ss]

    Model Run Complete
    -------------------------------------------------------

Segmentation fault
[1] 139
Warning message:
In glm.systemcall(sim_folder, glm_path, verbose, system.args) :
  Custom path to GLM executable set via 'GLM_PATH' environment variable as: /usr/local/bin/GLM/glm
> GLM3r::run_glm('2_run/tmp/simulations/nhdhr_114336515_ACCESS_2080_2099', verbose=TRUE)
Cannot open default display

    -------------------------------------------------------
    |  General Lake Model (GLM)   Version 3.2.0a3         |
    -------------------------------------------------------

     glm built using gcc version 9.3.0
     build date 20220127-2249UTC

     Reading configuration from glm3.nml

     nDays= 150; timestep= 3600.000000 (s)
     NOTE: values for crest_elev not provided, assuming max elevation, H[bsn]
     Maximum lake depth is 1.000000
     Depth where flow will occur over the crest is 1.000000
     VolAtCrest= 20263.25326; MaxVol= 20263.25326 (m3)
     No 'sediment' section, turning off sediment heating
     WARNING: Initial profiles problem - expected 0 wd_init_vals entries but got 12

     Wall clock start time :  Mon Feb  7 16:40:57 2022

     Simulation begins...
     Running day  2488257, 100.00% of days complete

     Wall clock finish time : Mon Feb  7 16:40:59 2022
     Wall clock runtime was 2 seconds : 00:00:02 [hh:mm:ss]

    Model Run Complete
    -------------------------------------------------------

[1] 0
Warning message:
In glm.systemcall(sim_folder, glm_path, verbose, system.args) :
  Custom path to GLM executable set via 'GLM_PATH' environment variable as: /usr/local/bin/GLM/glm
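
For reading those return codes: by shell convention, a status above 128 usually means the process was killed by signal number (status - 128), so 139 corresponds to signal 11, SIGSEGV. In the first run above, the log prints "Model Run Complete" before the fault, so the crash appears to come after the simulation itself finished. A small sketch for decoding a status (`describe_exit_status` is a hypothetical helper name):

```shell
# Translate a process exit status into a short description.
# Statuses above 128 are treated as "killed by signal (status - 128)".
describe_exit_status() {
  local status="$1"
  if [ "$status" -eq 0 ]; then
    echo "success"
  elif [ "$status" -gt 128 ]; then
    local sig=$((status - 128))
    # kill -l <number> prints the signal name without the SIG prefix.
    echo "killed by signal $sig (SIG$(kill -l "$sig"))"
  else
    echo "exited with error code $status"
  fi
}

describe_exit_status 139
describe_exit_status 0
```

This only interprets the number; it says nothing about where in GLM the fault occurs.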

jesse-ross commented 2 years ago

@hcorson-dosch Just now seeing this - yuck! I wonder if there might be something wrong with the build. I tried to use the versions of all of the dependencies that would have been current at the time that it was committed, but the person who compiled/tested it might have had some older versions.

I am not sure what to try next. Two thoughts come to mind.

  1. I'm not clear about the status of the 3.2.0a8 image. Does it work?
  2. @jread-usgs do you still have your build environment for the 3.2.0a3 build you used for the most recent GCM projections? If so, we could get the exact commits you had for the dependencies, and use them.

hcorson-dosch-usgs commented 2 years ago

Yes, the 3.2.0a8 image does work, which is great. I think (and @jread-usgs correct me if I'm wrong) we were interested in running 3.2.0a3 because a) Jordan used it for the projections work he did last summer, and b) Jordan has previously seen issues with the latest dev versions of GLM, so thought it would be worth testing with the older 3.2.0a3 to see if it resolves any of the run failures we're currently getting with 3.2.0a8.

wdwatkins commented 2 years ago

I was using this image in a model archive example, and to rebuild it I had to roll back a few commits for libaed-water, adding this line to the Dockerfile after the clone:

cd libaed-water && git reset --hard df2f372916f79a3d573b54bb1ece551354c97680 && cd .. && \

I see there are a few releases in that repo now, might be best to just clone one of those if this image is going to be used longer term.

jesse-ross commented 2 years ago

At this point I'm guessing many of those libraries and packages will have moved along in various ways. If we want the build to be stable over time, and not just have a working image as an artifact of the time the build worked on 2022-06-15, we would probably want to fully specify the other stuff in addition to libaed-water, i.e.

This might take a few hours to do, because it's tricky to find the right commits for the AquaticEcoDynamics libraries, and I'm not sure whether it's worth it or not.

A simpler solution which might offer better stability would be to just build a more recent version altogether. All of the AquaticEcoDynamics repositories now have a v3.3.0 tag, which suggests a coordinated release at a known-working state. @hcorson-dosch and @lindsayplatt, has the current jrossusgs/glm3r:v0.7.1 image (a.k.a. glm3r_v0.7.1.sif) been working for you, or is it still buggy? Are there features you want in GLM 3.3.0, or known problems which would keep you from using it?
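
If the coordinated tag route is taken, pinning every clone to the same tag could look roughly like the sketch below. It is a dry run that only prints the commands. `print_pinned_clones` is a hypothetical helper, and the repository list is an assumption: libplot and libaed-water are named in this thread, while `GLM` as a repository name and any further dependencies would need to be checked against the actual Dockerfile.

```shell
# Dry run: print one shallow-clone command per repository, all pinned to the
# same tag. git clone --branch accepts a tag name as well as a branch name.
#   usage: print_pinned_clones <tag> <repo> [repo...]
print_pinned_clones() {
  local tag="$1"
  shift
  local repo
  for repo in "$@"; do
    printf 'git clone --depth 1 --branch %s https://github.com/AquaticEcoDynamics/%s.git\n' \
      "$tag" "$repo"
  done
}

# Repository names here are illustrative; verify them before building.
print_pinned_clones v3.3.0 GLM libplot libaed-water
```

Cloning a tag replaces the clone-then-reset step and makes the pinned state visible in the recipe itself.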