DOI-USGS / lake-temperature-lstm-static

Predict lake temperatures at depth using static lake attributes

Tallgrass GPU bug fixes #29

Closed AndyMcAliley closed 2 years ago

AndyMcAliley commented 2 years ago

This PR addresses issues that arose when trying to train a model using a GPU on Tallgrass. Closes #30. The changes can be organized into four categories:

  1. Training data and validation data are now on the same device (CPU or GPU) as the model.
    • Shouldn't need much attention. This issue caused a show-stopping error, and now it doesn't.
    • I could ask Jeremy how he's handled this after the temperature sprint ends.
  2. Missing packages added to conda environment
    • The weird addition here is git. Conda will now install the git executable so that the git status can be logged. It turns out that compute nodes don't have git installed on them. I'm open to other options, but this does work!
  3. Slurm scripts for both CPU and GPU have been written
    • I'd appreciate eyes here - I'm not fluent in Slurm scripts, plus there's Snakemake in the mix.
    • I adapted the command in Jeff's blog for the snakemake part.
  4. Miscellaneous minor changes
    • Add setup.sh to set up the environment easily
    • Git ignore the log folder where slurm output is directed
    • Add flush=True to Python print() calls to flush stdout and keep the logs up to date while the model is training. Otherwise it's hard to track progress during training.
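
For reviewers who want a concrete picture of categories 1, 2, and 4, here is a minimal sketch. It assumes the model is a PyTorch module, which this description does not state outright. The names `log_git_status` and `train`, the toy `nn.Linear` model, and the random tensors are all hypothetical and are not taken from this repository's code. The sketch moves training and validation tensors to whatever device the model lives on, logs the git status while tolerating a missing git executable, and prints with `flush=True` so Slurm log files update as training runs.

```python
import subprocess

import torch
from torch import nn


def log_git_status():
    """Return `git status --short` output, or a note if git is unavailable."""
    try:
        result = subprocess.run(
            ["git", "status", "--short"],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        # This is the situation described for compute nodes without git.
        return "git executable not found"
    if result.returncode != 0:
        return "git status failed: " + result.stderr.strip()
    return result.stdout.strip()


def train(model, x_train, y_train, x_valid, y_valid, n_epochs=3):
    """Hypothetical training loop; returns the final validation loss."""
    # Use the device the model's parameters are on, then move all data there.
    device = next(model.parameters()).device
    x_train, y_train = x_train.to(device), y_train.to(device)
    x_valid, y_valid = x_valid.to(device), y_valid.to(device)

    loss_fn = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
    valid_loss = float("nan")
    for epoch in range(n_epochs):
        model.train()
        optimizer.zero_grad()
        loss = loss_fn(model(x_train), y_train)
        loss.backward()
        optimizer.step()

        model.eval()
        with torch.no_grad():
            valid_loss = loss_fn(model(x_valid), y_valid).item()

        # flush=True pushes each line to stdout immediately, so a redirected
        # log file shows progress during training rather than only at exit.
        print(
            f"epoch {epoch}: train={loss.item():.4f} valid={valid_loss:.4f}",
            flush=True,
        )
    return valid_loss


# Fall back to the CPU when no GPU is visible to the process.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(4, 1).to(device)

print(log_git_status(), flush=True)
final_valid_loss = train(
    model,
    torch.randn(32, 4),
    torch.randn(32, 1),
    torch.randn(8, 4),
    torch.randn(8, 1),
)
```

Reading the device from the model's parameters, rather than passing it separately, is one way to keep data and model from drifting onto different devices. The actual fix in this PR may be structured differently.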

How to review this PR

This one's a hodgepodge! Since it consists mainly of bug fixes that were needed to get a model out, it's been through a round of informal testing. Don't worry if there are parts that aren't in your wheelhouse - feel free to focus on the aspects of the PR that you can easily review, and let me know if there are aspects that you skim over.

Where in the code to focus

Feedback on the Slurm scripts would be especially welcome!

AndyMcAliley commented 2 years ago

Yes, @hcorson-dosch, I'm seeing 2x comments on both this pull request and #32