DOI-USGS / lake-temperature-out

outputs and summaries from lake modeling pipelines
GNU General Public License v3.0

Transfer all (or parts) of this workflow to Yeti #36

Closed · jordansread closed this issue 3 years ago

jordansread commented 4 years ago

The TOHA and thermal metric calculations are not scalable on a local machine, so we need this workflow to support running these metrics on Yeti (or Denali or pangeo).

Challenges:

The most straightforward way I can think of is to use interactive mode on Yeti's normal partition with a preset number of cores (40?), initialize an R session, authenticate the cidamanager account with sbtools, and run the task table jobs with loop_tasks(..., n_cores = 40). Details below:

getting a new working directory on Yeti and cloning a repo [note that I've done this already to save time]

installing needed packages [note that I've done this already to save time]

use vim .Renviron to create and edit an R environment file, which specifies where we want to install the packages. In vim, type "i" for insert mode and type the library location. I used this: R_LIBS=/cxfs/{full_path_here}/lake-temperature-process-models/Rlib_3_6. Then press "Esc" and type ":wq" to write the file and quit vim.
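
Once .Renviron exists, a quick way to confirm R is actually picking up the custom library location (a minimal sketch; start R from the project directory so the file is found):

# confirm the R_LIBS value from .Renviron is set and on the library search path
Sys.getenv("R_LIBS")
.libPaths()   # the Rlib_3_6 directory should be listed first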

in terminal:

module load legacy R/3.6.3 tools/nco-4.7.8-gnu tools/netcdf-c-4.3.2-intel

Then

R

to start R

# In R now, install packages we'll need and also remotes so we can install github packages, like scipiper and remake
install.packages(c('remotes','tidyverse','feather','sbtools','jsonlite', "R6", "yaml", "digest", "crayon", "optparse", "storr"))

remotes::install_github('richfitz/remake')
remotes::install_github('USGS-R/scipiper')

using an interactive session:

make sure you belong to the watertemp group, or use iidd (or cida) in its place below:

salloc -A watertemp -n 4 -p normal -t 7:00:00

this ☝️ gives you 4 cores on normal for 7 hours. You probably want way more than 4, but this is a start. Then ssh into the node you are given, and from there, go to the working directory

ssh n3-98
cd /cxfs/{full_path_here}/lake-temperature-process-models/

Load modules as before:

module load legacy R/3.6.3 tools/nco-4.7.8-gnu tools/netcdf-c-4.3.2-intel

then start R

R

Now you are in R but on a big cluster, so the number of cores you have available is much greater than on your own machine (unless you only asked for 4 cores...)

library(scipiper)
sbtools::authenticate_sb('cidamanager')
scmake('3_summarize/out/annual_metrics_pgdl.csv')

but the code will need to be modified so that it uses loop_tasks() and lets you specify how many cores loop_tasks() runs on...
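
For reference, a rough sketch of what that could look like (argument names follow scipiper::loop_tasks(); the task plan object and makefile name here are hypothetical placeholders, not this repo's actual targets):

library(scipiper)

# task_plan would be built earlier with create_task_plan()/create_task_makefile();
# '3_summarize_tasks.yml' is a made-up makefile name for illustration
loop_tasks(
  task_plan = task_plan,
  task_makefile = '3_summarize_tasks.yml',
  n_cores = 40   # match this to the size of the salloc request
)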

If the job fails or you are kicked off Yeti, no worries, as remake/scipiper will pick back up where you left off in the task table. 🎉

jordansread commented 4 years ago

Note that the only reason we need interactive mode (vs sbatch) is because we need to authenticate to ScienceBase. If we had another way of doing that, we could run all of this without needing to be interactive. Or we could do certain phases interactively (like download).
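
One possible non-interactive option (just a sketch, not tested on Yeti): read the ScienceBase credentials from environment variables set outside the job, so nothing needs a prompt. SB_USER and SB_PASS below are made-up variable names.

library(sbtools)

sb_user <- Sys.getenv("SB_USER")   # hypothetical environment variable names
sb_pass <- Sys.getenv("SB_PASS")

if (nzchar(sb_user) && nzchar(sb_pass)) {
  authenticate_sb(sb_user, sb_pass)
} else {
  stop("ScienceBase credentials not found in the environment")
}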

aappling-usgs commented 4 years ago

make sure permissions allow other team members to work in this directory (I don't really know how to do this reliably...)

I emailed hpc@usgs.gov about this issue in October. Here's what I wrote:

I’ve been collaborating with Hayley Corson-Dosch (cc’ed) in the /caldera/projects/usgs/water/iidd/datasci/lake-temp/lake-temperature-neural-networks directory. We’re each editing files and running scripts that create files in that directory. It seems that most/all times, when one of us makes a change, the permissions change for the other of us. For example, I was the first person to work in this directory, and when Hayley started working there, I had to add write permissions for her to be able to successfully run R scripts that modified those files. Today, I modified some files that Hayley had touched more recently, and (1) vim told me they were read-only, so I had to force-write them, and then (2) after I force-wrote them, Hayley no longer had write access until I added w permissions to both g (group) and o (other) with chmod.

Are these conflicts and changes expected? Can you demystify any of their logic for us? Even better, is there any way for us to tell linux just to let either of us modify any of these files at any time?

And here's what Brad Williams said:

The issue is that when you create a directory or file, linux uses your umask to determine the permissions. DOI Security requires us to set the default umask to '0022' which will result in rw for owner, and read for the group. If you want the group to have write permissions the owner will need to run

chmod g+w <folder or filename>

We have set the top level folder to use the groupid of the parent folder so new folders will keep the group. (You will notice files/folders in your home directory are assigned to the group 'users'.)

There really isn't a good solution for this. One way to accomplish it would be for users to set their umask to 002:

umask 002

That would then set 775 for everything they do. This can be dangerous and is not recommended for long-term use. The umask will revert to 022 the next time they log in.

If directories / files are created by a script you could set the umask in the script

Hope that helps.
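
For files created from an R script specifically, the same idea can go inside the script itself (a minimal sketch using base R's Sys.umask(); the output path is a made-up example):

# relax the umask for anything this script creates, then restore the default
old_umask <- Sys.umask("002")                                  # group gets write access
dir.create("3_summarize/tmp_example", showWarnings = FALSE)    # hypothetical path
saveRDS(mtcars, "3_summarize/tmp_example/example.rds")
Sys.umask(old_umask)                                           # back to the DOI default (022)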

jordansread commented 4 years ago

Thanks @aappling-usgs !

aappling-usgs commented 4 years ago

Or we could do certain phases w/ interactive (like download).

It would indeed be good HPC citizenship to ssh to yeti-dtn.cr.usgs.gov to do the interactive SB login and data transfers. (Or ssh username@caldera-dtn.cr.usgs.gov if using Denali.)

lindsayplatt commented 4 years ago

It would indeed be good HPC citizenship to ssh to yeti-dtn.cr.usgs.gov to do the interactive SB login and data transfers.

@aappling-usgs could you explain this further? I followed the pattern of ssh to yeti.cr.usgs.gov, salloc, ssh to node, and then start up R. What should I be doing differently?

lindsayplatt commented 4 years ago

Progress from today:

> remotes::install_github("USGS-R/mda.lakes")
Error: Failed to install 'unknown package' from GitHub:
  Peer's Certificate issuer is not recognized.

> remotes::install_github("USGS-R/scipiper")
Error: Failed to install 'scipiper' from GitHub:
  Peer's Certificate issuer is not recognized.

aappling-usgs commented 4 years ago

It would indeed be good HPC citizenship to ssh to yeti-dtn.cr.usgs.gov to do the interactive SB login and data transfers.

@aappling-usgs could you explain this further? I followed the pattern of ssh to yeti.cr.usgs.gov, salloc, ssh to node, and then start up R. What should I be doing differently?

Just ssh yeti-dtn.cr.usgs.gov instead of ssh yeti.cr.usgs.gov, and then don't use salloc to get a node - just run the data-getting processes right on the yeti-dtn1 or yeti-dtn2 node (whichever you're given). You'll need to run module load legacy R/3.6.3 and be in the project directory to get your Rlibs. And then run R interactively so you can do the ScienceBase login (sbtools::authenticate_sb()).

aappling-usgs commented 4 years ago

"Peer's Certificate issuer is not recognized." is about that same old DOI SSL interception thing: https://github.com/usgs/best-practices/blob/master/ssl/WorkingWithinSSLIntercept.md. I forget how to deal with it, though.
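
One pattern that might help (a sketch only; the certificate path is a placeholder and this isn't verified on Yeti) is to point R's HTTP clients at a CA bundle that includes the DOI root certificate:

# placeholder path to a CA bundle that includes the DOI root certificate
doi_ca <- "/path/to/DOIRootCA2.pem"
Sys.setenv(CURL_CA_BUNDLE = doi_ca)               # picked up by curl-based downloads
httr::set_config(httr::config(cainfo = doi_ca))   # httr/sbtools requests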

jordansread commented 4 years ago

@lindsayplatt wasn't scipiper already installed? I can look into mda.lakes. That one is a pain to install on Yeti

lindsayplatt commented 4 years ago

I tried scipiper as a test because I knew it was already installed. I also hit that SSL issue when trying to install rLakeAnalyzer and foreach.

lindsayplatt commented 4 years ago

@aappling-usgs if I am not using salloc, how do I know how many cores I have available?

jordansread commented 4 years ago

I would not recommend using multiple cores for the download step. I think you should run it single-threaded from the DTN (data transfer node), as Alison suggested. Once the data are local on Yeti, spreading the jobs out across different nodes is :+1:
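
To spell that out a bit (the target and makefile names below are hypothetical placeholders, not this repo's actual targets):

# on yeti-dtn (data transfer node): authenticate and fetch inputs, single-threaded
library(scipiper)
sbtools::authenticate_sb('cidamanager')
scmake('1_fetch/out/downloaded_data.ind')   # hypothetical download target

# later, on a compute node from salloc: run the heavy per-lake tasks in parallel
loop_tasks(task_plan, '2_process_tasks.yml', n_cores = 40)   # hypothetical names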

Perhaps a quick chat to clarify this concept would be useful?

lindsayplatt commented 4 years ago

Yes, I think so. I don't think I am following

jordansread commented 4 years ago

Related to the cert: I am able to install, so I am just doing the installs for the missing packages. I don't understand the cert issue, but this is a way around it for the time being.

jordansread commented 4 years ago

Note to future self: I needed this module set for the ncdf4 install (it was failing with the other intel netcdf module).

module list

Currently Loaded Modules:
  1) prun/1.3     3) openmpi3/3.1.4      5) ohpc     7) openblas/0.3.7   9) tools/netcdf-c-4.6.2-gnu
  2) gnu8/8.3.0   4) slurm_scripts/0.4   6) legacy   8) R/3.6.3

lindsayplatt commented 4 years ago

I just did a TOHA test run using n_cores = 4 with loop_tasks, and it worked for 4 lakes! They all finished within 30 seconds of each other. I'll work on getting a PR up for TOHA and then run it.

lindsayplatt commented 4 years ago

When we go to actually run this, should we request more than just 4 cores? 7K lakes at 5 minutes per lake (but with 4 going at a time) is still ~145 hours... although running the 7K lakes isn't happening just yet. With 4 cores, the 881 lakes will take about 20 hrs.
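
As a back-of-the-envelope check on those numbers (assuming ~5 minutes per lake and perfect scaling across cores):

mins_per_lake <- 5

7000 * mins_per_lake / 4 / 60    # ~146 hours for 7K lakes on 4 cores
 881 * mins_per_lake / 4 / 60    # ~18 hours for 881 lakes on 4 cores
 881 * mins_per_lake / 40 / 60   # ~1.8 hours for 881 lakes on 40 cores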

jordansread commented 4 years ago

Yes, you'd want to request many more than 4. I'd hope that this is faster than 5 mins per lake if you are on UV (perhaps 2 mins?) but either way, 40 cores would be a good start. Did you catch how long the jobs took on UV this time around?

lindsayplatt commented 4 years ago

I didn't

lindsayplatt commented 4 years ago

Friday, I kicked off scmake("2_process/out/2_process_lake_tasks.ind") which creates TOHA for pb0 data. In 2 hours, there were 877 2_process/tmp/pb0_toha_{lake id}.csv files. I thought that there were supposed to be 881, so I am investigating which are missing right now.

lindsayplatt commented 4 years ago

OK, ignore the above. I had completed the other 4 in a quick test of using n_cores and forgot to count those in that number. All 881 were completed in ~2 hrs.

lindsayplatt commented 4 years ago

All 881 annual metrics files for pb0 were built in 2 hrs using 40 cores. See https://github.com/USGS-R/lake-temperature-out/pull/37

lindsayplatt commented 3 years ago

^ I have since added about double the number of temperature ranges used in the thermal metrics, which increased the time each lake takes to process. Figuring out a faster way to handle the metrics that output a tibble and must be unpacked would help speed this up.

Thermal metrics and TOHA are set up to work with 40 cores. Plus, there is an additional pipeline yaml that will automatically upload to ScienceBase. Calling this completed; enhancements will be filed elsewhere.