Note that the only reason we need interactive mode (vs `sbatch`) is because we need to authenticate to ScienceBase (sb). If we had another way of doing that, we could run all of this without needing to be interactive. Or we could do certain phases w/ interactive (like download).
make sure permissions allow other team members to work in this directory (I don't really know how to do this reliably...)
I emailed hpc@usgs.gov about this issue in October. Here's what I wrote:
I’ve been collaborating with Hayley Corson-Dosch (cc’ed) in the /caldera/projects/usgs/water/iidd/datasci/lake-temp/lake-temperature-neural-networks directory. We’re each editing files and running scripts that create files in that directory. It seems that most/all times, when one of us makes a change, the permissions change for the other of us. For example, I was the first person to work in this directory, and when Hayley started working there, I had to add write permissions for her to be able to successfully run R scripts that modified those files. Today, I modified some files that Hayley had touched more recently, and (1) vim told me they were read-only, so I had to force-write them, and then (2) after I force-wrote them, Hayley no longer had write access until I added `w` permissions to both `g` (group) and `o` (other) with `chmod`.
Are these conflicts and changes expected? Can you demystify any of their logic for us? Even better, is there any way for us to tell linux just to let either of us modify any of these files at any time?
And here's what Brad Williams said:
The issue is that when you create a directory or file, linux uses your umask to determine the permissions. DOI Security requires us to set the default umask to '0022' which will result in rw for owner, and read for the group. If you want the group to have write permissions the owner will need to run
chmod g+w <folder or filename>
We have set the top level folder to use the groupid of the parent folder so new folders will keep the group (you will notice files/folders in your home directory are assigned to the group 'users').
There really isn't a good solution for this. One way to accomplish this would be for users to set their umask to 002 (`umask 002`). That would then set 775 for everything they do. This can be dangerous and not recommended for long term use. The umask will revert to 022 the next time they login.
If directories / files are created by a script you could set the umask in the script
Hope that helps.
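A minimal sketch of those two options (the directory path is the one from the email above; the recursive flag on chmod is my addition):

```bash
# Option 1: relax the umask for this shell session (or set it inside a script)
# so new files and directories come out group-writable (775/664); reverts to 0022 at next login
umask 002

# Option 2: after the fact, the owner grants group write on what's already there
chmod -R g+w /caldera/projects/usgs/water/iidd/datasci/lake-temp/lake-temperature-neural-networks
```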
Thanks @aappling-usgs !
> Or we could do certain phases w/ interactive (like download).
It would indeed be good HPC citizenship to `ssh` to yeti-dtn.cr.usgs.gov to do the interactive SB login and data transfers. (Or `ssh username@caldera-dtn.cr.usgs.gov` if using Denali.)
> It would indeed be good HPC citizenship to ssh to yeti-dtn.cr.usgs.gov to do the interactive SB login and data transfers.
@aappling-usgs could you explain this further? I followed the pattern of ssh to yeti.cr.usgs.gov, salloc, ssh to node, and then start up R. What should I be doing differently?
Progress from today:
scmake("1_fetch")
without issue.process.yml
to loop through only one task name for pb0 TOHA. Ran scmake("2_process/out/2_process_lake_tasks.ind")
to kick off build. Hit snag with missing packages during build - mda.lakes, rLakeAnalyzer, foreach.> remotes::install_github("USGS-R/mda.lakes")
Error: Failed to install 'unknown package' from GitHub:
Peer's Certificate issuer is not recognized.
> remotes::install_github("USGS-R/scipiper")
Error: Failed to install 'scipiper' from GitHub:
Peer's Certificate issuer is not recognized.
> It would indeed be good HPC citizenship to ssh to yeti-dtn.cr.usgs.gov to do the interactive SB login and data transfers.
>
> @aappling-usgs could you explain this further? I followed the pattern of ssh to yeti.cr.usgs.gov, salloc, ssh to node, and then start up R. What should I be doing differently?
Just `ssh yeti-dtn.cr.usgs.gov` instead of `ssh yeti.cr.usgs.gov`, and then don't use `salloc` to get a node - just run the data-getting processes right on the yeti-dtn1 or yeti-dtn2 node (whichever you're given). You'll need to run `module load legacy R/3.6.3` and be in the project directory to get your Rlibs. And then run R interactively so you can do the `sblogin()`.
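Putting that together, a sketch of the DTN session (the project path is a placeholder):

```bash
# data transfer node: no salloc; you land directly on yeti-dtn1 or yeti-dtn2
ssh yeti-dtn.cr.usgs.gov

# on the DTN: load R, move to the project directory (so your Rlibs are found), and start R
module load legacy R/3.6.3
cd /path/to/lake-temperature-out   # placeholder project directory
R
# then, inside R: sblogin(), followed by the download targets, e.g. scmake("1_fetch")
```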
`Peer's Certificate issuer is not recognized.` is about that same old DOI SSL interception thing: https://github.com/usgs/best-practices/blob/master/ssl/WorkingWithinSSLIntercept.md. I forget how to deal with it, though.
@lindsayplatt wasn't `scipiper` already installed? I can look into `mda.lakes`. That one is a pain to install on Yeti.
I tried scipiper as a test because I knew it was already installed. I also hit that SSL issue when trying to install `rLakeAnalyzer` and `foreach`.
@aappling-usgs if I am not using `salloc`, how do I know how many cores I have available?
I would not recommend using multiple cores for the download step. I think you should run it single-threaded from the DTN (data transfer node), as Alison suggested. Once the data are local on Yeti, spreading the jobs out to different nodes is :+1:
Perhaps a quick chat to clarify this concept would be useful?
Yes, I think so. I don't think I am following
Related to the cert - I am able to install, so I am just doing the installs for the missing packages. I don't understand the cert issue but this is a way around it for the time being
Note to future self that I needed this module set for the `ncdf4` install (it was failing with the other intel netcdf module):
module list
Currently Loaded Modules:
1) prun/1.3 3) openmpi3/3.1.4 5) ohpc 7) openblas/0.3.7 9) tools/netcdf-c-4.6.2-gnu
2) gnu8/8.3.0 4) slurm_scripts/0.4 6) legacy 8) R/3.6.3
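For repeatability, a sketch of the load-then-install sequence implied by that module list (the order and the use of `Rscript`/an explicit CRAN mirror are my assumptions):

```bash
# use the gnu-built netcdf library rather than the intel one when building ncdf4
module load legacy R/3.6.3
module load tools/netcdf-c-4.6.2-gnu
Rscript -e 'install.packages("ncdf4", repos = "https://cran.r-project.org")'
```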
I just did a test run for TOHA using `n_cores = 4` with `loop_tasks` and it worked for 4 lakes! Each finished within 30 seconds of the others. I'll work on getting a PR up for TOHA and then run it.
When we go to actually run this, should we request more than just 4 cores? 7K lakes at 5 minutes per lake (but with 4 going at a time) is still 145 hours... although running the 7K lakes isn't happening just yet. With 4 cores, the 881 lakes will take about 20 hrs.
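The back-of-envelope math behind those estimates (assuming ~5 minutes per lake and a perfect 4-way spread):

```bash
# hours = n_lakes * minutes_per_lake / 60 / n_cores
echo "7000 * 5 / 60 / 4" | bc -l   # ~145.8 hours for the eventual ~7K lakes
echo "881 * 5 / 60 / 4"  | bc -l   # ~18.4 hours (~20 hrs) for the current 881 lakes
```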
Yes, you'd want to request many more than 4. I'd hope that this is faster than 5 mins per lake if you are on UV (perhaps 2 mins?) but either way, 40 cores would be a good start. Did you catch how long the jobs took on UV this time around?
I didn't
Friday, I kicked off `scmake("2_process/out/2_process_lake_tasks.ind")`, which creates TOHA for pb0 data. In 2 hours, there were 877 `2_process/tmp/pb0_toha_{lake id}.csv` files. I thought that there were supposed to be 881, so I am investigating which are missing right now.
OK, ignore the above. I had completed the other 4 in a quick test of using `n_cores` and forgot to count those in that number. All 881 were completed in ~2 hrs.
All 881 annual metrics files for pb0 were built in 2 hrs using 40 cores. See https://github.com/USGS-R/lake-temperature-out/pull/37
^ I have since added about double the number of temperature ranges used in the thermal metrics, which increased the time each lake takes to process. I think that figuring out a faster way to do any of the metrics that output a tibble and must get `unpack`ed would help speed things up.
Thermal metrics and TOHA are set up to work with 40 cores. Plus, there is an additional pipeline YAML that will automatically upload to ScienceBase. Calling this completed and will put enhancements elsewhere.
The TOHA and thermal metric calculations are not scalable on a local machine, so we need this workflow to support running these metrics on Yeti (or Denali or pangeo).
Challenges:
The most straightforward way I can think of for this is to use interactive mode on Yeti on `normal` with a preset number of nodes (40?), initialize an R session, authenticate the `cidamanager` account to sbtools, and run the task table jobs with `loop_tasks(..., n_cores = 40)`. Details below:
- getting a new working directory on Yeti and cloning a repo [note that I've done this already to save time]
- installing needed packages [note that I've done this already to save time]
Use `vim .Renviron` to create and edit an R environment file, which will be used to specify where we want to install the packages. In vim, type "i" for insert mode and type the library location. I used this: `R_LIBS=/cxfs/{full_path_here}/lake-temperature-process-models/Rlib_3_6`. Then "esc" and ":wq" to write this file and quit vim.
In terminal, load the modules and then start R (a sketch follows below).
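A sketch of that terminal step, using the same modules named elsewhere in this thread:

```bash
# load R (via the legacy module tree); packages installed from this session
# will go to the R_LIBS location set in .Renviron above
module load legacy R/3.6.3
R
# once in R, install the needed packages, e.g. remotes::install_github("USGS-R/mda.lakes")
```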
Using interactive: make sure you belong to the `watertemp` group -or- use `iidd` (or `cida`) in place of it in the allocation request (sketched below). That request gives you 4 cores on normal for 7 hours; you probably want way more than 4, but this is a start. Then ssh into the node you are given, and from there, go to the working directory.
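A sketch of that allocation step. The Slurm flags are my reconstruction of a request for 4 cores on `normal` for 7 hours under the `watertemp` account (swap in `iidd` or `cida` as noted), and the node name is a placeholder:

```bash
# interactive allocation: 4 cores, normal partition, 7 hours, watertemp account
salloc -A watertemp -p normal -n 4 -t 7:00:00

# salloc reports the node you were granted; ssh to it and go to the project directory
ssh <node-name>                                               # placeholder: whatever salloc printed
cd /cxfs/{full_path_here}/lake-temperature-process-models    # working directory from the .Renviron step
```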
Load modules as before (`module load legacy R/3.6.3`), then start R with `R`.
Now you are in R but on a big cluster, so the number of cores you have available is much greater than on your own machine (unless you only asked for 4 cores...)
But the code will need to be modified so that you use `loop_tasks` and also so you can specify the number of cores `loop_tasks` is using...
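A sketch of what kicking off the parallelized build then looks like from the allocated node, reusing the 2_process target mentioned earlier in this thread (`loop_tasks()` and its `n_cores` argument are set inside that pipeline's code, not called directly here):

```bash
# run the task-table target non-interactively; scipiper's loop_tasks() inside the
# 2_process step fans the per-lake jobs out across the cores you requested
R --no-save <<'EOF'
library(scipiper)
scmake("2_process/out/2_process_lake_tasks.ind")
EOF
```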
If the job fails or you are kicked off Yeti, no worries, as remake/scipiper will pick back up where you left off in the task table. 🎉