nikolayilyin opened 1 year ago
sacctmgr show association -p user=$USER
- show the accounts, partitions and QoSs available to the current user
sacctmgr show qos <qos name>
- show information about the QoS
sinfo --long --partition=lr_bigmem
- show information about the partition
sinfo -N --long --partition=lr_bigmem
- show information about the nodes of the selected partition
The list of partitions is at https://it.lbl.gov/resource/hpc/lawrencium/
srun <shell script> # will run a job interactively
sbatch <shell script> # will run a job in the background
squeue # check the current jobs in the batch queue system
i.e. squeue -u $USER
to show jobs for the current user only
sinfo # view the current status of the queues
scancel <job id> # cancel a job
To see all jobs (completed/cancelled/running) for a user starting from some date:
sacct -u $USER --format=JobID,JobName%30,state,start,end,elapsed,nnodes,ncpus,nodelist,user,partition,maxrss,maxvmsize,time -S 2023-03-2
Specifying SLURM parameters inside the shell script itself (via #SBATCH comments) might not work; in that case it is possible to pass them directly on the command line. For example, to run the test-simple-job.sh script on partition ood_inter as account pc_beamcore:
srun -p ood_inter -A pc_beamcore test-simple-job.sh
srun documentation is here
The shell file containing the job should be marked as executable (chmod +x)
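For reference, a minimal job script using #SBATCH directives might look like the sketch below. The partition and account names are just the illustrative values from the example above; the time limit, node count, and job name are placeholder assumptions to adjust as needed:

```shell
#!/bin/bash
#SBATCH --partition=ood_inter      # partition from the example above
#SBATCH --account=pc_beamcore      # account from the example above
#SBATCH --time=0-01:00:00          # assumed wall time limit: 1 hour
#SBATCH --nodes=1                  # single node
#SBATCH --job-name=test-simple-job # placeholder job name

# The actual work of the job; just a placeholder command here
echo "Job running on $(hostname)"
```

Submit it with `sbatch test-simple-job.sh`. Since the #SBATCH lines are ordinary comments, the same file also runs directly via srun or as a plain shell script.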
Storage:
- default user directory /global/home/users/$USER, limited to 20G
- user temporary storage /global/scratch/users/$USER has no disk quota; it is intended for short-term use and should be considered volatile. Backups are not performed on this file system, and data is subject to a periodic purge policy: any files not accessed within the last 14 days may be deleted.
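Given the 14-day purge policy, a quick sketch like this can list which scratch files are at risk (it uses the scratch path documented above and the standard find -atime test; it prints nothing if the path does not exist):

```shell
# List files last accessed more than 14 days ago, i.e. candidates
# for the periodic purge on the scratch file system.
SCRATCH="/global/scratch/users/$USER"
find "$SCRATCH" -type f -atime +14 2>/dev/null | head -n 20
```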
Partitions we are able to use for production beam runs are the following:
- cm1 -> 14 nodes with ~230G of RAM, 48 CPUs
- cf1 -> 72 nodes with ~180G of RAM, 64 CPUs
- es1 -> 14 (out of 47) nodes with ~500G of RAM, 64 CPUs
- lr6 -> 256 (out of 324) nodes with ~187G of RAM, 40 CPUs (nodes are often busy; waiting time was about 30 minutes once)
The maximum wall time for a job seems to be 3 days for all currently available QoSs: 3-00:00:00 (3 days, 0 hours, 0 minutes, 0 seconds)
Comparison of Lawrencium nodes: https://docs.google.com/spreadsheets/d/1JaLOa42qR9qBUgmvMX2XgTh2Z8R43Ng7UHitd9t_PQ8/edit?usp=sharing
Base functionality: https://github.com/LBNL-UCB-STI/beam/pull/3759
To authorize: login: [login], password: [pin code]+[one-time password]. Both the login and the pin code are set during initial user setup; one-time passwords are generated each time with an authenticator application, e.g. "Google Authenticator"
SSH in browser window: https://lrc-ondemand.lbl.gov/pun/sys/shell/ssh/default
Documentation: https://it.lbl.gov/resource/hpc/for-users/