LBNL-UCB-STI / beam

The Framework for Modeling Behavior, Energy, Autonomy, and Mobility in Transportation Systems
https://transportation.lbl.gov/beam

Cluster: use BEAM with Lawrencium cluster #3732

Open nikolayilyin opened 1 year ago

nikolayilyin commented 1 year ago

To authorize: login: [login], password: [pin code]+[one-time password]. Both the login and the PIN code are set during initial user setup; one-time passwords are generated each time with an authenticator application, e.g. Google Authenticator.

SSH in a browser window: https://lrc-ondemand.lbl.gov/pun/sys/shell/ssh/default

Documentation: https://it.lbl.gov/resource/hpc/for-users/
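Besides the browser shell, ordinary SSH works as well. A minimal sketch, assuming the login host lrc-login.lbl.gov from the LBNL HPC documentation linked above; at the password prompt, type the PIN code immediately followed by the one-time password:

```bash
# Log in to a Lawrencium login node over SSH.
# The hostname lrc-login.lbl.gov is an assumption based on the LBNL HPC docs;
# enter [pin code][one-time password] when prompted for the password.
ssh $USER@lrc-login.lbl.gov
```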

nikolayilyin commented 1 year ago

sacctmgr show association -p user=$USER    # accounts, partitions, and QoSs available to the current user
sacctmgr show qos <qos name>               # show information about the QoS
sinfo --long --partition=lr_bigmem         # show information about a partition
sinfo -N --long --partition=lr_bigmem      # show information about the nodes of the selected partition

The list of partitions is available at https://it.lbl.gov/resource/hpc/lawrencium/
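For example, the association output can be narrowed to just the fields needed before choosing where to submit; the format= column selection and -p (pipe-separated output) are standard sacctmgr options, and the chosen columns are only one possible selection:

```bash
# Show only account, partition, and QoS for the current user, pipe-separated.
sacctmgr show association user=$USER format=Account,Partition,QOS -p
```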

nikolayilyin commented 1 year ago

srun <shell script>     # run a job interactively
sbatch <shell script>   # run a job in the background
squeue                  # check the current jobs in the batch queue system; e.g. squeue -u $USER shows jobs for the current user only
sinfo                   # view the current status of the queues
scancel <job id>        # cancel a job

To see all jobs (completed/cancelled/running) for a user starting from a given date:

sacct -u $USER --format=JobID,JobName%30,state,start,end,elapsed,nnodes,ncpus,nodelist,user,partition,maxrss,maxvmsize,time -S 2023-03-2
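A minimal sketch of the submit-and-monitor cycle built from these commands; test-simple-job.sh is the example script from the next comment, and the job id 12345 is only a placeholder:

```bash
# Submit a job in the background, then watch and, if needed, cancel it.
sbatch test-simple-job.sh    # prints "Submitted batch job <job id>"
squeue -u $USER              # jobs of the current user still in the queue
scancel 12345                # cancel by job id (12345 is a placeholder)
sacct -u $USER --format=JobID,JobName%30,state,elapsed,maxrss -S 2023-03-2   # history since the date above
```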

nikolayilyin commented 1 year ago

Specifying SLURM parameters in the shell script (with #SBATCH comments) might not work; in that case it is possible to specify them directly on the command line. For example, to run the test-simple-job.sh script on the ood_inter partition under the pc_beamcore account:

srun -p ood_inter -A pc_beamcore test-simple-job.sh

The srun documentation is available in the SLURM docs. The shell file containing the job must be marked as executable (chmod +x).
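A minimal sketch of what test-simple-job.sh could look like with the same parameters embedded as #SBATCH comments; only the partition and account come from the comment above, while the job name, time limit, and script body are illustrative:

```bash
#!/bin/bash
#SBATCH --job-name=beam-test      # illustrative job name
#SBATCH --partition=ood_inter     # partition from the srun example above
#SBATCH --account=pc_beamcore     # account from the srun example above
#SBATCH --time=00:10:00           # illustrative 10-minute limit

# Illustrative body: report which node the job landed on.
echo "Running on $(hostname)"
```

If the #SBATCH lines are ignored, the same parameters can be passed on the command line, e.g. sbatch -p ood_inter -A pc_beamcore test-simple-job.sh, or with srun as shown above (after chmod +x test-simple-job.sh).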

nikolayilyin commented 1 year ago

Storage:

Default user directory: /global/home/users/$USER, limited to 20 GB.

User temporary storage: /global/scratch/users/$USER has no disk quota. It is intended for short-term use and should be considered volatile: backups are not performed on this file system, and data is subject to a periodic purge policy whereby any files not accessed within the last 14 days will be deleted.
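A hedged sketch of how the two areas might be used together, given the 20 GB home quota and the 14-day scratch purge; the beam-output directory name is illustrative:

```bash
# Work on scratch (no quota, but purged), keep only results in home (20 GB limit).
mkdir -p /global/scratch/users/$USER/beam-output   # illustrative working directory on scratch
du -sh /global/home/users/$USER                    # check how much of the 20 GB home quota is used
# Copy results worth keeping back home before the 14-day purge can remove them.
cp -r /global/scratch/users/$USER/beam-output /global/home/users/$USER/
```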

nikolayilyin commented 1 year ago

The partitions we are able to use for production BEAM runs are the following:

cm1 -> 14 nodes with ~230 GB of RAM, 48 CPUs
cf1 -> 72 nodes with ~180 GB of RAM, 64 CPUs
es1 -> 14 (out of 47) nodes with ~500 GB of RAM, 64 CPUs
lr6 -> 256 (out of 324) nodes with ~187 GB of RAM, 40 CPUs (nodes are often busy; the waiting time was about 30 minutes once)

The maximum wall time for a job seems to be 3 days for all currently available QoSs: 3-00:00:00 (3 days, 0 hours, 0 minutes, 0 seconds).
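Putting the partition list and wall-time limit together, a production submission header might look roughly like this; the partition, account, and 3-day limit come from the comments above, while the node, CPU, and memory requests and the run script are illustrative placeholders:

```bash
#!/bin/bash
#SBATCH --partition=lr6          # one of the production partitions listed above
#SBATCH --account=pc_beamcore    # account from the earlier example
#SBATCH --time=3-00:00:00        # maximum wall time observed for the available QoSs
#SBATCH --nodes=1                # illustrative: a single node per BEAM run
#SBATCH --cpus-per-task=40       # illustrative: all 40 CPUs of an lr6 node
#SBATCH --mem=180G               # illustrative: most of the ~187 GB of an lr6 node

# Illustrative body: launch a BEAM run script staged on scratch.
cd /global/scratch/users/$USER/beam   # placeholder path
./run-beam.sh                         # placeholder for the actual BEAM launch command
```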

nikolayilyin commented 11 months ago

A comparison of the Lawrencium nodes: https://docs.google.com/spreadsheets/d/1JaLOa42qR9qBUgmvMX2XgTh2Z8R43Ng7UHitd9t_PQ8/edit?usp=sharing

nikolayilyin commented 11 months ago

Base functionality: https://github.com/LBNL-UCB-STI/beam/pull/3759