carbonscott / exp-peaknet

Run peaknet experiments

Set up training on Perlmutter at NERSC #13

Closed: carbonscott closed this issue 3 weeks ago

carbonscott commented 3 weeks ago

Use srun as the launcher, with one task per GPU (Perlmutter GPU nodes have 4 GPUs each): srun --ntasks-per-node=4 ... python train.py

Refer to page 22 at https://docs.google.com/presentation/d/1FB2vqlibSWECRsCOFK2tMr_jT_PVyM06nAgmIR5qzhE/edit#slide=id.g29a556e7c6f_1_67
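
When srun launches the processes (instead of torchrun), each task gets its rank from Slurm rather than from the RANK / WORLD_SIZE / LOCAL_RANK variables that torch.distributed reads by default. A minimal sketch of the mapping follows. The helper name `slurm_to_torch_env` is hypothetical and not taken from train.py, and it assumes MASTER_ADDR and MASTER_PORT are exported separately by the batch script.

```python
import os


def slurm_to_torch_env(environ=None):
    """Copy Slurm's per-task variables into the names torch.distributed reads.

    Hypothetical helper: the real train.py may handle this differently.
    """
    env = os.environ if environ is None else environ

    rank = int(env["SLURM_PROCID"])         # global rank of this task
    world_size = int(env["SLURM_NTASKS"])   # total number of tasks in the job step
    local_rank = int(env["SLURM_LOCALID"])  # task index within its node

    # The default env:// rendezvous looks for these names.
    env["RANK"] = str(rank)
    env["WORLD_SIZE"] = str(world_size)
    env["LOCAL_RANK"] = str(local_rank)
    return rank, world_size, local_rank
```

A training script could call this once at startup, before `torch.distributed.init_process_group(backend="nccl")`. Whether `local_rank` is the right CUDA device index depends on the GPU binding options passed to srun, so treat that part as an assumption to verify on the node.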

carbonscott commented 3 weeks ago

NCCL issues were also reported elsewhere; see https://github.com/NVIDIA/nccl/issues/1024 and https://github.com/hiyouga/LLaMA-Factory/issues/1169

carbonscott commented 3 weeks ago

Just use PyTorch 2.0.1 for now.
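
To catch an accidentally different environment early, the pin can be checked at startup. This is only a sketch: `matches_pinned_version` is a hypothetical helper, and it assumes the installed version string may carry a local build suffix such as `+cu117`.

```python
def matches_pinned_version(version, pinned="2.0.1"):
    """Return True if a version string such as '2.0.1+cu117' matches the pin."""
    base = version.split("+", 1)[0]  # drop a local build suffix like '+cu117'
    return base == pinned
```

In a training script this could be used as `import torch; assert matches_pinned_version(torch.__version__)`.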