NVIDIA / DeepLearningExamples

State-of-the-Art Deep Learning scripts organized by models - easy to train and deploy with reproducible accuracy and performance on enterprise-grade infrastructure.

[ELECTRA/TensorFlow2] Minor: README Invokes Slurm sbatch With Incorrect Parameter? #1325

Open psharpe99 opened 12 months ago

psharpe99 commented 12 months ago

Related to ELECTRA/TensorFlow2

Describe the bug

The README in the MultiNode section says:

BATCHSIZE=176 LR=6e-3 GRAD_ACCUM_STEPS=1 PHASE=1 STEPS=10000 WARMUP=2000 b1=0.878 b2=0.974 decay=0.5 skip_adaptive=yes end_lr=0.0 sbatch N48 --ntasks-per-node=8 run.sub
BATCHSIZE=24 LR=4e-3 GRAD_ACCUM_STEPS=3 PHASE=2 STEPS=930 WARMUP=200 b1=0.878 b2=0.974 decay=0.5 skip_adaptive=yes end_lr=0.0 sbatch N48 --ntasks-per-node=8 run.sub

I think this should be "-N48" in both commands. The Slurm sbatch manpage has:

sbatch [OPTIONS(0)...] [ : [OPTIONS(N)...]] script(0) [args(0)...] 
     :
-N, --nodes=<minnodes>[-maxnodes]|<size_string>
    Request that a minimum of minnodes nodes be allocated to this job. 

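With the README command as given, sbatch would treat "N48" as the script name rather than as an option and, going by the synopsis above, would pass "--ntasks-per-node=8 run.sub" as arguments to that script.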
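To make the difference concrete, here is a toy shell sketch of the parsing rule implied by the synopsis above: options are consumed until the first non-option word, which is then taken as the script. The function `parse_like_sbatch` is hypothetical and only a stand-in; it is not sbatch and handles just enough options for this illustration.

```shell
# Toy stand-in for the argument handling described in the sbatch synopsis.
# NOT the real sbatch: parse_like_sbatch is a hypothetical helper.
parse_like_sbatch() {
  nodes=""
  script=""
  while [ "$#" -gt 0 ]; do
    case "$1" in
      -N)        nodes="$2"; shift 2 ;;            # "-N 48"
      -N*)       nodes="${1#-N}"; shift ;;         # "-N48"
      --nodes=*) nodes="${1#--nodes=}"; shift ;;   # "--nodes=48"
      --*=*)     shift ;;                          # other long options: ignored here
      *)         script="$1"; shift; break ;;      # first non-option word = script
    esac
  done
  # Whatever is left would be handed to the script as its arguments.
  echo "nodes=${nodes:-unset} script=${script} args=$*"
}

# As written in the README: "N48" is picked up as the script name.
parse_like_sbatch N48 --ntasks-per-node=8 run.sub

# With the proposed fix: 48 nodes requested, run.sub is the script.
parse_like_sbatch -N48 --ntasks-per-node=8 run.sub

# Proposed README form (only the "-N48" changes; "--nodes=48" is the long spelling):
#   ... end_lr=0.0 sbatch -N48 --ntasks-per-node=8 run.sub
```

In this sketch the first call reports `script=N48` with no node count set, while the second reports `nodes=48` and `script=run.sub`, which is the behavior the README presumably intends.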

To Reproduce N/A

Expected behavior N/A

Environment N/A