MFlowCode / MFC

Exascale simulation of multiphase/physics fluid dynamics
https://mflowcode.github.io
MIT License

Shared node for benching #326

Closed by sbryngelson 6 months ago

sbryngelson commented 6 months ago

The idea here is that many benchmark jobs fail because we request an entire node just to benchmark on 1 or 2 GPUs. Taking over the whole node is ideal for benchmarking, but my view is that our testing should be mostly robust to someone else performing peripheral tasks on the same node. This update lets us share nodes and gets us into the queue much more quickly (at least that's my experience; we will see how the CI runs). We will also use other runners (like RG Violet/Quorra) for this purpose, so we will have multiple points of reference for performance.
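
For concreteness, the change boils down to a non-exclusive resource request along these lines. This is only an illustrative sketch, not the exact submit script in this PR; the partition, task counts, memory, and walltime are placeholders consistent with the Slurm job record shown below.

    #!/bin/bash
    # Sketch of a shared (non-exclusive) GPU benchmark request.
    # All values are placeholders echoing the scontrol output in this thread,
    # not the actual MFC submit script.
    #SBATCH --job-name=MFC-bench-gpu
    #SBATCH --partition=gpu-v100
    #SBATCH --ntasks=4
    #SBATCH --cpus-per-task=1
    #SBATCH --gpus=2                  # only the GPUs being benchmarked
    #SBATCH --mem-per-cpu=4G          # leave the rest of the node to others
    #SBATCH --time=04:00:00
    # No --exclusive flag: Slurm is free to backfill other users' jobs onto
    # the remaining cores and GPUs, which is the point of sharing the node.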

sbryngelson commented 6 months ago

Update: It looks like my new submit script may be giving us two nodes with one GPU each.

login-phoenix-slurm-1: 6/sbryngelson3 $ scontrol show job 4681047
JobId=4681047 JobName=MFC-bench-gpu
   UserId=sbryngelson3(3048356) GroupId=p-sbryngelson3(451953) MCS_label=N/A
   Priority=22 Nice=0 Account=gts-sbryngelson3 QOS=embers
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:02:30 TimeLimit=04:00:00 TimeMin=N/A
   SubmitTime=02:16:57 EligibleTime=02:16:57
   AccrueTime=Unknown
   StartTime=02:17:00 EndTime=06:17:00 Deadline=N/A
   PreemptEligibleTime=03:17:00 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=02:17:00 Scheduler=Main
   Partition=gpu-v100 AllocNode:Sid=login-phoenix-slurm-1:190474
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=atl1-1-02-003-36-0,atl1-1-02-007-31-0
   BatchHost=atl1-1-02-003-36-0
   NumNodes=2 NumCPUs=24 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=4,mem=16G,node=1,billing=17098,gres/gpu=2
   AllocTRES=cpu=24,mem=96G,node=2,billing=17098,gres/gpu=2,gres/gpu:v100=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4G MinTmpDiskNode=0
   Features=V100-16GB DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-3/_work/MFC/MFC/pr
   StdErr=/storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-3/_work/MFC/MFC/pr/bench-gpu.out
   StdIn=/dev/null
   StdOut=/storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-3/_work/MFC/MFC/pr/bench-gpu.out
   Power=
   CpusPerTres=gpu:12
   TresPerJob=gres/gpu:2

Not sure how much this matters at the moment.
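
The NumNodes=2 and gres/gpu:v100=2 lines above do suggest the two GPUs landed on different nodes. If that turns out to matter, one possibility (a sketch, not tested here) is to pin the allocation to a single node so both GPUs share a host:

    # Hypothetical directives to keep both GPUs on one node; not the actual
    # submit script from this PR.
    #SBATCH --nodes=1
    #SBATCH --gpus-per-node=2
    # A per-job GPU request alone (TresPerJob=gres/gpu:2 above, i.e. --gpus=2)
    # lets Slurm spread the GPUs across nodes if that schedules sooner.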

sbryngelson commented 6 months ago

@henryleberre this PR is failing, but the error is not clear to me. The logs seem fine. Several parts of this seem quite fragile.

 Comparing Bencharks: master/bench-cpu.yaml is x times slower than pr/bench-cpu.yaml.
 Warning: Metadata of lhs and rhs are not equal.

mfc: ERROR > mfc.py finished with a 1 exit code.
mfc: (venv) Exiting the Python virtual environment.
Error: Process completed with exit code 1.
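
One way to chase the "Metadata of lhs and rhs are not equal" warning is to diff the metadata blocks of the two bench files directly. The snippet below is a rough diagnostic sketch: it assumes the CI artifacts were copied into local master/ and pr/ directories and that each bench YAML carries a top-level metadata: section, neither of which is confirmed here.

    # Hypothetical local check; the paths and the 'metadata:' key are assumptions.
    diff <(sed -n '/^metadata:/,/^[^ ]/p' master/bench-cpu.yaml) \
         <(sed -n '/^metadata:/,/^[^ ]/p' pr/bench-cpu.yaml)

If the mismatch turns out to be expected (e.g. different runners or commits), the comparison could treat it as a warning rather than part of a hard failure.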