[regression] DGL multi-gpu benchmarks fails on `g5.48xlarge`

Rhett-Ying commented 4 months ago

🔨Work Item

Project tracker: https://github.com/orgs/dmlc/projects/2

Description

DGL multi-GPU examples fail on a single GPU, 4 GPUs, and 8 GPUs. For num_gpus=0, no stderr is returned. For num_gpus=4, the run crashes with the error below:

Training in benchmark mode using 4 GPU(s)

For num_gpus=8, `torch.cuda.device_count()` returns 4 instead of 8, so the benchmark is skipped:

Skip because the number of GPUs available[4] is less than 8
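
For reference, the skip behaviour described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the function name `skip_reason` is hypothetical, and the available GPU count is passed in as a plain integer so the snippet runs without PyTorch (in the real benchmark it would presumably come from `torch.cuda.device_count()`).

```python
from typing import Optional


def skip_reason(requested_gpus: int, available_gpus: int) -> Optional[str]:
    """Return a skip message if too few GPUs are visible, else None.

    available_gpus is a parameter so this sketch has no PyTorch dependency;
    a real benchmark would likely read it from torch.cuda.device_count().
    """
    if available_gpus < requested_gpus:
        return (
            f"Skip because the number of GPUs available[{available_gpus}] "
            f"is less than {requested_gpus}"
        )
    return None


# Reproduces the situation in this report: 8 GPUs requested, 4 visible.
print(skip_reason(8, 4))
```

A silent skip like this hides a misconfigured instance, so it may be worth failing loudly when the instance type is known to have more GPUs than are visible.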

Report

multi_gpu.bench_dgl_multigpu_node_classification.track_acc  'ogbn-products' 'cpu-cuda'  '0' None
multi_gpu.bench_dgl_multigpu_node_classification.track_acc  'ogbn-products' 'cpu-cuda'  '0,1,2,3'   None
multi_gpu.bench_dgl_multigpu_node_classification.track_time 'ogbn-products' 'cpu-cuda'  '0' None

Depending work items or issues

Rhett-Ying commented 4 months ago

Root causes found:

The first run needs more than 600s to download the dataset.
Only 4 GPUs were set as visible instead of 8. I have removed all resource limits, so all resources of the specified instance are now used.
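
One possible way to keep the dataset download out of the 600s window is to check whether the dataset is already cached before the timed run starts, and download it in a separate warm-up step if it is not. The sketch below only shows the cache check. The helper name `dataset_is_cached` and the idea of a single cache directory are assumptions for illustration, not how the DGL benchmark suite is actually organised.

```python
import os

# Limit mentioned in this thread; the first run exceeded it while downloading.
TIMEOUT_S = 600


def dataset_is_cached(cache_dir: str) -> bool:
    """Return True if cache_dir exists and contains at least one entry.

    Hypothetical helper: a populated directory is treated as "already
    downloaded". It does not verify that the download is complete.
    """
    if not os.path.isdir(cache_dir):
        return False
    with os.scandir(cache_dir) as entries:
        return any(True for _ in entries)


def needs_warmup(cache_dir: str) -> bool:
    """Return True when the dataset should be fetched before the timed run."""
    return not dataset_is_cached(cache_dir)
```

If `needs_warmup` returns True, the download can be triggered once outside the benchmark so that the timed run only measures training.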

mfbalin commented 4 months ago

It seems to be resolved now.

Rhett-Ying commented 4 months ago

resolved.

dmlc / dgl

[regression] DGL multi-gpu benchmarks fails on `g5.48xlarge` #7345
