dmlc / dgl

Python package built to ease deep learning on graph, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

[regression] DGL multi-GPU benchmarks fail on `g5.48xlarge` #7345

Closed Rhett-Ying closed 4 months ago

Rhett-Ying commented 4 months ago

🔨Work Item

Project tracker: https://github.com/orgs/dmlc/projects/2

Description

DGL multi-GPU examples fail on a single GPU, 4 GPUs, and 8 GPUs. For num_gpus=0, no stderr is returned. For num_gpus=4, it crashed with the error below:

Training in benchmark mode using 4 GPU(s)

For num_gpus=8, torch.cuda.device_count() returns 4 instead of 8, so the benchmark is skipped:

Skip because the number of GPUs available[4] is less than 8
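
A minimal sketch of the kind of guard that would produce the skip message above. The function names `skip_reason` and `available_gpus` are hypothetical and are not taken from the DGL benchmark suite; only `torch.cuda.device_count()` comes from this issue, and it is called only if PyTorch is installed.

```python
from typing import Optional


def skip_reason(requested: int, available: int) -> Optional[str]:
    """Return a skip message if fewer GPUs are visible than requested, else None.

    Hypothetical helper; the message format mirrors the log line in this issue.
    """
    if available < requested:
        return (
            f"Skip because the number of GPUs available[{available}] "
            f"is less than {requested}"
        )
    return None


def available_gpus() -> int:
    """Number of CUDA devices visible to this process (0 if PyTorch is missing)."""
    try:
        import torch  # imported lazily so the sketch also runs without PyTorch
    except ImportError:
        return 0
    return torch.cuda.device_count()


if __name__ == "__main__":
    reason = skip_reason(requested=8, available=available_gpus())
    print(reason if reason else "Enough GPUs are visible; running the benchmark.")
```

With 4 visible devices and 8 requested, `skip_reason(8, 4)` returns exactly the message quoted above.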

Report

multi_gpu.bench_dgl_multigpu_node_classification.track_acc   'ogbn-products'  'cpu-cuda'  '0'        None
multi_gpu.bench_dgl_multigpu_node_classification.track_acc   'ogbn-products'  'cpu-cuda'  '0,1,2,3'  None
multi_gpu.bench_dgl_multigpu_node_classification.track_time  'ogbn-products'  'cpu-cuda'  '0'        None

Depending work items or issues

Rhett-Ying commented 4 months ago

Root cause found:

  1. The first run needs more than 600s to download the dataset.
  2. Visible GPUs were set to 4 instead of 8. I have removed all resource limits, so all resources of the specified instance are now used.
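
Both causes can be checked before a benchmark run. The sketch below is illustrative only and rests on two assumptions that the thread does not confirm: that 600s is the time limit applied to a benchmark step, and that GPU visibility may be restricted through the `CUDA_VISIBLE_DEVICES` environment variable (the issue does not say which mechanism limited the GPUs to 4). The helper names are hypothetical.

```python
import os
import time
from typing import Callable, List, Mapping, Optional, Tuple


def visible_gpu_ids(env: Mapping[str, str] = os.environ) -> Optional[List[str]]:
    """Parse CUDA_VISIBLE_DEVICES from `env`.

    Returns None when the variable is unset (no restriction from this variable),
    otherwise the list of device identifiers it exposes (possibly empty).
    """
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [tok.strip() for tok in raw.split(",") if tok.strip()]


def exceeds_budget(step: Callable[[], object],
                   budget_s: float = 600.0) -> Tuple[bool, float]:
    """Run `step` once and report whether it took longer than `budget_s` seconds.

    The 600s default is an assumption based on the figure quoted in this issue.
    """
    start = time.perf_counter()
    step()
    elapsed = time.perf_counter() - start
    return elapsed > budget_s, elapsed


if __name__ == "__main__":
    ids = visible_gpu_ids()
    print("CUDA_VISIBLE_DEVICES:", "unset" if ids is None else ids)
    # Replace the lambda with the real dataset download to time a first run.
    too_slow, seconds = exceeds_budget(lambda: None)
    print(f"step took {seconds:.3f}s, over budget: {too_slow}")
```

Timing the dataset download separately from the benchmark itself makes it clear whether a first-run failure comes from the download rather than from training.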

mfbalin commented 4 months ago

It seems to be resolved now.

Rhett-Ying commented 4 months ago

Resolved.