dmlc / dgl

Python package built to ease deep learning on graph, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

[GraphBolt] ItemSampler CPU usage too high, especially hetero case. #7315

Open mfbalin opened 4 months ago

mfbalin commented 4 months ago

🔨Work Item

IMPORTANT:

Project tracker: https://github.com/orgs/dmlc/projects/2

Description

When running the hetero GraphBolt example in pure-GPU mode, the CPU utilization is very high (4000%).


Depending work items or issues

Rhett-Ying commented 4 months ago

As @mfbalin mentioned, specific logic for ItemsetDict could be the culprit.

mfbalin commented 4 months ago

It now looks like dgl.create_block is the culprit.

mfbalin commented 4 months ago

dgl/heterograph.py:6407 make_canonical_edges uses numpy for some ops.
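For illustration only (not DGL's actual code, and assuming a CUDA device is available), the sketch below shows why a numpy call in an otherwise GPU-resident path shows up as CPU time: the tensor has to be synchronized and copied to the host first, and the operation then runs on host threads.

```python
# Illustrative only -- not DGL's make_canonical_edges implementation.
import numpy as np
import torch

src = torch.randint(0, 1000, (1_000_000,), device="cuda")

# CPU-bound variant: synchronizes the stream, copies to host, sorts with numpy.
order_np = np.argsort(src.cpu().numpy())

# GPU variant: stays on the device; the host only launches the kernel.
order_gpu = torch.argsort(src)
```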

Rhett-Ying commented 4 months ago

https://github.com/dmlc/dgl/blob/41a38486a5ed9298093d9f0bc415751269c7d577/python/dgl/convert.py#L583

Rhett-Ying commented 4 months ago

@peizhou001 can dgl.create_block() run purely on the GPU, or can it only run on the CPU? I remember you looked into it previously.

Rhett-Ying commented 4 months ago

@mfbalin tried bypassing the whole forward pass, including data.blocks, and CPU usage is still high. So dgl.create_block() is probably not the culprit.

mfbalin commented 4 months ago

Update: CPU usage is high even for the homo examples. Some recent change might have caused us to use the CPU even in pure-GPU mode. @frozenbugs do you think it could be the logic that moves the MiniBatch to the device?

mfbalin commented 4 months ago

Or could it possibly be one of my recent changes, such as #7312?

Oh the code in #7312 does not run in the homo case.

I am going to bisect to see if I can identify a commit that causes this issue.

mfbalin commented 4 months ago

`git checkout 78df81015a9a6cdaa4843167b1d000f4ca377ca9`

This commit does not have the issue. Somewhere between current master and the commit above, there was a change that caused high CPU utilization on the GPU code path.

Rhett-Ying commented 4 months ago

> `git checkout 78df81015a9a6cdaa4843167b1d000f4ca377ca9`
>
> This commit does not have the issue. Somewhere between current master and the commit above, there was a change that caused high CPU utilization on the GPU code path.

This could be https://github.com/dmlc/dgl/pull/7309. @yxy235 could you help look into it? Please reproduce and confirm.

mfbalin commented 4 months ago

The easiest way to test is to run `python examples/sampling/graphbolt/pyg/node_classification_advanced.py --torch-compile --mode=cuda-cuda-cuda`. There is up to a 30% regression.
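For reference, a hypothetical way to put a number on the CPU usage while the example runs (this helper is not part of the repository and assumes psutil is installed):

```python
# Hypothetical helper: sample process-level CPU utilization of a command.
import subprocess
import psutil

def run_and_sample_cpu(cmd, interval=0.5):
    proc = subprocess.Popen(cmd)
    ps = psutil.Process(proc.pid)
    samples = []
    while proc.poll() is None:
        try:
            # Percent over the interval; can exceed 100% with many threads.
            samples.append(ps.cpu_percent(interval=interval))
        except psutil.NoSuchProcess:
            break
    return samples

# e.g. run_and_sample_cpu(["python",
#     "examples/sampling/graphbolt/pyg/node_classification_advanced.py",
#     "--torch-compile", "--mode=cuda-cuda-cuda"])
```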

mfbalin commented 4 months ago

Transferred attr list:

['blocks', 'compacted_negative_dsts', 'compacted_negative_srcs', 'compacted_node_pairs', 'compacted_seeds', 'edge_features', 'indexes', 'input_nodes', 'labels', 'negative_dsts', 'negative_node_pairs', 'negative_srcs', 'node_features', 'node_pairs', 'node_pairs_with_labels', 'positive_node_pairs', 'sampled_subgraphs', 'seed_nodes', 'seeds']

compacted_negative_dsts
compacted_negative_srcs
compacted_node_pairs
compacted_seeds
edge_features
indexes
input_nodes
labels
negative_dsts
negative_srcs
node_features
node_pairs
sampled_subgraphs
seed_nodes
seeds

Actually transferred by calling .to:

input_nodes
labels
seeds
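For reference, a hypothetical way to produce a report like the one above (not necessarily how these lists were generated; it is simplified to tensor-valued attributes only):

```python
# Hypothetical diagnostic, limited to tensor-valued attributes.
import torch

def report_transferred(minibatch, attr_names, device="cuda"):
    """Print which attributes hold a tensor on `device` after .to()."""
    moved = minibatch.to(device)
    for name in attr_names:
        value = getattr(moved, name, None)
        if isinstance(value, torch.Tensor) and value.device.type == device:
            print(name)
```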

mfbalin commented 4 months ago

Looks like the `blocks` property is called inside MiniBatch.to() even for the PyG example.

yxy235 commented 4 months ago

> Transferred attr list: [...]
>
> Actually transferred by calling .to: input_nodes, labels, seeds

I see. Do you think we need a check when calling MiniBatch.to()?

mfbalin commented 4 months ago

I figured it out. When we filter which attributes to transfer, we end up calling the `blocks` property. Making a quick patch now.
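A minimal sketch of the pattern described above (illustrative only; the class and names below are hypothetical, not the actual GraphBolt MiniBatch code): filtering attributes with getattr over all public names also evaluates lazily computed properties such as `blocks`, so block construction runs on the CPU even when nothing downstream needs it, while restricting the filter to stored fields avoids the property.

```python
# Illustrative sketch of the failure mode, not the actual MiniBatch code.
import dataclasses

@dataclasses.dataclass
class FakeMiniBatch:
    input_nodes: object = None
    labels: object = None
    seeds: object = None

    @property
    def blocks(self):
        # Stand-in for expensive CPU-side block construction.
        print("building blocks on the CPU...")
        return []

mb = FakeMiniBatch(input_nodes=1, labels=2, seeds=3)

# Problematic filter: touches the property and pays for block construction.
attrs_bad = [a for a in dir(mb)
             if not a.startswith("_") and getattr(mb, a) is not None]

# Safer filter: only inspects stored dataclass fields, never the property.
attrs_good = [f.name for f in dataclasses.fields(mb)
              if getattr(mb, f.name) is not None]
```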

mfbalin commented 4 months ago

CPU usage is still higher than 100%, though, so I am not sure whether I resolved the whole issue.

mfbalin commented 4 months ago

Even with #7330, we need to investigate where the high CPU usage comes from. CPU usage is 800% for our main pure-GPU (--mode=cuda-cuda) node classification example.
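One way to attribute the remaining CPU time (a sketch, not taken from the example scripts; the dataloader and training step below are placeholders) is to wrap a few iterations in torch.profiler and sort operators by CPU time:

```python
# Sketch: profile a few iterations and rank operators by total CPU time.
# `dataloader` and `train_step` are placeholders for the real objects.
import torch
from torch.profiler import profile, ProfilerActivity

dataloader = [torch.randn(8, 4) for _ in range(32)]  # placeholder iterable
def train_step(batch):                                # placeholder step
    return (batch @ batch.T).sum()

with profile(activities=[ProfilerActivity.CPU]) as prof:
    for step, minibatch in enumerate(dataloader):
        train_step(minibatch)
        if step == 20:
            break

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))
```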

mfbalin commented 4 months ago

The hetero example's CPU usage is still 4000%.

mfbalin commented 4 months ago

@Rhett-Ying Here, we can see the last iterations of the training dataloader for the hetero example. Since we have a prefetcher thread with a buffer size of 2, the last 2 iterations don't show excessive CPU utilization, as the computation for the last 2 iterations has already finished. This indicates that the high CPU utilization is due to the ItemSampler.

        # (4) Cut datapipe at CopyTo and wrap with prefetcher. This enables the
        # data pipeline up to the CopyTo operation to run in a separate thread.
        datapipe_graph = _find_and_wrap_parent(
            datapipe_graph,
            CopyTo,
            dp.iter.Prefetcher,
            buffer_size=2,
        )
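As a minimal illustration of the behavior described above (a generic producer/consumer sketch, not the actual dp.iter.Prefetcher implementation): the worker thread stays up to buffer_size items ahead of the consumer, so the CPU-heavy sampling for the final items has already finished by the time the training loop consumes them.

```python
# Generic bounded prefetcher sketch, not the torchdata Prefetcher itself.
import queue
import threading

def prefetch(iterable, buffer_size=2):
    buf = queue.Queue(maxsize=buffer_size)
    sentinel = object()

    def worker():
        for item in iterable:
            buf.put(item)   # blocks once the buffer holds buffer_size items
        buf.put(sentinel)

    threading.Thread(target=worker, daemon=True).start()
    while (item := buf.get()) is not sentinel:
        yield item
```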

*(screenshot: CPU utilization during the last dataloader iterations)*

mfbalin commented 4 months ago

Users with multiple GPUs may not be able to utilize the GPUs effectively due to a potential CPU bottleneck.

mfbalin commented 1 month ago

In `examples/graphbolt/pyg/labor`, running `python node_classification.py --dataset=yelp --dropout=0 --mode=cuda-cuda-cuda`: CPU usage on this example is too high as well, and this is the homo case. @Rhett-Ying The CPU becomes the bottleneck; a faster CPU results in faster performance even if the GPU is slower. 10000% CPU usage.

Rhett-Ying commented 1 month ago

@mfbalin So the culprit of the high CPU usage is the ItemSampler?