awslabs / dgl-lifesci

Python package for graph neural networks in chemistry and biology
Apache License 2.0

GPU slower than CPU #150

Closed: xnuohz closed this issue 3 years ago

xnuohz commented 3 years ago

I tried to use the dgllife model_zoo to extract molecule features and found that the running speed on GPU was much slower than on CPU. It's hard to train a model on a 3090.

Device    GPU time    CPU time
T4        0.3637      0.0535
3090      93.5422     0.01018

My environment is as follows:

python 3.7
torch 1.7.0
dgl-cu101 0.6.1
dgllife 0.2.8
mufeili commented 3 years ago

Which model did you use? Can you provide a minimal script and data file to reproduce the issue? There's a chance that the CPU-to-GPU copy is much more costly than the computation itself. A larger batch size might help in that case.
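For example, something along these lines (an untested sketch; the repeated 'CCO' list is just a placeholder for your own SMILES) pays the host-to-device copy once per batch rather than once per molecule:

import dgl
import torch
from dgllife.utils import smiles_to_bigraph, AttentiveFPAtomFeaturizer, AttentiveFPBondFeaturizer
from dgllife.model.model_zoo.mpnn_predictor import MPNNPredictor

node_featurizer = AttentiveFPAtomFeaturizer()
edge_featurizer = AttentiveFPBondFeaturizer(self_loop=True)

# Placeholder batch of SMILES; replace with your own molecules
smiles_lst = ['CCO'] * 256

graphs = [smiles_to_bigraph(s,
                            add_self_loop=True,
                            node_featurizer=node_featurizer,
                            edge_featurizer=edge_featurizer) for s in smiles_lst]
# One batched graph means a single host-to-device copy for the whole batch
bg = dgl.batch(graphs).to('cuda:0')

model = MPNNPredictor(node_in_feats=node_featurizer.feat_size(),
                      edge_in_feats=edge_featurizer.feat_size()).to('cuda:0')
with torch.no_grad():
    # One forward pass covers all molecules in the batch
    preds = model(bg, bg.ndata['h'], bg.edata['e'])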

xnuohz commented 3 years ago
import dgl
import torch
from dgllife.utils import smiles_to_bigraph, AttentiveFPAtomFeaturizer, AttentiveFPBondFeaturizer
# from modules import MoleculeGNN
from dgllife.model.model_zoo.mpnn_predictor import MPNNPredictor

config = {
    'node_feat': AttentiveFPAtomFeaturizer(),
    'edge_feat': AttentiveFPBondFeaturizer(self_loop=True)
}

device = 'cuda:0'
# device = 'cpu'

smiles_lst = [
    'B1(C2=C(C=C(C=C2CO1)OC3=C(C=C(C(=N3)OCCOC(C)C)C#N)Cl)C)O'
]

gs = [smiles_to_bigraph(smiles,
                        add_self_loop=True,
                        node_featurizer=config['node_feat'],
                        edge_featurizer=config['edge_feat']) for smiles in smiles_lst]

gs = dgl.batch(gs).to(device)

model = MPNNPredictor(node_in_feats=config['node_feat'].feat_size(),
                      edge_in_feats=config['edge_feat'].feat_size()).to(device)

warmup_runs = 3
total_runs = 10
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
# Warm-up runs to amortize one-time GPU launch/initialization overhead
for _ in range(warmup_runs):
    res = model(gs, gs.ndata['h'], gs.edata['e'])
torch.cuda.synchronize()
# Timed runs, measured with CUDA events
start_event.record()
for _ in range(warmup_runs, total_runs):
    res = model(gs, gs.ndata['h'], gs.edata['e'])
end_event.record()
torch.cuda.synchronize()
elapsed_time_ms = start_event.elapsed_time(end_event)

print(elapsed_time_ms, res.size())

# 3090 gpu(ms): 69.52, cpu: 80.62
yzh119 commented 3 years ago

@xnuohz you are not using the correct way of measuring GPU time. Please read this section on how to profile GPU code correctly: https://pytorch.org/docs/stable/notes/cuda.html#asynchronous-execution
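The short version is that CUDA calls are asynchronous, so wall-clock timing without a synchronization mostly measures kernel launch, not execution. A minimal sketch of the difference (a standalone toy example, not your model):

import time
import torch

x = torch.randn(4096, 4096, device='cuda')

# Naive timing: the matmul is launched asynchronously, so this mostly
# measures the time to enqueue the kernel, not to run it
t0 = time.perf_counter()
y = x @ x
t1 = time.perf_counter()
print('without sync (ms):', (t1 - t0) * 1000)

# Correct timing: synchronize so the GPU work has actually finished
# before reading the clock
torch.cuda.synchronize()
t0 = time.perf_counter()
y = x @ x
torch.cuda.synchronize()
t1 = time.perf_counter()
print('with sync (ms):', (t1 - t0) * 1000)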

xnuohz commented 3 years ago

I've updated the code. Is it right?

yzh119 commented 3 years ago

It's correct now, but you need to add a warm-up phase (the GPU has launch overhead, which can be amortized over multiple runs).

For example:

warmup_runs = 3
total_runs = 10
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
for _ in range(warmup_runs):
    res = model(gs, gs.ndata['h'], gs.edata['e'])
torch.cuda.synchronize()
start_event.record()
for _ in range(warmup_runs, total_runs):
    res = model(gs, gs.ndata['h'], gs.edata['e'])
end_event.record()
torch.cuda.synchronize()
elapsed_time_ms = start_event.elapsed_time(end_event)
xnuohz commented 3 years ago

It seems the launch overhead is much longer than I expected.
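For reference, a rough way to isolate that overhead (an untested sketch, run in a fresh process and reusing model and gs from the script above) is to time the very first forward pass separately from steady-state calls:

import time
import torch

# The very first forward pass pays CUDA context creation, kernel launch
# and library initialization costs
torch.cuda.synchronize()
t0 = time.perf_counter()
model(gs, gs.ndata['h'], gs.edata['e'])
torch.cuda.synchronize()
print('first call (ms):', (time.perf_counter() - t0) * 1000)

# Steady-state cost per call after warm-up
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(10):
    model(gs, gs.ndata['h'], gs.edata['e'])
torch.cuda.synchronize()
print('per call after warm-up (ms):', (time.perf_counter() - t0) * 1000 / 10)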