xnuohz closed this issue 3 years ago
Which model did you use? Can you provide a minimal script and data file to reproduce the issue? There's a chance that the CPU-to-GPU copy is much more costly than the computation itself. A larger batch size might help in that case.
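For illustration, here is a minimal sketch of what a larger batch might look like with an MPNN predictor from the model zoo (the SMILES strings and replication factor below are made up, not taken from this issue):

import dgl
import torch
from dgllife.utils import smiles_to_bigraph, AttentiveFPAtomFeaturizer, AttentiveFPBondFeaturizer
from dgllife.model.model_zoo.mpnn_predictor import MPNNPredictor

node_featurizer = AttentiveFPAtomFeaturizer()
edge_featurizer = AttentiveFPBondFeaturizer(self_loop=True)
device = 'cuda:0'

# Made-up SMILES, repeated to form a larger batch
smiles_lst = ['CCO', 'c1ccccc1', 'CC(=O)O'] * 128

graphs = [smiles_to_bigraph(s,
                            add_self_loop=True,
                            node_featurizer=node_featurizer,
                            edge_featurizer=edge_featurizer) for s in smiles_lst]
# Batching first means a single CPU-to-GPU copy covers all molecules
bg = dgl.batch(graphs).to(device)

model = MPNNPredictor(node_in_feats=node_featurizer.feat_size(),
                      edge_in_feats=edge_featurizer.feat_size()).to(device)

with torch.no_grad():
    preds = model(bg, bg.ndata['h'], bg.edata['e'])
print(preds.shape)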
import dgl
import torch
from dgllife.utils import smiles_to_bigraph, AttentiveFPAtomFeaturizer, AttentiveFPBondFeaturizer
# from modules import MoleculeGNN
from dgllife.model.model_zoo.mpnn_predictor import MPNNPredictor

config = {
    'node_feat': AttentiveFPAtomFeaturizer(),
    'edge_feat': AttentiveFPBondFeaturizer(self_loop=True)
}

device = 'cuda:0'
# device = 'cpu'

smiles_lst = [
    'B1(C2=C(C=C(C=C2CO1)OC3=C(C=C(C(=N3)OCCOC(C)C)C#N)Cl)C)O'
]

gs = [smiles_to_bigraph(smiles,
                        add_self_loop=True,
                        node_featurizer=config['node_feat'],
                        edge_featurizer=config['edge_feat']) for smiles in smiles_lst]
gs = dgl.batch(gs).to(device)

model = MPNNPredictor(node_in_feats=config['node_feat'].feat_size(),
                      edge_in_feats=config['edge_feat'].feat_size()).to(device)

warmup_runs = 3
total_runs = 10
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)

# Warm-up runs to amortize GPU kernel launch overhead
for _ in range(warmup_runs):
    res = model(gs, gs.ndata['h'], gs.edata['e'])
torch.cuda.synchronize()

# Timed runs
start_event.record()
for _ in range(warmup_runs, total_runs):
    res = model(gs, gs.ndata['h'], gs.edata['e'])
end_event.record()
torch.cuda.synchronize()

elapsed_time_ms = start_event.elapsed_time(end_event)
print(elapsed_time_ms, res.size())
# 3090 gpu(ms): 69.52, cpu: 80.62
@xnuohz you are not measuring GPU time correctly. Please read this section on how to profile GPU code: https://pytorch.org/docs/stable/notes/cuda.html#asynchronous-execution
I've updated the code. Is it right?
It's correct now, but you need to add a warm-up phase (the GPU has launch overhead, which gets amortized over multiple runs).
For example:
warmup_runs = 3
total_runs = 10
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)

for _ in range(warmup_runs):
    res = model(gs, gs.ndata['h'], gs.edata['e'])
torch.cuda.synchronize()

start_event.record()
for _ in range(warmup_runs, total_runs):
    res = model(gs, gs.ndata['h'], gs.edata['e'])
end_event.record()
torch.cuda.synchronize()

elapsed_time_ms = start_event.elapsed_time(end_event)
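Dividing the elapsed time by the number of timed iterations gives a per-forward-pass number, which is easier to compare across devices, e.g.:

timed_runs = total_runs - warmup_runs
print('avg per-run time: {:.2f} ms'.format(elapsed_time_ms / timed_runs))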
It seems the launch overhead is much higher than I expected.
I tried to use the dgllife model zoo to extract molecule features and found that it runs much more slowly on GPU than on CPU. This makes it hard to train a model on a 3090.
My environment is as follows: