Closed mwolinska closed 1 year ago
Hi,
This is a very interesting topic.
Below is a speed benchmark of `chgnet.predict_structure()`.
The main conclusion: as long as `batch_size` is larger than 2, the time per structure remains almost constant.
Since a realistic dataset contains structures of various sizes that could blow up GPU memory, my advice is to set `batch_size` to a safe value depending on your GPU memory and structure sizes.
```py
import time

import matplotlib.pyplot as plt
from chgnet.model import CHGNet
from pymatgen.core import Structure

model = CHGNet.load()
model_gpu = CHGNet.load().to("cuda")

one_structure = Structure.from_file("./tmp.cif")
structure_list = [one_structure] * 100

batch_sizes = [1, 2, 10, 20, 50]
time_per_structure_cpu = []
time_per_structure_cuda = []
for batch_size in batch_sizes:
    print(batch_size)
    start = time.perf_counter()
    model.predict_structure(structure_list, batch_size=batch_size)
    t = (time.perf_counter() - start) / len(structure_list)
    time_per_structure_cpu.append(t)

    start = time.perf_counter()
    model_gpu.predict_structure(structure_list, batch_size=batch_size)
    t = (time.perf_counter() - start) / len(structure_list)
    time_per_structure_cuda.append(t)

fig = plt.figure(figsize=(6, 4))
ax = fig.add_subplot()
ax.set(title="CHGNet batch_size", xlabel="batch size", ylabel="time per structure (s)")
ax.plot(batch_sizes, time_per_structure_cpu, label="cpu")
ax.plot(batch_sizes, time_per_structure_cuda, label="cuda")
ax.legend()
plt.show()
```
Running CHGNet on CPU locally on an M2 Max chip with variable-sized structures, I see different behavior:
```py
import time
from random import randint

import pandas as pd
from chgnet.model import CHGNet
from pymatgen.core import Lattice, Structure

model = CHGNet.load()
# model_gpu = CHGNet.load().to("cuda")

struct = Structure(
    lattice=Lattice.cubic(3),
    species=("Fe", "Fe"),
    coords=((0, 0, 0), (0.5, 0.5, 0.5)),
)
structure_list = [
    struct.make_supercell([randint(1, 3) for _ in range(3)], in_place=False)
    for _ in range(100)
]
pd.Series(map(len, structure_list)).hist(bins=100)

batch_sizes = [1, 2, 10, 20, 50]
time_per_structure_cpu = []
time_per_structure_cuda = []
for batch_size in batch_sizes:
    print(batch_size)
    start = time.perf_counter()
    model.predict_structure(structure_list, batch_size=batch_size)
    t_per_struct = (time.perf_counter() - start) / len(structure_list)
    time_per_structure_cpu.append(t_per_struct)

    # start = time.perf_counter()
    # model_gpu.predict_structure(structure_list, batch_size=batch_size)
    # t_per_struct = (time.perf_counter() - start) / len(structure_list)
    # time_per_structure_cuda.append(t_per_struct)

ax = pd.Series(time_per_structure_cpu, index=batch_sizes).plot()
ax.set(title="CHGNet batch_size", xlabel="batch size", ylabel="time per structure (s)")
ax.figure.savefig("chgnet_batch_size_vs_time_per_struct.svg")
```
@janosh
I have reproduced your result on Apple M2, which also gives the lowest inference time at `batch_size=10`.
Here is the result running your code on CUDA.
Looks like M2 has some different mechanism that causes this result.
I think we should update our default `predict_structure` batch_size to 20?
Makes sense! Maybe 16.
Hi both, thank you so much for your thoughts and quick replies! I did my benchmarking locally on CPU, as I didn't have access to a GPU - will try this today.
When you say that some structure sizes can cause memory explosion on GPU, what kind of sizes are you thinking about?
@mwolinska
Say you have 10 GB of GPU memory that can support ~3000 atoms. If your dataset contains structures with 20 to 100 atoms, then you should consider setting `batch_size` to at most 30 (3000 atoms ÷ 100 atoms in the largest structure). A good way is to monitor GPU usage while training with `nvidia-smi`.
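The rule of thumb above can be written as a small helper. This is just a sketch: `safe_batch_size` is a hypothetical name, and the ~3000-atom figure is the example capacity from this thread, which depends on your GPU and model.

```python
def safe_batch_size(atom_capacity: int, max_atoms_per_structure: int) -> int:
    """Largest batch size whose worst-case atom count still fits in GPU memory.

    atom_capacity: rough number of atoms your GPU memory can hold at once
        (e.g. ~3000 atoms for 10 GB in the example above).
    max_atoms_per_structure: atom count of the largest structure in the dataset.
    """
    # Worst case: every structure in the batch is the largest one.
    return max(1, atom_capacity // max_atoms_per_structure)


# Example from this thread: 10 GB GPU (~3000 atoms), structures of 20-100 atoms
print(safe_batch_size(3000, 100))  # -> 30
```

Sizing the batch by the largest structure is conservative, but it avoids an out-of-memory crash partway through a long prediction run.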
Hi @BowenD-UCB that makes sense, thank you for the advice!
One last question @BowenD-UCB - I find that when I increase the number of structures, the time required by `model.predict_structure` grows almost linearly. Is this expected?
For 10, 100, 200 structures I got 1.2 s, 7.35 s, and 12.01 s respectively.
@mwolinska Yes, this is expected, since the computation cost of an MLP like CHGNet scales linearly with the number of atoms.
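As a sanity check, a least-squares fit of the timings quoted above (10, 100, 200 structures taking 1.2 s, 7.35 s, 12.01 s) is indeed close to linear, with a slope of roughly 0.06 s per structure plus a fixed ~1 s overhead (a quick sketch, using only the numbers from this thread):

```python
import numpy as np

n_structs = np.array([10, 100, 200])
total_times = np.array([1.2, 7.35, 12.01])  # seconds, from the timings above

# Fit total_time ≈ slope * n_structs + intercept
slope, intercept = np.polyfit(n_structs, total_times, 1)
print(f"{slope:.3f} s/structure + {intercept:.2f} s fixed overhead")
```

The nonzero intercept (model load, graph-converter warm-up, etc.) explains why small runs look disproportionately slow, while the per-structure cost stays roughly constant.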
Okay, thank you!
I'm using `model.predict_structure()` on a list of structures, but I am seeing a smaller speedup (2x, versus a linear extrapolation from 1 structure) than I would expect from batched evaluation of a neural network. How did you benchmark to get the default batch size? What can I do to speed up evaluation for a large number of structures? I am using the fast graph converter.
Code to reproduce:
Thank you for your help.