CederGroupHub / chgnet

Pretrained universal neural network potential for charge-informed atomistic modeling https://chgnet.lbl.gov
https://doi.org/10.1038/s42256-023-00716-3

Limited speed boost when increasing `batch_size` in `model.predict_structure()` #56

Closed mwolinska closed 1 year ago

mwolinska commented 1 year ago

I'm using model.predict_structure() on a list of structures, but I'm seeing a smaller speed-up than I would expect from batched neural-network evaluation: only about 2x compared with a linear extrapolation of the single-structure time. How did you benchmark to choose the default batch size? And what can I do to speed up evaluation for a large number of structures?

I am using the fast graph converter.
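
For reference, this is roughly how I enable it (a minimal sketch; it assumes the `CrystalGraphConverter(algorithm="fast")` keyword from `chgnet.graph`, which may differ across chgnet versions):

    # Sketch: attach a fast graph converter to the loaded model.
    # Assumes chgnet.graph.CrystalGraphConverter accepts algorithm="fast";
    # check your installed chgnet version.
    from chgnet.graph import CrystalGraphConverter
    from chgnet.model import CHGNet

    model = CHGNet.load()
    model.graph_converter = CrystalGraphConverter(algorithm="fast")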

[plot: wall time of predict_structure vs. number of structures, one curve per batch-size fraction]

Code to reproduce:

    import copy
    import time

    from chgnet.model import CHGNet
    from pymatgen.ext.matproj import MPRester
    from pymatgen.io.ase import AseAtomsAdaptor

    model = CHGNet.load()
    with MPRester() as mpr:
        one_structure = mpr.get_structure_by_material_id("mp-1341203", final=True)

    # Rattle the reference structure slightly so the prediction is non-trivial
    atoms_for_ref = AseAtomsAdaptor.get_atoms(one_structure)
    atoms_for_ref.rattle(0.1)
    structure = AseAtomsAdaptor.get_structure(atoms_for_ref)

    batch_sizes_to_test = [0.05, 0.1, 0.2]  # batch size as a fraction of population size
    n_individuals_to_test = [10, 20, 50, 100, 200]

    all_timings = []
    for batch_size_percent in batch_sizes_to_test:
        timings_per_batch_size = []
        for population_size in n_individuals_to_test:
            batch_size = max(int(batch_size_percent * population_size), 1)
            structures = [copy.deepcopy(structure) for _ in range(population_size)]
            tic = time.time()
            model.predict_structure(structures, batch_size=batch_size)
            timings_per_batch_size.append(time.time() - tic)
        all_timings.append(timings_per_batch_size)

Thank you for your help.

BowenD-UCB commented 1 year ago

Hi,

This is a very interesting topic. Below is a speed benchmark of chgnet.predict_structure(). The main conclusion: as long as batch_size is larger than 2, the time per structure stays almost constant.

Since a realistic dataset contains structures of various sizes, some of which can exhaust GPU memory, my advice is to set batch_size to a safe value based on your GPU memory and your structure sizes.

import time

import matplotlib.pyplot as plt
from chgnet.model import CHGNet

from pymatgen.core import Structure

model = CHGNet.load()
model_gpu = CHGNet.load().to("cuda")
one_structure = Structure.from_file("./tmp.cif")
structure_list = [one_structure] * 100

batch_sizes = [1, 2, 10, 20, 50]
time_per_structure_cpu = []
time_per_structure_cuda = []

for batch_size in batch_sizes:
    print(batch_size)

    # time CPU inference per structure
    start = time.perf_counter()
    model.predict_structure(structure_list, batch_size=batch_size)
    t = (time.perf_counter() - start) / len(structure_list)
    time_per_structure_cpu.append(t)

    # time GPU inference per structure
    start = time.perf_counter()
    model_gpu.predict_structure(structure_list, batch_size=batch_size)
    t = (time.perf_counter() - start) / len(structure_list)
    time_per_structure_cuda.append(t)

fig = plt.figure(figsize=(6, 4))
ax = fig.add_subplot()
ax.set(title="CHGNet batch_size", xlabel="batch size", ylabel="time per structure (s)")

ax.plot(batch_sizes, time_per_structure_cpu, label="cpu")
ax.plot(batch_sizes, time_per_structure_cuda, label="cuda")
ax.legend()

plt.show()

[plot: time per structure vs. batch size, CPU and CUDA curves]

janosh commented 1 year ago

Running CHGNet on CPU locally on an M2 Max chip with variable-sized structures, I see different behavior:

[plot: chgnet_batch_size_vs_time_per_struct.svg, time per structure vs. batch size on M2 Max CPU]

import time
from random import randint

import pandas as pd
from chgnet.model import CHGNet

from pymatgen.core import Lattice, Structure

model = CHGNet.load()
# model_gpu = CHGNet.load().to("cuda")

struct = Structure(
    lattice=Lattice.cubic(3),
    species=("Fe", "Fe"),
    coords=((0, 0, 0), (0.5, 0.5, 0.5)),
)

# random supercells give a distribution of structure sizes
structure_list = [
    struct.make_supercell([randint(1, 3) for _ in range(3)], in_place=False)
    for _ in range(100)
]

pd.Series(map(len, structure_list)).hist(bins=100)  # inspect the size distribution

batch_sizes = [1, 2, 10, 20, 50]
time_per_structure_cpu = []
time_per_structure_cuda = []

for batch_size in batch_sizes:
    print(batch_size)
    start = time.perf_counter()
    model.predict_structure(structure_list, batch_size=batch_size)
    t_per_struct = (time.perf_counter() - start) / len(structure_list)
    time_per_structure_cpu.append(t_per_struct)

    # start = time.perf_counter()
    # model_gpu.predict_structure(structure_list, batch_size=batch_size)
    # t_per_struct = (time.perf_counter() - start) / len(structure_list)
    # time_per_structure_cuda.append(t_per_struct)

ax = pd.Series(time_per_structure_cpu, index=batch_sizes).plot()
ax.set(title="CHGNet batch_size", xlabel="batch size", ylabel="time per structure (s)")
ax.figure.savefig("chgnet_batch_size_vs_time_per_struct.svg")

BowenD-UCB commented 1 year ago

@janosh I have reproduced your result on an Apple M2, which also gives the lowest inference time at batch_size = 10. Here is the result of running your code on CUDA. It looks like the M2 has some different mechanism that causes this behavior. I think we should update our default predict_structure batch_size to 20?

[plot: time per structure vs. batch size on CUDA]

janosh commented 1 year ago

I think we should update our default predict_structure batch_size to 20?

Makes sense! Maybe 16.
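
Until the default changes, callers can always pass it explicitly:

    # Explicit batch size overrides whatever default predict_structure ships with.
    predictions = model.predict_structure(structure_list, batch_size=16)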

mwolinska commented 1 year ago

Hi both, thank you so much for your thoughts and quick replies! I did my benchmarking locally on CPU, as I didn't have access to a GPU; I will try this on GPU today.

When you say that some structure sizes can cause a memory explosion on the GPU, what kind of sizes do you have in mind?

BowenD-UCB commented 1 year ago

@mwolinska

Say you have 10 GB of GPU memory, which can hold roughly 3,000 atoms at once. If your dataset contains structures with 20 to 100 atoms, you should consider setting batch_size to at most 30 (3,000 atoms / 100 atoms in the largest structure). A good way to tune this is to monitor GPU usage with nvidia-smi while running.
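
For example (a rough sketch, not a chgnet API; the ~3,000-atom budget is just the figure from the example above):

    # Heuristic: pick a batch size from an atom budget and the largest structure.
    # atom_budget is whatever total atom count your GPU memory can hold at once.
    def safe_batch_size(structures, atom_budget=3000):
        max_atoms = max(len(s) for s in structures)  # sites in the largest structure
        return max(atom_budget // max_atoms, 1)

    # 20-100 atom structures with a 3000-atom budget -> batch_size = 30
    batch_size = safe_batch_size(structure_list)
    model.predict_structure(structure_list, batch_size=batch_size)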

mwolinska commented 1 year ago

Hi @BowenD-UCB that makes sense, thank you for the advice!

mwolinska commented 1 year ago

One last question @BowenD-UCB: I find that as I increase the number of structures, the time required by model.predict_structure grows almost linearly. Is this expected? For 10, 100, and 200 structures I got 1.2 s, 7.35 s, and 12.01 s respectively.

BowenD-UCB commented 1 year ago

@mwolinska Yes, this is expected: the computational cost of a machine-learning potential (MLP) like CHGNet scales linearly with the total number of atoms being evaluated.
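
You can verify this yourself by checking that the time per structure flattens out as the list grows, for example:

    # Sketch: total time should grow ~linearly with the number of structures,
    # so time per structure stays roughly constant once overhead is amortized.
    import time

    for n in (10, 100, 200):
        tic = time.perf_counter()
        model.predict_structure([one_structure] * n, batch_size=16)
        total = time.perf_counter() - tic
        print(f"{n:>4} structures: {total:6.2f} s total, {total / n * 1e3:5.1f} ms per structure")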

mwolinska commented 1 year ago

Okay, thank you!