aiqm / torchani

Accurate Neural Network Potential on PyTorch
https://aiqm.github.io/torchani/
MIT License

Can't Allocate Memory Issue #604

Open bartdemooij opened 2 years ago

bartdemooij commented 2 years ago

Dear,

What would be the best way to perform high-performance molecular dynamics with ANI on a cluster? We run TorchANI in combination with ASE. Currently, running a box of 1000 ethanol molecules gives the following error during the BFGS optimisation:

warnings.warn(
Traceback (most recent call last):
  File "/home/bmooij/ANI_quality_check/MD_ethanol_quality_check_ANI.py", line 49, in <module>
    opt.run(fmax=1.0)
  File "/home/bmooij/.conda/envs/py9/lib/python3.9/site-packages/ase/optimize/optimize.py", line 269, in run
    return Dynamics.run(self)
  File "/home/bmooij/.conda/envs/py9/lib/python3.9/site-packages/ase/optimize/optimize.py", line 156, in run
    for converged in Dynamics.irun(self):
  File "/home/bmooij/.conda/envs/py9/lib/python3.9/site-packages/ase/optimize/optimize.py", line 122, in irun
    self.atoms.get_forces()
  File "/home/bmooij/.conda/envs/py9/lib/python3.9/site-packages/ase/atoms.py", line 788, in get_forces
    forces = self._calc.get_forces(self)
  File "/home/bmooij/.conda/envs/py9/lib/python3.9/site-packages/ase/calculators/abc.py", line 23, in get_forces
    return self.get_property('forces', atoms)
  File "/home/bmooij/.conda/envs/py9/lib/python3.9/site-packages/ase/calculators/calculator.py", line 737, in get_property
    self.calculate(atoms, [name], system_changes)
  File "/home/bmooij/.conda/envs/py9/lib/python3.9/site-packages/torchani/ase.py", line 82, in calculate
    energy = self.model((species, coordinates), cell=cell, pbc=pbc).energies
  File "/home/bmooij/.conda/envs/py9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/bmooij/.conda/envs/py9/lib/python3.9/site-packages/torchani/models.py", line 106, in forward
    species_aevs = self.aev_computer(species_coordinates, cell=cell, pbc=pbc)
  File "/home/bmooij/.conda/envs/py9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/bmooij/.conda/envs/py9/lib/python3.9/site-packages/torchani/aev.py", line 533, in forward
    aev = compute_aev(species, coordinates, self.triu_index, self.constants(), self.sizes, (cell, shifts))
  File "/home/bmooij/.conda/envs/py9/lib/python3.9/site-packages/torchani/aev.py", line 288, in compute_aev
    atom_index12, shifts = neighbor_pairs(species == -1, coordinates_, cell, shifts, Rcr)
  File "/home/bmooij/.conda/envs/py9/lib/python3.9/site-packages/torchani/aev.py", line 171, in neighbor_pairs
    shifts_all = torch.cat([shifts_center, shifts_outside])
RuntimeError: [enforce fail at CPUAllocator.cpp:71] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 26243892000 bytes. Error code 12 (Cannot allocate memory)

It works fine if the box has few molecules in it (e.g. 125 ethanol), but starts to give this error for larger systems (e.g. 750 or 1000 ethanol). A system of 500 ethanol also seems to work, but is terribly slow.

Some reproducible code (where file 'ethanol_1000.pdb' is a box of 1000 ethanol made with packmol):

from ase import Atoms
from ase.optimize import BFGS
import torch
import torchani
from Bio import PDB

# Load the box of ethanol (atom order in the PDB must match the formula below)
parser = PDB.PDBParser()
struct = parser.get_structure('ethanol_1000', 'ethanol_1000.pdb')
pos = []
for model in struct:
    for chain in model:
        for residue in chain:
            for atom in residue:
                x,y,z = atom.get_coord()
                pos.append([x,y,z])
ethanol = Atoms(1000*"COH3CH3", positions=pos)
ethanol.set_cell((46, 46, 46))
ethanol.set_pbc(True)

# Set up the calculator (assigning to .calc replaces the deprecated set_calculator)
calculator = torchani.models.ANI1ccx().ase()
ethanol.calc = calculator

#Minimize the structure
print("Begin minimizing...")
opt = BFGS(ethanol)
opt.run(fmax=1.0)
print()

Best regards,

Bart

isayev commented 2 years ago

Dear @bartdemooij, thanks for reporting. It seems your system is too large to fit into your GPU memory. At this point, we have implemented only direct algorithms, whose memory use is bound by the matrix sizes. Therefore you have two choices: i) reduce the system size, or ii) get a GPU with more memory.

We are working on simulation plugins for LAMMPS and AMBER. With domain decomposition you would be able to run much larger systems in a distributed fashion.

bartdemooij commented 2 years ago

Dear @isayev, thanks for the swift reply. We ran this system on a CPU with 64 GB of RAM. Would you say it is to be expected that all of this gets used up by 1000 ethanol molecules (9000 atoms)? Perhaps this is a trivial question, but how does memory usage scale with system size? We are now looking into performing simulations with TorchANI using openmm-ml. Do you think memory usage is friendlier there?

isayev commented 2 years ago

Since your code invokes a CUDA memory error, I would assume you need to check your run script and check its correctness. It seems to be still running on a GPU. Typical suspects are CUDA_VISIBLE_DEVICES variable and torch.device definition in your code.

zubatyuk commented 2 years ago

Bart: Memory currently scales as O(N^2), since the TorchANI code calculates an NxN distance matrix to find neighbors. In the case of PBC, the code builds extra images (in your case of a cubic cell, it would be 18 cells) to find all neighbors. This is the stage where you run out of memory.
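
The quadratic scaling can be checked against the number in the traceback with a back-of-the-envelope estimate. The sketch below is not TorchANI code: the helper name `pair_buffer_bytes` is hypothetical, and it assumes that the failing buffer holds three 8-byte integers per candidate pair, with N(N-1)/2 pairs in the central cell plus N^2 pairs for each of 13 shifted images. These assumptions are not stated in the thread; they were chosen because they reproduce the 26243892000 bytes reported for 9000 atoms.

```python
def pair_buffer_bytes(num_atoms, num_shifted_images=13, bytes_per_component=8):
    """Estimate the size of a (num_pairs, 3) integer buffer over all candidate pairs.

    Assumptions (not confirmed in the thread): 13 shifted images and
    three 8-byte components per pair.
    """
    # Unique pairs inside the central cell: N * (N - 1) / 2
    center_pairs = num_atoms * (num_atoms - 1) // 2
    # Every atom paired with every atom, once per shifted image: images * N^2
    shifted_pairs = num_shifted_images * num_atoms * num_atoms
    return (center_pairs + shifted_pairs) * 3 * bytes_per_component


# Box sizes mentioned in the thread; ethanol has 9 atoms per molecule.
for molecules in (125, 500, 750, 1000):
    atoms = 9 * molecules
    gigabytes = pair_buffer_bytes(atoms) / 1e9
    print(f"{molecules:5d} molecules, {atoms:5d} atoms: {gigabytes:6.2f} GB")
```

Under these assumptions a single buffer for the 1000-molecule box already needs about 26 GB, and other temporaries of similar shape are alive at the same time, which is consistent with 64 GB of RAM being exhausted. Halving the number of atoms cuts the estimate by roughly a factor of four.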

bartdemooij commented 2 years ago

Thank you, the memory error is clear.

@zubatyuk The only thing I don't understand is how you get 18 cells for PBC in three dimensions, as I thought it would be 26. Am I right that you get 18 from 3x3x3 - 1 (original) - 8 (the corner boxes) = 18? If this is the case, why are you allowed to omit the corner boxes? If not, how do you get 18 images?

zubatyuk commented 2 years ago

Sorry, it was clearly my mistake. 3x3x3 is 27 indeed.
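
The image counts discussed here can be enumerated directly. This is a minimal sketch in plain Python, assuming one layer of periodic images in each direction (which is what a 3x3x3 block implies); it does not use TorchANI. The last count illustrates that each shift has a mirror partner, so a pair search only has to visit half of the 26 neighboring cells. Whether TorchANI exploits that symmetry is not stated in the thread.

```python
from itertools import product

# All integer shifts of a 3x3x3 block of cells, including the central cell.
shifts = list(product((-1, 0, 1), repeat=3))

# Neighboring images: everything except the central cell (0, 0, 0).
neighbours = [s for s in shifts if s != (0, 0, 0)]

# Corner cells are shifted along all three axes at once.
corners = [s for s in neighbours if all(c != 0 for c in s)]

# Exactly one of s and -s compares greater than (0, 0, 0) lexicographically,
# so this keeps one shift from each mirror pair.
half = [s for s in neighbours if s > (0, 0, 0)]

print(len(shifts))                     # 27 cells in total
print(len(neighbours))                 # 26 neighboring images
print(len(corners))                    # 8 corner cells
print(len(neighbours) - len(corners))  # 18 face and edge cells
print(len(half))                       # 13 images after removing mirror pairs
```

The corner cells cannot be dropped in general: an atom near a corner of the box has neighbors in the diagonally adjacent image, so all 26 neighboring cells (or 13 with the symmetry) are needed.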
