CederGroupHub / chgnet

Pretrained universal neural network potential for charge-informed atomistic modeling https://chgnet.lbl.gov
https://doi.org/10.1038/s42256-023-00716-3

[Bug]: MPS out of memory error when using cpu device #181

Closed ElliottKasoar closed 1 month ago

ElliottKasoar commented 1 month ago

Version

v0.3.8

What happened?

I am attempting to run calculations, such as single-point energies, using the CHGNetCalculator ASE calculator on macOS via GitHub Actions: https://github.com/stfc/janus-core/actions/runs/9943461072/job/27469030924?pr=214

However, I get a RuntimeError even though I request the "cpu" device. I believe this is because CHGNet.load is not passed device / use_device (https://github.com/CederGroupHub/chgnet/blob/main/chgnet/model/dynamics.py#L92).

As a result, determine_device inside CHGNet.load picks the device on its own and first attempts to move the model to the "mps" device, before CHGNetCalculator moves it to the requested device. That first move fails because no MPS memory has been allocated.

Two potential solutions are to check the availability of MPS memory before selecting "mps", or, preferably, to pass the device to CHGNet.load, which also avoids an unnecessary transfer between devices.
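To illustrate the second suggestion, here is a minimal, library-free sketch of the pattern. It does not use chgnet or torch: pick_device, ToyModel, and load_model are hypothetical stand-ins written for this example, and the availability flags are passed in by hand rather than detected. The real determine_device and CHGNet.load may behave differently.

```python
from typing import List, Optional


def pick_device(requested: Optional[str], mps_available: bool,
                cuda_available: bool = False) -> str:
    """Hypothetical device picker (NOT chgnet's determine_device).

    An explicit request wins; otherwise fall back to autodetection.
    """
    if requested is not None:
        return requested
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


class ToyModel:
    """Toy model that records every device it is moved to."""

    def __init__(self) -> None:
        self.visited: List[str] = []

    def to(self, device: str) -> "ToyModel":
        self.visited.append(device)
        return self


def load_model(use_device: Optional[str] = None, *,
               mps_available: bool) -> ToyModel:
    """Hypothetical loader: places the model on the picked device."""
    device = pick_device(use_device, mps_available)
    return ToyModel().to(device)


# Pattern described in this issue: the loader is not told the requested
# device, so it autodetects "mps" first and the caller moves to "cpu" later.
unforwarded = load_model(mps_available=True).to("cpu")

# Suggested pattern: forward the requested device to the loader.
forwarded = load_model(use_device="cpu", mps_available=True)

print(unforwarded.visited)
print(forwarded.visited)
```

In the first case the model passes through "mps" before reaching "cpu", which is the step that raises on the GitHub Actions runner. Forwarding the device means "mps" is never touched.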

Code snippet

from chgnet.model.dynamics import CHGNetCalculator
CHGNetCalculator(use_device="cpu")

Log output

tests/test_mlip_calculators.py:26: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
janus_core/helpers/mlip_calculators.py:104: in choose_calculator
    calculator = CHGNetCalculator(use_device=device, **kwargs)
../../../Library/Caches/pypoetry/virtualenvs/janus-core-2KE8lRKs-py3.11/lib/python3.11/site-packages/chgnet/model/dynamics.py:91: in __init__
    self.model = (model or CHGNet.load(verbose=False)).to(self.device)
../../../Library/Caches/pypoetry/virtualenvs/janus-core-2KE8lRKs-py3.11/lib/python3.11/site-packages/chgnet/model/model.py:722: in load
    model = model.to(device)
../../../Library/Caches/pypoetry/virtualenvs/janus-core-2KE8lRKs-py3.11/lib/python3.11/site-packages/torch/nn/modules/module.py:1152: in to
    return self._apply(convert)
../../../Library/Caches/pypoetry/virtualenvs/janus-core-2KE8lRKs-py3.11/lib/python3.11/site-packages/torch/nn/modules/module.py:802: in _apply
    module._apply(fn)
../../../Library/Caches/pypoetry/virtualenvs/janus-core-2KE8lRKs-py3.11/lib/python3.11/site-packages/torch/nn/modules/module.py:802: in _apply
    module._apply(fn)
../../../Library/Caches/pypoetry/virtualenvs/janus-core-2KE8lRKs-py3.11/lib/python3.11/site-packages/torch/nn/modules/module.py:825: in _apply
    param_applied = fn(param)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

t = Parameter containing:
tensor([[ -3.4431,  -0.1279,  -2.8300,  -3.4737,  -7.4946,  -8.2354,  -8.1611,
          -8.3861...          -0.3448,  -0.4364,  -0.1661,  -0.3680,  -4.1869,  -8.4233, -10.0467,
         -12.0953, -12.5228, -14.2530]])

    def convert(t):
        if convert_to_format is not None and t.dim() in (4, 5):
            return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
                        non_blocking, memory_format=convert_to_format)
>       return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
E       RuntimeError: MPS backend out of memory (MPS allocated: 0 bytes, other allocations: 0 bytes, max allowed: 7.93 GB). Tried to allocate 512 bytes on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).

BowenD-UCB commented 1 month ago

Hi, thanks for the report! This should be fixed in 6f7b035e5c8385040896298603b1df9b93793d6a