choderalab / espaloma

Extensible Surrogate Potential of Ab initio Learned and Optimized by Message-passing Algorithm 🍹https://arxiv.org/abs/2010.01196
https://docs.espaloma.org/en/latest/
MIT License

Intractable graph size scaling for large proteins #217

Open tristanic opened 3 weeks ago

tristanic commented 3 weeks ago

Hi,

As I understand it, one of the end goals of ESPALOMA is to be able to parameterise an entire system without the need for individual residue templates (what got me excited about it in the first place - the promise of straightforward handling of new covalent modifications to protein or DNA residues is particularly alluring). Unfortunately, it looks like that won't be possible with the current implementation for any but the smallest proteins, due to how the size of the heterograph scales with the number of atoms. Reading in a protein from PDB with:

import espaloma as esp
from openff.toolkit import Topology

top = Topology.from_pdb('protein.pdb')
mol = top.molecule(0)
mgraph = esp.Graph(mol)

My first attempt (with a ~700-residue protein) was killed by the Linux OOM killer after chewing through more than 22 GB of system RAM. Trying a series of smaller poly-A models (replication files attached):

import espaloma as esp
from openff.toolkit import Topology

with open('graph_size.csv', 'wt') as out:
    print('Residues,Atoms,Nodes,Edges', file=out)
    for res_count in (5,10,20,30,40,50,75,100):
        top = Topology.from_pdb(f'ala{res_count}.pdb')
        mol = top.molecule(0)
        mgraph = esp.Graph(mol)
        het = mgraph.heterograph
        print(f'{res_count},{mol.n_atoms},{het.number_of_nodes()},{het.number_of_edges()}', file=out)

... shows the node and edge counts both scaling as O(n**2) - a 1,003-atom model gives a heterograph with just over a million nodes and 6 million edges. This seems excessive to me, but I don't yet understand enough about the architecture to know the reasons for it. Extrapolating, a (still reasonably sized) 10k-atom protein would yield a graph with about 100 million nodes and 600 million edges.
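For what it's worth, the extrapolation can be made concrete with a least-squares quadratic fit to the measured counts. A minimal sketch, using illustrative numbers that follow the reported ~n**2 trend (not the actual CSV output from the script above):

```python
import numpy as np

# Illustrative (atoms, heterograph nodes) points following ~n**2 growth;
# substitute the real values from graph_size.csv.
atoms = np.array([103.0, 203.0, 403.0, 1003.0])
nodes = np.array([1.1e4, 4.3e4, 1.7e5, 1.05e6])

# Least-squares quadratic fit: nodes ~= a*n**2 + b*n + c
a, b, c = np.polyfit(atoms, nodes, deg=2)

def predict(n):
    return a * n**2 + b * n + c

print(f"predicted nodes for 10,000 atoms: {predict(10_000):.2e}")
```

With counts that grow quadratically at roughly one node per atom pair, the prediction for a 10k-atom protein lands around 10**8 nodes, consistent with the extrapolation above.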

[attached plot: espaloma_graph_size - node and edge counts vs. atom count]

Can you shed some light on what's going on, and do you have any ideas on how to improve on this?
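One hedged guess at the source of the growth, offered only as back-of-envelope arithmetic: if the heterograph enumerates every atom pair somewhere (e.g. for nonbonded terms), counts necessarily grow as C(n, 2) ~ n**2 / 2, which is the same order as the numbers measured above:

```python
from math import comb

# Number of unordered atom pairs for the two system sizes discussed above.
# C(1003, 2) is ~5e5, the same order as the ~1e6 nodes observed for the
# 1,003-atom model; C(10000, 2) is ~5e7.
for n_atoms in (1_003, 10_000):
    print(n_atoms, comb(n_atoms, 2))
```

Whether pair enumeration is actually what dominates here is an assumption; it would need checking against the graph-construction code.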

espaloma_polya.tar.gz

diogomart commented 3 weeks ago

We have development code that represents each residue as an individual RDKit molecule, like a "chorizo" in which each residue is one of the links. Each residue carries a few extra atoms to model the chemistry of the adjacent residues, as well as lists of atom indices to keep track of what's "real" and what's padding. It would be a bit of work to generalize, but as it stands we can at least get espaloma charges for entire proteins and nucleic acids.
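A minimal sketch of that bookkeeping idea, with hypothetical names (this is not the actual development code): each fragment stores the global indices of its atoms, records which local indices are "real", and copies only the real atoms' parameters back into the full system:

```python
from dataclasses import dataclass

@dataclass
class ResidueFragment:
    """One 'link' of the chorizo: a residue plus padding atoms from its
    neighbours, with a map from local to global atom indices."""
    global_indices: list  # global index of every atom in the fragment
    real: list            # local indices that belong to this residue

    def copy_back(self, local_params, global_params):
        # Keep parameters only for the real atoms; discard the padding.
        for local_idx in self.real:
            global_params[self.global_indices[local_idx]] = local_params[local_idx]

# Toy example: a 6-atom "polymer" split into two 4-atom fragments, each
# carrying one padding atom from the adjacent residue.
frag_a = ResidueFragment(global_indices=[0, 1, 2, 3], real=[0, 1, 2])
frag_b = ResidueFragment(global_indices=[2, 3, 4, 5], real=[1, 2, 3])

global_charges = [None] * 6
frag_a.copy_back([0.1, 0.2, 0.3, 9.9], global_charges)  # 9.9 is padding, dropped
frag_b.copy_back([9.9, 0.4, 0.5, 0.6], global_charges)
print(global_charges)
```

Each fragment stays small regardless of total system size, so per-fragment graph construction avoids the global quadratic blow-up at the cost of approximating cross-residue chemistry with the padding atoms.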

tristanic commented 3 weeks ago

That's one sensible solution, and it may in fact be preferable in an interactive environment like the one where I want to apply it, since it would cut the cost of reparameterising after modifications to a large protein: just redo the affected region rather than the whole thing. But I'm more curious about whether the scaling can be improved in the first place. Without digging deeply into the code, a naive interpretation would suggest that each individual atom is getting its own all-atom subgraph. That doesn't feel right to me, but it's entirely possible I'm missing something fundamental.

ijpulidos commented 2 weeks ago

@tristanic I can reproduce your results, and indeed it's creating pretty big graphs. I checked the code that's consuming most of the memory, and it boils down to this line: https://github.com/choderalab/espaloma/blob/cb8e5b23e3ec1ada356128debc6a2a5511ef0b98/espaloma/graphs/utils/read_heterogeneous_graph.py#L272. Unfortunately, I don't see how we could make that line consume significantly less memory, especially with the restrictions that DGL already imposes.

In private communications with @yuanqing-wang, I believe he has proposed ways to modify the architecture to make it more memory-efficient, but I don't think that's a quick fix or something to do right now.

tristanic commented 6 days ago

@ijpulidos thanks for the feedback. We'll have a think about what to do next.
