Universe.get_Molecules() method

Foly93 commented 1 year ago

Hello everybody,

I am working with amber and I wanted to create a reduced system, partially with the help of MDAnalysis. The problem that I ran into, is that ag.write('test.pdb') would not identify the termini of the molecules correctly which led to problems later on. Ofc, one could manually pick the molecules via u.select_atoms('..'), however, that is not very robust for various systems. Therefore, imo a get_Molecules() method would make sense, that identifies molecules in the system. The idea is to assign the atoms of these molecules to individual chains that would then also be printed to the PDB file. It was shown, that the initial problem got solved that way.

I came up with some lines to take care of that. However, for my system (50000 atoms) it takes about 11 minutes, which is somewhat lenghty.

molecules = []
bonds = np.array([bond.atoms.ids for bond in u.bonds])

for atom in tqdm(u.atoms.ids):
    if atom not in list(itertools.chain.from_iterable(molecules)):
        molecule = [atom]

        for bond in bonds:
            if len(list(set(molecule) & set(bond))):
                molecule.extend(list(bond))

        molecules.append(list(set(molecule)))
        for atom in set(molecule):
            bonds = bonds[~(bonds==atom).any(axis=(1))]

but it works.

An even better alternative would be a function called guess_chains(), that would assign water (n_atoms > 1 and n_atoms < 10) and ions (n_atoms == 1) chain X and macromolecules or ligands chain A,B,C etc. Thanks for reading and thanks for any help.

Cheers!

lilyminium commented 1 year ago

Hi @Foly93, thanks for the suggestion. I think historically the trajectories we read fall under two categories:

there are bonds included, for which we can use u.atoms.fragments to get individually bonded molecules
there are not bonds included, so we need to start using atom distances and vdw radii to guess bonds (not very robust, but we do provide a method for it -- u.atoms.guess_bonds())

Either way, once you do have bonds (and it sounds like your universe does) -- you could use .fragments to get molecules, I think. This topology attribute is computed by associating molecules together by bonds. e.g.

>>> import MDAnalysis as mda
>>> from MDAnalysis.tests.datafiles import TPR, XTC

>>> u = mda.Universe(TPR, XTC)  # protein and solvent molecules
>>> u.atoms.fragments
(<AtomGroup with 3341 atoms>,
 <AtomGroup with 3 atoms>,
 <AtomGroup with 1 atom>,
 <AtomGroup with 3 atoms>,
 <AtomGroup with 1 atom>,
 <AtomGroup with 3 atoms>,
 <AtomGroup with 1 atom>,
...
)

I think you could use the number of atoms here to guess which molecules are solvent vs ions.

An even better alternative would be a function called guess_chains(), that would assign water (n_atoms > 1 and n_atoms < 10) and ions (n_atoms == 1) chain X and macromolecules or ligands chain A,B,C etc.

I can see how that could be very valuable for trajectories within the biological domain, but it sounds like it makes many assumptions that would make it difficult to generalise outside of it. If .fragments is sufficiently convenient, it may be easier to work with that and assign solvent/ions/ligands etc individually for each system.

orbeckst commented 1 year ago

@Foly93 were you able to use fragments?

Foly93 commented 1 year ago

Hi @orbeckst, yes it worked! thank you very much.

orbeckst commented 1 year ago

Thanks for the feeback @Foly93 ! Given that fragments works, I am closing this issue. Please feel free to request it to be re-opened if necessary.

MDAnalysis / mdanalysis

Universe.get_Molecules() method #4222