MDAnalysis / mdanalysis

MDAnalysis is a Python library to analyze molecular dynamics simulations.
https://mdanalysis.org
Other
1.29k stars 646 forks source link

Warn users when unrecognised atom numbers are found in TPR files (leading to unassigned elements) #3502

Open haroldgrosjean opened 2 years ago

haroldgrosjean commented 2 years ago

Expected behavior

I would like to load a trajectory using gmx tpr and xtc files to later transform the ligand into an rdkit object.

Actual behavior

When I extract the ligand atom group and proceed with the transformation the following error is returned:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_6674/3347617922.py in <module>
      2 u.add_TopologyAttr('elements', u.atoms.types)
      3 lig = u.atoms.select_atoms("resname MOL")
----> 4 lig.convert_to("RDKIT")

~/anaconda3/envs/autochem/lib/python3.7/site-packages/MDAnalysis/core/accessors.py in __call__(self, package, *args, **kwargs)
    202             raise ValueError(f"No {package!r} converter found. Available: "
    203                              f"{' '.join(self._CONVERTERS.keys())}") from None
--> 204         return convert(*args, **kwargs)

~/anaconda3/envs/autochem/lib/python3.7/site-packages/MDAnalysis/converters/RDKit.py in convert(self, obj, cache, NoImplicit, max_iter, force)
    312         kwargs = dict(NoImplicit=NoImplicit, max_iter=max_iter, force=force)
    313         if cache:
--> 314             mol = atomgroup_to_mol(ag, **kwargs)
    315             mol = copy.deepcopy(mol)
    316         else:

~/anaconda3/envs/autochem/lib/python3.7/site-packages/MDAnalysis/converters/RDKit.py in atomgroup_to_mol(ag, NoImplicit, max_iter, force)
    372         elif NoImplicit:
    373             raise AttributeError(
--> 374                 "No hydrogen atom could be found in the topology, but the "
    375                 "converter requires all hydrogens to be explicit. You can use "
    376                 "the parameter ``NoImplicit=False`` when using the converter "

AttributeError: No hydrogen atom could be found in the topology, but the converter requires all hydrogens to be explicit. You can use the parameter ``NoImplicit=False`` when using the converter to allow implicit hydrogens and disable inferring bond orders and charges. You can also use ``force=True`` to ignore this error.

I have processed ligands in that fashion before and without problems so I am a little bit unsure how to resolve the situation. I thank you very much in advance for your assistance.

Code to reproduce the behavior

prod.tpr.txt


tpr = './prod.tpr.txt'
u = mda.Universe(tpr)
lig = u.select_atoms("resname MOL")
lig_rdk = lig.convert_to("RDKIT")

Current version of MDAnalysis

lilyminium commented 2 years ago

Hmm, that's interesting. Not sure how you put the system together, but I'm guessing that the elements of the not-ligand are adequately guessed from the atomic number field, and the ligand either has none or invalid numbers given:

>>> lig.atoms.elements
array(['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
       '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
       ''], dtype=object)

You could probably get around this by just guessing elements from the names:

>>> lig.atoms.elements = [re.sub("[0-9]+", "", name) for name in lig.atoms.names]
>>> lig.atoms.elements
array(['C', 'O', 'N', 'C', 'C', 'N', 'C', 'O', 'N', 'C', 'C', 'C', 'C',
       'N', 'C', 'O', 'C', 'O', 'C', 'C', 'C', 'H', 'H', 'H', 'H', 'H',
       'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H'], dtype=object)
>>> lig_rdk = lig.convert_to("RDKIT")
>>> lig_rdk.GetNumAtoms()
35
haroldgrosjean commented 2 years ago

That worked well for the few ligands I've tried so fare, thanks!

IAlibay commented 2 years ago

Do we need to consider maybe raising a user warning when elements aren't found? Other topology parsers do this but TPR doesn't?

lilyminium commented 2 years ago

They are found here, they're just not valid (''). Possibly from the gromacs atomtype atomic number, which might be set to 0. Notably, the protein elements are fine.

IAlibay commented 2 years ago

They are found here, they're just not valid (''). Possibly from the gromacs atomtype atomic number, which might be set to 0. Notably, the protein elements are fine.

not found == unrecognisable I really mean both cases here