MDAnalysis / mdanalysis

MDAnalysis is a Python library to analyze molecular dynamics simulations.
https://mdanalysis.org
Other
1.28k stars 642 forks source link

Discuss/document bond duplicate behaviour #3818

Open hmacdope opened 1 year ago

hmacdope commented 1 year ago

In the course of moving forward on #3777 the question of what to do with duplicate bond entries has come up.

The key problem is that this is a valid input for adding bonds:

u.add_bonds([[0,1], [1,0]], types=[1, 2], order=['single', 'double'])

Where you can have a bond of the same index but different auxiliary values (types, orders, guesses). Only the first bond entry is stored as far as I can tell in the process of de-duplication when the TopologyGroup is constructed.

@richardjgowers and I discussed raising an error on duplicate entry in one of our meetings but this would be an API break from the current behaviour which is to remove duplicate entries preserving the first entry. This behaviour has a test core/test_universe.py::TestAddTopologyObjects::test_add_reversed_duplicates and about 50 other tests fail if you enforce duplication==error.

Currently the approach taken in #3777 matches this behaviour but as far as I can see there is not clear documentation of this behaviour (which entry will be removed).

The docs currently have the rather cryptic "Only one of the two permutations is stored, typically the one with the lower atom number first." in the developer notes and the following in the TopologyGroup docstring "Duplicate entries are automatically removed upon creation and combination of different Dicts. This means a bond between atoms 1 and 2 will only ever appear once in a dict despite both atoms 1 and 2 having the bond in their bond attribute." At least that was all that I could see about duplicate bonds.

My main points/questions are.

My instinct is to move forward with the current system to avoid and API break and potentially change behaviour if desired in 3.0?

Code to reproduce the behavior

Minimal example

import MDAnalysis as mda
from MDAnalysisTests import make_Universe
u = make_Universe()
u.add_bonds([[0, 1], [1, 0]], ['0', '1'], [False, True], ['do', 're'])
tg = u.bonds
tg._bondtypes # equal to 0
tg._guessed # equal to False
tg._order # equal to 'do'

# ... the same can be done with equivalent indices eg [[0,1],[0,1]]
# ... or with the ordering reversed [[1,0], [0,1]]
orbeckst commented 1 year ago

I would have thought that adding a duplicate bond should raise an exception.

(I don’t understand the rationale for the current behavior so I would consider changing to raising an error as a fix. Perhaps someone can explain why the current behavior is better.)

IAlibay commented 1 year ago

I know that at least Maestro (maybe also pymol) will let you add two connect entries if you have a double bond in a PDB file, raising an exception would probably kill off reading those types of files? edit: we could also pass that responsibility down to the reader itself tbh

orbeckst commented 1 year ago

That's a reader issue — it should create one bond with the correct bond order.

hmacdope commented 1 year ago

I agree with what we have discussed above that the reader should be responsible for inputting the correct bond orders.

To change this behaviour we would need to audit the readers for duplicate behaviour and then enforce in each reader which would be a rather large breaking change?

I am thinking that the Universe representation of some files (ESP pdb files which seemed to fail often my tests) will change. With this in mind this likely needs to be 3.0?

IAlibay commented 1 year ago

Honestly I'd probably be ok with this being pre-3.0, it feels more like a bug / undefined behaviour thing than an API break (unless there's some specific reason for its implementation - was there anything specific from git blame)?

hmacdope commented 1 year ago

Nothing specific from git-blame AFAIK. It will take a bit of time to audit all the readers that add bonds so watch this space.