biotite-dev / biotite

A comprehensive library for computational molecular biology
https://www.biotite-python.org
BSD 3-Clause "New" or "Revised" License
581 stars 92 forks source link

Determinism when resolving atom alternative locations #568

Open nscorley opened 1 month ago

nscorley commented 1 month ago

Problem description:

In many mmCIF files, atoms exist equally in two locations, each with an occupancy of 0.5. For example, this screenshot from PDB ID 1adl: image

Currently, when calling get_structure() with altloc=occupancy, there's some instability present in which altloc is chosen for a given residue in such situations. Sometimes the A conformation is chosen; sometimes the B conformation. From what I can tell, the lack of determinism may arise from the use of a set to store altlocs within the filter_highest_occupancy_altloc function here.

Although either choice is theoretically correct, such instability makes results much more challenging to debug.

Proposed solution:

Change the implementation of filter_highest_occupancy_altloc to be deterministic when two altlocs have equal occupancy; e.g., use a sorted list() rather than a set. Given the length of typical altloc sets, there will be minimal to no performance impact.

padix-key commented 1 month ago

Hi, thanks for pinpointing the problem. I think your analysis is correct. If I remember correctly, the set ensures that each altloc ID only appears once, but there is no reason to not enveloping the set with sorted().

Would you like to create a fix?

nscorley commented 1 month ago

Hi,

Yes, we definitely need the set() since we're storing one altloc ID per atom, and thus have duplicates within a residue. Enveloping with sorted() should be a quick fix. I'll make a PR shortly! Thanks for your quick response!