Open nscorley opened 1 month ago
Hi, thanks for pinpointing the problem. I think your analysis is correct. If I remember correctly, the set ensures that each altloc ID only appears once, but there is no reason not to wrap the set in `sorted()`.
Would you like to create a fix?
Hi,
Yes, we definitely need the `set()` since we're storing one altloc ID per atom, and thus have duplicates within a residue. Enveloping with `sorted()` should be a quick fix. I'll make a PR shortly! Thanks for your quick response!
Problem description:
In many mmCIF files, atoms exist equally in two locations, each with an occupancy of 0.5. For example, this screenshot from PDB ID `1adl`:

![image](https://github.com/biotite-dev/biotite/assets/8575391/ea0f24da-473f-4b62-9c9d-0ca0730d6789)

Currently, when calling `get_structure()` with `altloc=occupancy`, there's some instability present in which `altloc` is chosen for a given residue in such situations. Sometimes the `A` conformation is chosen; sometimes the `B` conformation. From what I can tell, the lack of determinism may arise from the use of a `set` to store `altlocs` within the `filter_highest_occupancy_altloc` function here. Although either choice is theoretically correct, such instability makes results much more challenging to debug.
Proposed solution:
Change the implementation of `filter_highest_occupancy_altloc` to be deterministic when two `altlocs` have equal occupancy; e.g., use a sorted `list()` rather than a set. Given the length of typical `altloc` sets, there will be minimal to no performance impact.
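
To illustrate the idea, here is a minimal, self-contained sketch of deterministic tie-breaking. It is not Biotite's actual implementation: `pick_altloc` and its arguments are hypothetical names, and plain Python lists stand in for the per-atom annotation arrays. The only point is that iterating over `sorted(set(...))` instead of a bare `set` fixes the order in which candidates are compared, so a tie always resolves to the alphabetically first ID. Iteration order of a set of strings can differ between interpreter runs because Python randomizes string hashes per process by default.

```python
def pick_altloc(altloc_ids, occupancies):
    """Return the altloc ID with the highest summed occupancy.

    Hypothetical helper for illustration only. Ties are resolved in
    favor of the alphabetically first ID, because candidates are
    visited in sorted order and only a strictly higher total replaces
    the current best.
    """
    best_id = None
    best_total = -1.0
    # set() removes the per-atom duplicates; sorted() fixes the order
    for altloc in sorted(set(altloc_ids)):
        total = sum(
            occ for a, occ in zip(altloc_ids, occupancies) if a == altloc
        )
        if total > best_total:
            best_id, best_total = altloc, total
    return best_id


# Two conformations with equal occupancy, one altloc ID per atom
ids = ["B", "B", "A", "A"]
occ = [0.5, 0.5, 0.5, 0.5]
chosen = pick_altloc(ids, occ)  # "A" on every run
```

Because the loop only replaces the current best on a strictly greater total, the first ID in sorted order wins every tie, regardless of the order in which atoms appear in the input.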