Open tzok opened 2 years ago
I agree, that would be best to improve. I have to say, I have never used mmCIF files a lot. And I was a bit lost in the documentation on all these different constructs and how they relate to the original PDB. So I am grateful you point these things out.
So the following tuples should both uniquely identify an Atom right?
(label_seq_id)
(auth_seq_id, pdbx_PDB_ins_code)
If that is the case, I could think about supporting both for things like binary_find_atom.
Also, do you have any feelings about how to handle the missing label_seq_id on data from PDBs? It could be done by just creating a monotonically increasing counter on all Atoms.
Another point: is the auth_asym_id as 'mis'used as the auth_seq_id? If that is the case, we may need to support both as well.
I also have some problems, from an API design point of view, with the label_seq: because it is defined to start at 1 and increase monotonically, any change in the structure has to be reflected by updating the number. So essentially it is the index in the pdb.atoms() iterator plus one. I do not think that this is a very useful number to keep in the Atom struct itself, as the struct cannot guarantee these properties on its own.
So I would propose to generate these upon save and read the correct data in while parsing, and therefore not to use them to uniquely identify atoms in functions like binary_find_atom.
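A minimal sketch of the "generate upon save" idea, not pdbtbx's actual API: `ChainResidue` and `with_label_seq_ids` are hypothetical names introduced for illustration. Following the correction further down in this thread, the counter runs over the residues of one chain rather than over atoms.

```rust
/// Hypothetical residue record (an assumption, not a pdbtbx type):
/// only the author-assigned numbering is stored in memory.
#[derive(Debug, Clone, PartialEq)]
pub struct ChainResidue {
    pub auth_seq_id: isize,
    pub ins_code: Option<char>,
}

/// Pairs each residue of one chain with a label_seq_id computed at save time.
/// The value is the position in iteration order plus one, so it always starts
/// at 1 and has no gaps, whatever edits were made before saving.
pub fn with_label_seq_ids(chain: &[ChainResidue]) -> Vec<(usize, &ChainResidue)> {
    chain
        .iter()
        .enumerate()
        .map(|(index, residue)| (index + 1, residue))
        .collect()
}
```

Because the number is derived from the position, inserting or removing a residue never leaves a stale value behind; the stored struct does not have to maintain the invariant itself.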
Sorry, I might have introduced some confusion with the wrong wording...
The tuple (label_asym_id, label_seq_id) represents a residue, not an atom. The label_asym_id is the chain name, and the label_seq_id is the number in the range [1-N], where N is the number of residues in that chain.
Similarly for the tuple (auth_asym_id, auth_seq_id, pdbx_PDB_ins_code). It also represents a residue. auth_asym_id is the chain name, auth_seq_id is a number (but can be arbitrary), and pdbx_PDB_ins_code is the insertion code, which might be null (in an mmCIF file it will then hold the ? constant).
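As a rough illustration of the null handling described above, the raw field text could be mapped to an optional insertion code as follows. `parse_ins_code` is a hypothetical helper, and it assumes the insertion code is a single character, as in legacy PDB files. Besides `?` (unknown), CIF also uses `.` (not applicable) as a placeholder, so the sketch treats both as missing.

```rust
/// Converts the raw text of a pdbx_PDB_ins_code field into an optional
/// insertion code. `?` (unknown) and `.` (not applicable) are the CIF
/// placeholders for a missing value; an empty field is treated the same way.
/// Assumption: a present insertion code is a single character.
pub fn parse_ins_code(field: &str) -> Option<char> {
    match field.trim() {
        "" | "?" | "." => None,
        text => text.chars().next(),
    }
}
```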
In some mmCIF files, label_asym_id and auth_asym_id can be different. The authors might give the same chain different "label" and "auth" names (e.g., X as "auth" and A as "label").
The residue numbers label_seq_id and auth_seq_id can also be different, mainly if the "auth" numbering has gaps, which the "label" numbering cannot have.
Because of these two reasons, it would be great to identify a residue either by its "label" tuple or its "auth" tuple.
After some consideration, I made the Residue numbers always equal to auth_seq_id (or, if that is undefined, to label_seq_id), because of the earlier mentioned impossibility of forcing the label version to always be within spec. I also made the output of the program align with these specs, using auth for the residue numbers and always providing the well-defined label numbers.
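The fallback rule described here can be sketched in a few lines. This is an illustration under stated assumptions, not the code in pdbtbx: `residue_number` is a hypothetical function that takes the two raw field strings (already trimmed) and treats anything that does not parse as an integer, such as `?`, as undefined.

```rust
/// Picks the number stored on a residue: auth_seq_id when the file defines
/// it, otherwise label_seq_id. Returns None when neither field is a valid
/// integer (for example when both hold a CIF placeholder).
pub fn residue_number(auth_field: &str, label_field: &str) -> Option<isize> {
    auth_field
        .parse::<isize>()
        .ok()
        .or_else(|| label_field.parse::<isize>().ok())
}
```

Using a signed integer matters because author numbering may be zero or negative, while the label numbering starts at 1.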
For the asym_id, I am not very sure yet how to handle them, especially because I want to keep the old-fashioned PDB files in mind. If you have any ideas, feel free to share, and if you have any more information about the specs/expectations on these label/auth pairs, please share them. I could not find any information about these in the specs (or in the docs you shared).
As I am enjoying holidays around now I will be slower to implement changes for a while.
In general, PDB's (chain, number, icode) correspond to mmCIF's (auth_asym_id, auth_seq_id, pdbx_PDB_ins_code).
I used the same trick for the asym_id as I used for the seq_id. I will keep thinking about a better way of handling these while also including all the other auth_* data points. For me, the main problem is that I do not have any indication from the wwPDB on how these data points should behave. So I will try to find some examples in the wild of how they are actually used.
In mmCIF, every atom is addressable in two ways:
(label_asym_id, label_seq_id) e.g. ('A', 1)
(auth_asym_id, auth_seq_id, pdbx_PDB_ins_code) e.g. ('A', -1, 'X')
mmCIF guarantees that label_seq_id starts at 1 and increases by 1 without gaps, so an insertion code is not required here.
But structural biologists sometimes use "strange" numbering schemes to adjust to what other researchers are used to (for example, the nucleotide numbers in ribosomes are meaningful for ribosome biologists). The second tuple is excellent here because the numbering can be negative or zero and may have gaps or insertion codes.
In the legacy PDB format, we only had the second addressing scheme: (chain, number, icode). So most (if not all) of the tools supporting mmCIF input provide results using the tuple (auth_asym_id, auth_seq_id, pdbx_PDB_ins_code). Unfortunately, pdbtbx supports (label_asym_id, label_seq_id) addressing only, so I cannot match what the other tools provide with what pdbtbx parses.
Can we have the Residue structure holding both tuples?
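One possible shape for such a structure, given purely as a sketch: all type and function names below (`LabelId`, `AuthId`, `Residue`, `find_by_label`, `find_by_auth`) are hypothetical and do not describe pdbtbx's current or planned API. The point is only that a residue can carry both addresses and be looked up by either.

```rust
/// Residue address in the mmCIF "label" scheme.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LabelId {
    pub asym_id: String,
    pub seq_id: usize,
}

/// Residue address in the "auth" scheme, equivalent to the legacy PDB
/// (chain, number, icode) triple.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AuthId {
    pub asym_id: String,
    pub seq_id: isize,
    pub ins_code: Option<char>,
}

/// Hypothetical residue that carries both addresses side by side.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Residue {
    pub name: String,
    pub label: LabelId,
    pub auth: AuthId,
}

/// Linear search by the "label" address.
pub fn find_by_label<'a>(residues: &'a [Residue], id: &LabelId) -> Option<&'a Residue> {
    residues.iter().find(|r| &r.label == id)
}

/// Linear search by the "auth" address, the one most other tools report.
pub fn find_by_auth<'a>(residues: &'a [Residue], id: &AuthId) -> Option<&'a Residue> {
    residues.iter().find(|r| &r.auth == id)
}
```

With this layout, a result reported by another tool as ('A', -1, 'X') and a label address ('A', 1) can resolve to the same residue, which is the matching problem described above.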