mmCIF format, atoms addressable only via label_* fields and not via auth_* fields

douweschulte / pdbtbx

A library to open/edit/save (crystallographic) Protein Data Bank (PDB) and mmCIF files in Rust.

https://crates.io/crates/pdbtbx

MIT License


mmCIF format, atoms addressable only via label_* fields and not via auth_* fields #95

Open tzok opened 2 years ago

tzok commented 2 years ago

In mmCIF, every atom is addressable in two ways:

As a tuple (label_asym_id, label_seq_id) e.g. ('A', 1)
As a tuple (auth_asym_id, auth_seq_id, pdbx_PDB_ins_code) e.g. ('A', -1, 'X')

mmCIF guarantees that label_seq_id starts at 1 and increases by 1 without gaps, so an insertion code is not required here.

But structural biologists sometimes use "strange" numbering schemes to adjust to what other researchers are used to (for example, the nucleotide numbers in ribosomes are meaningful for ribosome biologists). The second tuple is excellent here because the numbering can be negative or zero and may have gaps or insertion codes.

In the legacy PDB format, we only had the second addressing scheme: (chain, number, icode). So most (if not all) of the tools supporting mmCIF input provide results using the tuple (auth_asym_id, auth_seq_id, pdbx_PDB_ins_code). Unfortunately, pdbtbx supports only (label_asym_id, label_seq_id) addressing, so I cannot match what the other tools provide with what pdbtbx parses.

Can we have the Residue structure holding both tuples?

douweschulte commented 2 years ago

I agree, that would be best to improve. I have to say, I have never used mmCIF files a lot. And I was a bit lost in the documentation on all these different constructs and how they relate to the original PDB. So I am grateful you point these things out.

So the following tuples should both uniquely identify an Atom right?

(label_seq_id)
(auth_seq_id, pdbx_PDB_ins_code)

If that is the case I could think about supporting both for things like binary_find_atom.

Also, do you have any feelings about how to handle the missing label_seq_id on data from PDBs? It could be done by just creating a monotonically increasing counter over all Atoms.

Another point: is the auth_asym_id 'mis'used in the same way as the auth_seq_id? If that is the case, we may need to support both as well.

douweschulte commented 2 years ago

I also have some problems with label_seq_id from an API design perspective. Because it is defined to start at 1 and increase monotonically, any change in the structure has to be reflected by updating the number. So essentially it is the index in the pdb.atoms() iterator plus one. I do not think this is a very useful number to keep in the Atom struct itself, as the struct cannot guarantee these properties on its own.

So I would propose to generate these upon save and to read the correct data in while parsing, but not to use them to uniquely identify atoms in functions like binary_find_atom.
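The "generate upon save" idea can be sketched as follows. This is only an illustration, not pdbtbx code: `ResidueRecord` and `label_seq_ids` are hypothetical names, and the sketch assumes the counter runs over residues and restarts at 1 for every chain, which is the reading tzok gives in the next comment.

```rust
/// Hypothetical minimal residue record for illustration; not pdbtbx's `Residue` type.
#[derive(Debug, Clone, PartialEq)]
pub struct ResidueRecord {
    pub auth_asym_id: String,
    pub auth_seq_id: isize,
    pub ins_code: Option<String>,
}

/// Computes a label_seq_id for every residue at save time.
/// Assumption: residues are stored in file order, grouped by chain, and the
/// numbering restarts at 1 whenever the chain name changes.
pub fn label_seq_ids(residues: &[ResidueRecord]) -> Vec<usize> {
    let mut ids = Vec::with_capacity(residues.len());
    let mut current_chain: Option<&str> = None;
    let mut counter = 0;
    for residue in residues {
        // A new chain resets the counter so numbering starts at 1 again.
        if current_chain != Some(residue.auth_asym_id.as_str()) {
            current_chain = Some(residue.auth_asym_id.as_str());
            counter = 0;
        }
        counter += 1;
        ids.push(counter);
    }
    ids
}
```

Because the numbers are derived from position only, they never need to be stored or kept in sync while the structure is edited.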

tzok commented 2 years ago

Sorry, I might have introduced some confusion with the wrong wording...

The tuple (label_asym_id, label_seq_id) represents a residue, not an atom. The label_asym_id is the chain name, and the label_seq_id is the number in the range [1-N] where N is the number of residues in that chain.

The same holds for the tuple (auth_asym_id, auth_seq_id, pdbx_PDB_ins_code): it also represents a residue. auth_asym_id is the chain name, auth_seq_id is a number (but can be arbitrary), and pdbx_PDB_ins_code is the insertion code, which might be null (in an mmCIF file it then has the ? constant).

In some mmCIF files, label_asym_id and auth_asym_id can be different. The authors might give the same chain a different "label" and "auth" names (e.g., X "auth" and A "label")

The residue numbers label_seq_id and auth_seq_id can also be different -- mainly if "auth" numbering has gaps, which the "label" numbering cannot have.

Because of these two reasons, it would be great to identify a residue either by its "label" tuple or its "auth" tuple.
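One way to picture the request is a residue that stores both addresses and can be looked up by either. The sketch below is hypothetical: `LabelId`, `AuthId`, `Residue`, `find_by_label` and `find_by_auth` are illustrative names and do not describe the pdbtbx API.

```rust
/// "label" address: (label_asym_id, label_seq_id).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct LabelId {
    pub asym_id: String,
    pub seq_id: usize,
}

/// "auth" address: (auth_asym_id, auth_seq_id, pdbx_PDB_ins_code).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct AuthId {
    pub asym_id: String,
    pub seq_id: isize,
    /// `None` when the file gives no insertion code.
    pub ins_code: Option<String>,
}

/// Hypothetical residue that keeps both addresses side by side.
#[derive(Debug, Clone, PartialEq)]
pub struct Residue {
    pub name: String,
    pub label: LabelId,
    pub auth: AuthId,
}

/// Linear search by the "label" tuple.
pub fn find_by_label<'a>(residues: &'a [Residue], id: &LabelId) -> Option<&'a Residue> {
    residues.iter().find(|r| &r.label == id)
}

/// Linear search by the "auth" tuple.
pub fn find_by_auth<'a>(residues: &'a [Residue], id: &AuthId) -> Option<&'a Residue> {
    residues.iter().find(|r| &r.auth == id)
}
```

With this layout, a residue with "label" address ('A', 1) and "auth" address ('X', -1, 'X') is found by either tuple, so results from tools that report "auth" numbering can be matched directly.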

douweschulte commented 2 years ago

After some consideration I made the Residue numbers always equal to auth_seq_id (or label_seq_id if auth_seq_id is undefined), because of the earlier mentioned impossibility of forcing the label version to always be within spec. I also made the output of the program align with these specs, using auth for the residue numbers and always providing the well-defined label numbers.
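The fallback rule described here could look roughly like the sketch below. It is not the code from the commit: `parse_seq_id` and `residue_number` are hypothetical helpers, and the sketch assumes "undefined" means the mmCIF placeholder values `?` (unknown) or `.` (not applicable), or a field that does not parse as an integer.

```rust
/// Parses an mmCIF integer field, treating the placeholders `?` and `.`
/// (and anything that is not an integer) as undefined.
pub fn parse_seq_id(field: &str) -> Option<isize> {
    match field.trim() {
        "?" | "." => None,
        value => value.parse().ok(),
    }
}

/// Picks the residue number: auth_seq_id when defined, otherwise label_seq_id.
pub fn residue_number(auth_seq_id: &str, label_seq_id: &str) -> Option<isize> {
    parse_seq_id(auth_seq_id).or_else(|| parse_seq_id(label_seq_id))
}
```

Returning `Option<isize>` leaves the case where both fields are undefined visible to the caller instead of silently inventing a number.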

For the asym_id I am not very sure yet how to handle them, especially because I want to keep the old-fashioned PDB files in mind. If you have any ideas, feel free to share, and if you have any more information about the specs/expectations on these label/auth pairs, please share them. I could not find any information about these in the specs (or in the docs you shared).

As I am enjoying holidays around now I will be slower to implement changes for a while.

tzok commented 2 years ago

In general, PDB's (chain, number, icode) correspond to mmCIF's (auth_asym_id, auth_seq_id, pdbx_PDB_ins_code).

douweschulte commented 2 years ago

I used the same trick for asym_id as I used for seq_id. I will keep thinking about a better way of handling these while also including all the other auth_* data points. For me the main problem is that I do not have any indication from the wwPDB on how these data points should behave. So I will try to find some examples in the wild of how they are actually used.