douweschulte / pdbtbx

A library to open/edit/save (crystallographic) Protein Data Bank (PDB) and mmCIF files in Rust.
https://crates.io/crates/pdbtbx
MIT License

Bad handling of PDB files containing mixed upper and lower case chains #109

Closed OWissett closed 1 year ago

OWissett commented 1 year ago

If a PDB file has two chains whose IDs are the same letter in different case, e.g., B and b, it is not possible to distinguish these chains.

Some way of distinguishing these should be implemented.

I know that strictly speaking PDB files should only have upper case chains, but this isn't always the case.

Look at 7WFF, which contains both lowercase and uppercase letters.
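
To illustrate the problem, here is a minimal sketch in plain Rust. It does not use the pdbtbx API; `chain_id`, `distinct_chains`, and the two-line `SAMPLE` are hypothetical names and made-up data for illustration only. It relies on the fixed-column PDB layout, where the chain identifier of an ATOM/HETATM record sits in column 22. Folding case merges B and b into one chain, while preserving case keeps them apart.

```rust
use std::collections::BTreeSet;

// Hypothetical two-atom sample (not taken from 7WFF): same residue layout,
// chain IDs differing only in case.
const SAMPLE: &str = "ATOM      1  N   MET B   1\nATOM      2  N   MET b   1\n";

/// Extract the chain identifier from an ATOM/HETATM record.
/// In the fixed-column PDB format it is the character in column 22 (index 21).
fn chain_id(line: &str) -> Option<char> {
    if !(line.starts_with("ATOM") || line.starts_with("HETATM")) {
        return None;
    }
    // A blank in column 22 is treated as "no chain ID".
    line.chars().nth(21).filter(|c| !c.is_whitespace())
}

/// Collect the distinct chain IDs in a PDB text.
/// With `fold_case` set, IDs are upper-cased first, which merges `B` and `b`.
fn distinct_chains(pdb: &str, fold_case: bool) -> BTreeSet<char> {
    pdb.lines()
        .filter_map(chain_id)
        .map(|c| if fold_case { c.to_ascii_uppercase() } else { c })
        .collect()
}
```

Calling `distinct_chains(SAMPLE, true)` yields a single chain, whereas `distinct_chains(SAMPLE, false)` yields two, which is the behaviour requested here.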

douweschulte commented 1 year ago

(I was on holiday, so sorry for the late reply.) I originally implemented it this way because I thought the specification required this behaviour. But if many other programs do not follow this, and even RCSB ignores it, then we should allow this behaviour as well. Upon rereading the specification, I cannot find any rule stating that this should be the behaviour; this is the only rule I found:

Non-blank alphanumerical character is used for chain identifier.

Here are some additional comments explaining what is commonly used: https://biology.stackexchange.com/questions/82862/why-do-chain-identifiers-in-pdb-have-no-standard-starting-chain-id-type#:~:text=Chain%20IDs%20are%20assigned%20by%20authors%20who%20submit,identifier.%20Usually%2C%20the%20chains%20are%20assigned%20uppercase%20letters.
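
The quoted rule could be checked with something like the sketch below. `is_valid_chain_id` is a hypothetical helper, not a function from pdbtbx, and restricting the check to ASCII is an assumption on my part. Note that the rule places no restriction on case.

```rust
/// Check a chain identifier against the quoted rule: a single non-blank
/// alphanumeric character. Upper and lower case are both accepted.
/// Assumption: only ASCII letters and digits count as alphanumeric.
fn is_valid_chain_id(id: char) -> bool {
    id.is_ascii_alphanumeric()
}
```

Under this check `B`, `b`, and `1` are all valid identifiers, while a blank or a punctuation character is rejected.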

OWissett commented 1 year ago

In my experience, it is pretty common to have both B and b, particularly when the asymmetric unit contains two biological assemblies.

douweschulte commented 1 year ago

That indeed sounds like a reasonable use for them. I will merge the PR once done.