Incorrect _label_seq_id numbering

tristanic commented 1 year ago

I've attached the result of a test run, building 7uxa from the deposited FASTA and map, and noticed what I think is a bug in the sequence numbering. If you look at residues 107-112 (_auth_seq_id numbering) of chain Db, ModelAngelo has missed one residue (sequence should be QGRQRL, modelled sequence is QGRRL). While the _auth_seq_id skips a number at the break (going from 109-111), the _label_seq_id does not. If I understand the mmCIF specification correctly, this is wrong - _label_seq_id should reflect the authoritative numbering derived from the real-world sequence, where _auth_seq_id is the more flexible label that follow whatever system the authors choose. In ChimeraX (and probably in other viewers, but I haven't checked) this leads to the chain being treated as continuous at this point.

Upshot: I believe the residues with assigned sequence need to be mapped back to the original FASTA sequence for the purpose of assigning _label_seq_id.

output.zip

jamaliki commented 1 year ago

Hi Tristan!

So, if I understand correctly, I should be using the _auth_seq_id field instead of _label_seq_id? So _label_seq_id is the numbering derived from the sequence, while _auth_seq_id can be anything?

tristanic commented 1 year ago

That’s my understanding, yes. I always find the mmCIF documentation a bit difficult to get around, but https://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_atom_site.auth_seq_id.html and https://mmcif.wwpdb.org/docs/tutorials/content/atomic-description.html make it pretty clear.

On Fri, 9 Jun 2023 at 16:29, Kiarash Jamali @.***> wrote:

Hi Tristan!

So, if I understand correctly, I should be using the _auth_seq_id field instead of _label_seq_id? So _label_seq_id is the numbering derived from the sequence, while _auth_seq_id can be anything?

— Reply to this email directly, view it on GitHub https://github.com/3dem/model-angelo/issues/51#issuecomment-1584769881, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFM54YA4BOMM5HQV475AMPTXKM6L3ANCNFSM6AAAAAAZAWXV6A . You are receiving this because you authored the thread.Message ID: @.***>

jamaliki commented 1 year ago

I see. So this is tricky, because it seems that the IO class I use to save cif files, MMCIFIO, does not allow me to set _label_seq_id, only _auth_seq_id: https://github.com/biopython/biopython/blob/d5dfda8e20a65dfaec6c84527baecb0adff939c2/Bio/PDB/mmcifio.py#L309

I didn't want to have ModelAngelo carry it's own cif IO tools, but maybe that needs to be done

tristanic commented 1 year ago

Hmm... I've never used BioPython (and from a glance the documentation seems a little sparse). I wonder if it will let you insert a Residue into the chain with the right name but no atoms?

On Fri, Jun 9, 2023 at 5:02 PM Kiarash Jamali @.***> wrote:

I see. So this is tricky, because it seems that the IO class I use to save cif files, MMCIFIO does not allow me to set _label_seq_id, only _auth_seq_id: https://github.com/biopython/biopython/blob/d5dfda8e20a65dfaec6c84527baecb0adff939c2/Bio/PDB/mmcifio.py#L309

I didn't want to have ModelAngelo carry it's own cif IO tools, but maybe that needs to be done

— Reply to this email directly, view it on GitHub https://github.com/3dem/model-angelo/issues/51#issuecomment-1584816156, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFM54YEWF2WBHJ2CFAQ42U3XKNCJPANCNFSM6AAAAAAZAWXV6A . You are receiving this because you authored the thread.Message ID: @.***>

jamaliki commented 1 year ago

Oh that is smart, to introduce insertions?

What do you use instead, would that allow me to do this correctly? I am not opposed to changing the parsing package.

tristanic commented 1 year ago

Me? I do pretty much everything in ChimeraX, so that's probably not that much help to you. :) They (more specifically Greg Couch) wrote their own mmCIF I/O libraries.

Looking at the BioPython _save_structure() method it seems that, at least, would do the right thing if the chain contained empty Residue placeholders for unmodelled residues... but so far I've been unable to find any definitive answer in the BioPython documentation about this use case so I don't know if it's officially supported. Couldn't hurt to try, though?

jamaliki commented 1 year ago

Yes, I will definitely try. The longer term fix (proposed by @tomburnley) seems to be to move towards gemmi for the I/O, which will hopefully be done soon(-ish)

jgreener64 commented 1 year ago

Indeed if you have an empty residue it will give a gap in _atom_site.label_seq_id, see for example:

from Bio.PDB.mmcifio import MMCIFIO
from Bio.PDB.StructureBuilder import StructureBuilder

struct = StructureBuilder()
struct.init_structure("1")
struct.init_seg("1")
struct.init_model("1")
struct.init_chain("A")
struct.init_residue(f"ALA", " ", 1, " ")
struct.init_atom("CA", [1.0, 2.0, 3.0], 0, 1, " ", "CA", "C")
struct.init_atom("CB", [1.0, 2.0, 3.0], 0, 1, " ", "CB", "C")
struct.init_residue(f"LEU", " ", 2, " ")
struct.init_residue(f"GLU", " ", 3, " ")
struct.init_atom("CA", [1.0, 2.0, 3.0], 0, 1, " ", "CA", "C")
struct.init_atom("CB", [1.0, 2.0, 3.0], 0, 1, " ", "CB", "C")
struct = struct.get_structure()
io = MMCIFIO()
io.set_structure(struct)
io.save("out.cif")

data_1
#
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_alt_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_entity_id
_atom_site.label_seq_id
_atom_site.pdbx_PDB_ins_code
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.auth_seq_id
_atom_site.auth_asym_id
_atom_site.pdbx_PDB_model_num
ATOM 1 CA CA . ALA A ? 1 ? 1.000 2.000 3.000 1 0 1 A 1 
ATOM 2 ?  CB . ALA A ? 1 ? 1.000 2.000 3.000 1 0 1 A 1 
ATOM 3 CA CA . GLU A ? 3 ? 1.000 2.000 3.000 1 0 3 A 1 
ATOM 4 ?  CB . GLU A ? 3 ? 1.000 2.000 3.000 1 0 3 A 1 
#

So you could try adding residues as appropriate. If you point me to the place where the structure is written out I might be able to help.

Alternatively you could just map _atom_site.label_seq_id to _atom_site.auth_seq_id at the dictionary level:

from Bio.PDB.MMCIF2Dict import MMCIF2Dict
from Bio.PDB.mmcifio import MMCIFIO

dic = MMCIF2Dict("2V64.cif")
dic["_atom_site.label_seq_id"] = dic["_atom_site.auth_seq_id"]
io = MMCIFIO()
io.set_dict(dic)
io.save("out.cif")

jamaliki commented 1 year ago

Oh the dictionary stuff is actually super useful. I think that will solve it!

jamaliki commented 1 year ago

Hi, I believe this is now fixed with PR #63

3dem / model-angelo

Incorrect _label_seq_id numbering #51