Closed sillitoe closed 8 years ago
This is due to Bio::LocatableSeq overriding the end res (expects sequential numbering)
704cd26 includes a workaround
Note: please do your own testing and let me know if there are issues
So just to check if I got the workaround right, this is the fasta of 1ere:
>1ere:E
SLALSLTADQMVSALLDAEPPILYSEFSEASMMGLLTNLADRELVHMINWAKRVPGFVDLTLHDQVHLLECAWLEILMIGLVWRSMEHPGKLLFAPNLLLDRNQGKCVEGMVEIFDMLLATSSRFRMMNLQGEEFVCLKSIILLNSGVYTFTLKSLEEKDHIHRVLDKITDTLIHLMAKAGLTLQQQHQRLAQLLLILSHIRHMSNKGMEHLYSMKCKNVVPLYDLLLEMLDAHR
And this is the alignment generated from the script with the numbering in the header:
1ereE 1-253 .........................................................................................................................................................................................................................................................................................................-......................---SKKNSLALSLTADQMVSALLDAEPPILYSEYDPTRPFSEASMMGLLTNLADRELVHMINWAKRVPGFVDLTLHDQVHLLECAWLEILMIGLVWRSMEHPGKLLFAPNLLLDRNQGKCVEGMVEIFDMLLATSSRFRMMNLQGEEFVCLKSIILLN...SGVYTFLSSTLKSLEEKDHIHRVLDKITDTLIHLMAKAGLTLQQQHQRLAQLLLILSHIRHMSNKGMEHLYSMKCKNVVPLYDLLLEMLDAHRLhapt..........................................
So I need to map SEQRES to ATOM to get this to work, remove the residues before and after in this example...
Thanks!
That's the FASTA of 1ere based on...? SEQRES records? ATOM records? (I'm guessing ATOM records)
The sequence in the alignment is from the SEQRES records. That big fix was to get the numbering in the headers consistent with the SEQRES records so that you should now be able to figure out the SEQRES position of each of the residues in the sequence.
Now you know what the SEQRES position is for each residue, you should be able to use your own SEQRES / ATOM lookup to decide how to go from the sequence to the structure.
So the protein starts with the residues "SKKN" - these might be biologically important (which is why we include them), but it turns out they aren't observed in the PDB structure (for whatever reason). It's up to you how you want to handle that.
Great, so I got it correct :) This is exactly what we needed to fix this.
Thanks Ian!
No problem.
Please do your own consistency checks to make sure you are getting the expected lengths and the AA residues match up, etc. I've added sanity checks in the code that applies the SEQRES fix, but there's lots of potential edge cases and it would be easy to be one residue out, etc.
Note: you can see the full mapping ./data/domain-sequence-numbering.v4_1_0.txt.gz
So mmCIF file has this mapping between SEQRES to ATOM (thanks Roman!)
with the example above:
E 1 1 SER 1 301 ? ? ? E . n
E 1 2 LYS 2 302 ? ? ? E . n
E 1 3 LYS 3 303 ? ? ? E . n
E 1 4 ASN 4 304 ? ? ? E . n
E 1 5 SER 5 305 305 SER SER E . n
E 1 6 LEU 6 306 306 LEU LEU E . n
E 1 7 ALA 7 307 307 ALA ALA E . n
E 1 8 LEU 8 308 308 LEU LEU E . n
E 1 9 SER 9 309 309 SER SER E . n
E 1 10 LEU 10 310 310 LEU LEU E . n
E 1 11 THR 11 311 311 THR THR E . n
E 1 12 ALA 12 312 312 ALA ALA E . n
E 1 13 ASP 13 313 313 ASP ASP E . n
E 1 14 GLN 14 314 314 GLN GLN E . n
E 1 15 MET 15 315 315 MET MET E . n
E 1 16 VAL 16 316 316 VAL VAL E . n
E 1 17 SER 17 317 317 SER SER E . n
E 1 18 ALA 18 318 318 ALA ALA E . n
E 1 19 LEU 19 319 319 LEU LEU E . n
E 1 20 LEU 20 320 320 LEU LEU E . n
E 1 21 ASP 21 321 321 ASP ASP E . n
E 1 22 ALA 22 322 322 ALA ALA E . n
E 1 23 GLU 23 323 323 GLU GLU E . n
E 1 24 PRO 24 324 324 PRO PRO E . n
E 1 25 PRO 25 325 325 PRO PRO E . n
E 1 26 ILE 26 326 326 ILE ILE E . n
Closing this ticket - issue solved in 704cd26
Investigating an issue from #4
P03372.fa
1.10.565.10-FF-338.fasta
The reported start/stop of the first domain in the alignment:
>cath|4.1.0|1ereE00/305-557
Doesn't match the start/stop of the domain according to CATH:
$ head -n 1 /cath/data/v4_1_0/funfam/families/1.10.565.10/338.core.nofrag.aln
Seems that something in the
align_hit
pipeline is altering the stop residue which is more than a little confusing.