UCLOrengoGroup / cath-tools-seqscan

CATH: scan/align protein sequences against functional families
3 stars 0 forks source link

Check start/stop of alignment entries #5

Closed sillitoe closed 8 years ago

sillitoe commented 8 years ago

Investigating an issue from #4

The reported start/stop of the first domain in the alignment:

>cath|4.1.0|1ereE00/305-557

Doesn't match the start/stop of the domain according to CATH:


$ head -n 1 /cath/data/v4_1_0/domfasta/1ereE00.ATOM.fa

> cath|4.1.0|1ereE00/305-548
> 
> ```
> ```

or the FunFam alignment:

$ head -n 1 /cath/data/v4_1_0/funfam/families/1.10.565.10/338.core.nofrag.aln

cath|4.1.0|1ereE00/305-548 ...

Seems that something in the align_hit pipeline is altering the stop residue which is more than a little confusing.

sillitoe commented 8 years ago

This is due to Bio::LocatableSeq overriding the end res (expects sequential numbering)

https://github.com/bioperl/bioperl-live/blob/73c446c69a774e3e51fc15912f2c4da6bb18f681/Bio/LocatableSeq.pm#L201-L204

sillitoe commented 8 years ago

704cd26 includes a workaround

Note: please do your own testing and let me know if there are issues

dudimarcus commented 8 years ago

So just to check if I got the workaround right, this is the fasta of 1ere:

>1ere:E
SLALSLTADQMVSALLDAEPPILYSEFSEASMMGLLTNLADRELVHMINWAKRVPGFVDLTLHDQVHLLECAWLEILMIGLVWRSMEHPGKLLFAPNLLLDRNQGKCVEGMVEIFDMLLATSSRFRMMNLQGEEFVCLKSIILLNSGVYTFTLKSLEEKDHIHRVLDKITDTLIHLMAKAGLTLQQQHQRLAQLLLILSHIRHMSNKGMEHLYSMKCKNVVPLYDLLLEMLDAHR

And this is the alignment generated from the script with the numbering in the header:

1ereE   1-253   .........................................................................................................................................................................................................................................................................................................-......................---SKKNSLALSLTADQMVSALLDAEPPILYSEYDPTRPFSEASMMGLLTNLADRELVHMINWAKRVPGFVDLTLHDQVHLLECAWLEILMIGLVWRSMEHPGKLLFAPNLLLDRNQGKCVEGMVEIFDMLLATSSRFRMMNLQGEEFVCLKSIILLN...SGVYTFLSSTLKSLEEKDHIHRVLDKITDTLIHLMAKAGLTLQQQHQRLAQLLLILSHIRHMSNKGMEHLYSMKCKNVVPLYDLLLEMLDAHRLhapt..........................................

So I need to map SEQRES to ATOM to get this to work, remove the residues before and after in this example...

Thanks!

sillitoe commented 8 years ago

That's the FASTA of 1ere based on...? SEQRES records? ATOM records? (I'm guessing ATOM records)

The sequence in the alignment is from the SEQRES records. That big fix was to get the numbering in the headers consistent with the SEQRES records so that you should now be able to figure out the SEQRES position of each of the residues in the sequence.

Now you know what the SEQRES position is for each residue, you should be able to use your own SEQRES / ATOM lookup to decide how to go from the sequence to the structure.

So the protein starts with the residues "SKKN" - these might be biologically important (which is why we include them), but it turns out they aren't observed in the PDB structure (for whatever reason). It's up to you how you want to handle that.

dudimarcus commented 8 years ago

Great, so I got it correct :) This is exactly what we needed to fix this.

Thanks Ian!

sillitoe commented 8 years ago

No problem.

Please do your own consistency checks to make sure you are getting the expected lengths and the AA residues match up, etc. I've added sanity checks in the code that applies the SEQRES fix, but there's lots of potential edge cases and it would be easy to be one residue out, etc.

sillitoe commented 8 years ago

Note: you can see the full mapping ./data/domain-sequence-numbering.v4_1_0.txt.gz

dudimarcus commented 8 years ago

So mmCIF file has this mapping between SEQRES to ATOM (thanks Roman!)

with the example above:

E 1 1 SER 1 301 ? ? ? E . n
E 1 2 LYS 2 302 ? ? ? E . n
E 1 3 LYS 3 303 ? ? ? E . n
E 1 4 ASN 4 304 ? ? ? E . n
E 1 5 SER 5 305 305 SER SER E . n
E 1 6 LEU 6 306 306 LEU LEU E . n
E 1 7 ALA 7 307 307 ALA ALA E . n
E 1 8 LEU 8 308 308 LEU LEU E . n
E 1 9 SER 9 309 309 SER SER E . n
E 1 10 LEU 10 310 310 LEU LEU E . n
E 1 11 THR 11 311 311 THR THR E . n
E 1 12 ALA 12 312 312 ALA ALA E . n
E 1 13 ASP 13 313 313 ASP ASP E . n
E 1 14 GLN 14 314 314 GLN GLN E . n
E 1 15 MET 15 315 315 MET MET E . n
E 1 16 VAL 16 316 316 VAL VAL E . n
E 1 17 SER 17 317 317 SER SER E . n
E 1 18 ALA 18 318 318 ALA ALA E . n
E 1 19 LEU 19 319 319 LEU LEU E . n
E 1 20 LEU 20 320 320 LEU LEU E . n
E 1 21 ASP 21 321 321 ASP ASP E . n
E 1 22 ALA 22 322 322 ALA ALA E . n
E 1 23 GLU 23 323 323 GLU GLU E . n
E 1 24 PRO 24 324 324 PRO PRO E . n
E 1 25 PRO 25 325 325 PRO PRO E . n
E 1 26 ILE 26 326 326 ILE ILE E . n
sillitoe commented 8 years ago

Closing this ticket - issue solved in 704cd26