UCLOrengoGroup / cath-tools-seqscan

CATH: scan/align protein sequences against functional families

Misalignment of start-stop #9

Closed: dudimarcus closed this issue 7 years ago

dudimarcus commented 7 years ago

Hi, I noticed that when a sequence starts at a position greater than 1, the reported start-stop is always misaligned relative to the PDB SEQRES, which suggests the start-stop fix may not be calibrated correctly.

for example:

good: correctly starts at 1

cath|4.1.0|1a52A00/1-258 .................................................................-.....................mIKRSKKNSLALSLTADQMVSALLDAEPPILYSEYDPTRPFSEASMMGLLTNLADRELVHMINWAKRVPGFVDLTLHDQVHLLECAWLEILMIGLVWRSMEHPGKLLFAPNLLLDRNQGKCVEGMVEIFDMLLATSSRFRMMNLQGEEFVCLKSIILLN...SGVYTFLSSTLKSLEEKDHIHRVLDKITDTLIHLMAKAGLTLQQQHERLAQLLLILSHIRHMSNKGMEHLYSMKCKNVVPLYDLLLEMLDAHRLhapts

bad: should actually start at 24, not 8

cath|4.1.0|2jfaA00/8-252 .................................................................-......................-------SLALSLTADQMVSALLDAEPPILYSEYDPTRPFSEASMMGLLTNLADRELVHMINWAKRVPGFVDLTLHDQVHLLECAWLEILMIGLVWRSMEHPGKLLFAPNLLLDRNQGKCVEGMVEIFDMLLATSSRFRMMNLQGEEFVCLKSIILLN...SGVYTFLSSTLKSLEEKDHIHRVLDKITDTLIHLMAKAGLTLQQQHQRLAQLLLILSHIRHMSNKGMEHLYSMKCKNV----------------.....
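One way to sanity-check a reported start-stop like the ones above is to strip the gap characters from the aligned row and look for the remaining residues in the SEQRES sequence. The following is only a minimal sketch of that check, not the code used by cath-tools-seqscan: the function names `ungap` and `seqres_range` are hypothetical, the sequences are toy data rather than a real PDB entry, and it assumes that `.` and `-` are the only gap characters and that lowercase letters are ordinary residues.

```python
def ungap(aligned_row: str) -> str:
    """Drop gap characters ('.' and '-') and upper-case the remaining residues."""
    return "".join(ch for ch in aligned_row if ch not in ".-").upper()


def seqres_range(seqres: str, aligned_row: str) -> tuple[int, int]:
    """Return the 1-based inclusive (start, stop) of the aligned residues in seqres."""
    residues = ungap(aligned_row)
    if not residues:
        raise ValueError("aligned row contains no residues")
    offset = seqres.find(residues)  # 0-based index of the first exact match
    if offset < 0:
        raise ValueError("aligned residues not found in SEQRES")
    return offset + 1, offset + len(residues)


# Toy data for illustration only (not a real PDB entry).
toy_seqres = "MKTAYIAKQRQISFVKSHFSRQ"
toy_row = "..-----AKQRQisfv...-KS---"

print(seqres_range(toy_seqres, toy_row))  # (7, 17)
```

Note that `str.find` returns the first exact match, so this check is ambiguous if the same stretch of residues occurs more than once in the SEQRES, and it will not find a match if the aligned residues differ from the SEQRES at any position.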

sillitoe commented 7 years ago

Working on this now...

dudimarcus commented 7 years ago

Great! thanks Ian

sillitoe commented 7 years ago

Okay. Progress.

Should be finished by the time I get home - I'll commit the new lookup file and all should work correctly.

dudimarcus commented 7 years ago

That's great! thanks Ian. Looking forward to testing it tomorrow, previous version already showed great results.

sillitoe commented 7 years ago

Okay, this should be fixed now.

As a summary:

Can you confirm this works, and close the ticket if you're happy?

dudimarcus commented 7 years ago

Yay, it's fixed. All the misalignment examples are corrected now.

Thanks Ian!

sillitoe commented 7 years ago

Great.

For the record - there were 4 domains (of 383,628) where the domain boundaries of our internal files do not match the results of the original algorithm that was used to generate them.

I'm not yet sure how this happened, but just to let you know that if you happen to hit these domain ids (they are all identical sequences) then the script will throw an exception.

$ grep 'Error' data/domain-sequence-numbering.v4_1_0.txt
# Error: 4ikpA01 has mismatching domain boundaries
# Error: 4ikpB01 has mismatching domain boundaries
# Error: 4ikpC01 has mismatching domain boundaries
# Error: 4ikpD01 has mismatching domain boundaries
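If you would rather skip these domains up front than hit the exception, the affected ids can be collected from the error lines shown above. This is a rough sketch only: the helper name `mismatched_domain_ids` is hypothetical, and the pattern assumes nothing about the lookup file beyond the `# Error: ... has mismatching domain boundaries` lines in the grep output.

```python
import re

# Matches lines such as "# Error: 4ikpA01 has mismatching domain boundaries".
ERROR_LINE = re.compile(r"^# Error: (\S+) has mismatching domain boundaries\s*$")


def mismatched_domain_ids(lines):
    """Return the domain ids named in '# Error: ...' lines, in input order."""
    ids = []
    for line in lines:
        match = ERROR_LINE.match(line)
        if match:
            ids.append(match.group(1))
    return ids


# The four error lines reported in the grep output above.
sample_lines = [
    "# Error: 4ikpA01 has mismatching domain boundaries",
    "# Error: 4ikpB01 has mismatching domain boundaries",
    "# Error: 4ikpC01 has mismatching domain boundaries",
    "# Error: 4ikpD01 has mismatching domain boundaries",
]

print(mismatched_domain_ids(sample_lines))
# ['4ikpA01', '4ikpB01', '4ikpC01', '4ikpD01']
```

To run it against the real file, pass an open file handle instead of `sample_lines`, for example `mismatched_domain_ids(open("data/domain-sequence-numbering.v4_1_0.txt"))`.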