DallasThomas / SACCHARIS

Improve functional predictions of uncharacterized sequences for any CAZyme or CBM family

Problem with identical sequence names #7

Closed: reslp closed this issue 4 years ago

reslp commented 4 years ago

Hi,

First of all, thank you for SACCHARIS; it really makes it easier to characterize CAZymes. I have been using it quite a bit lately, and for most families SACCHARIS runs fine. I have now come across a problem when analysing GH11: this family contains characterized sequences that share an identical sequence name (CAA46498). Most steps of SACCHARIS run, but tree reconstruction fails because of the identical names. I know it would be easy to remove the sequence from the alignment manually. However, my workflow is highly automated, because I analyse many CAZyme families together with thousands of additional sequences that could also be CAZymes, and it is difficult to predict which other families are affected. Would it be possible for you to add a step that checks the alignments for identical sequence names and fixes them (or removes one of the sequences)? I would also greatly appreciate any other suggestions on how to fix this.
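The kind of check requested here could be sketched roughly as follows. This is only an illustrative stand-alone workaround, not part of SACCHARIS: it assumes the alignment is available as FASTA text (the pipeline itself passes a `.phyi` file to FastTree), and the function names, the `seqB` record, and the example sequences are all made up. Only the name `CAA46498.1b` comes from the log output.

```python
from collections import Counter


def parse_fasta(text):
    """Return a list of (name, sequence) tuples from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Use the first whitespace-delimited token as the sequence name.
            tokens = line[1:].split()
            name, chunks = (tokens[0] if tokens else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def duplicate_names(records):
    """Return the sorted names that occur more than once."""
    counts = Counter(name for name, _ in records)
    return sorted(name for name, n in counts.items() if n > 1)


def make_names_unique(records, drop=False):
    """Drop repeated names (drop=True) or rename them with _2, _3, ... suffixes."""
    all_names = {name for name, _ in records}
    used = set()
    result = []
    for name, seq in records:
        if name not in used:
            used.add(name)
            result.append((name, seq))
            continue
        if drop:
            continue  # keep only the first record carrying this name
        n = 2
        # Skip suffixes that would collide with any existing or generated name.
        while f"{name}_{n}" in used or f"{name}_{n}" in all_names:
            n += 1
        new_name = f"{name}_{n}"
        used.add(new_name)
        result.append((new_name, seq))
    return result


# Made-up alignment; only the duplicated name is taken from the log.
example = """>CAA46498.1b
MKV-LA
>seqB
MKVQLA
>CAA46498.1b
MKV-LA
"""
records = parse_fasta(example)
print(duplicate_names(records))
for name, seq in make_names_unique(records):
    print(f">{name}\n{seq}")
```

Renaming keeps every sequence in the tree, while `drop=True` keeps only the first record for each name; which is appropriate depends on whether the repeated entries are true duplicates.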

Many thanks in advance!

Kind regards,

Philipp

Also, here is the relevant output from SACCHARIS (in this case run without additional sequences to reduce runtime). The resulting tree file is empty.

=============================================================================

*********************************************
* Prottest3 Tree Modeling Takes
*      --> 585 wallclock secs ( 0.05 usr  0.01 sys + 4276.09 cusr 31.93 csys = 4308.08 CPU) to run
*********************************************
FastTree - Tree building of /data/GH11/characterized/muscle/GH11.muscle_aln_mod_fast.phyi is underway
Building best tree - using FastTree
Commands To Run - 
        fasttree -lg -gamma -out FastTree_bootstrap.tree /data/GH11/characterized/muscle/GH11.muscle_aln_mod_fast.phyi; 

Threading: fasttree -lg -gamma -out FastTree_bootstrap.tree /data/GH11/characterized/muscle/GH11.muscle_aln_mod_fast.phyi; 
FastTree Version 2.1.10 Double precision (No SSE3)
Alignment: /data/GH11/characterized/muscle/GH11.muscle_aln_mod_fast.phyi
Amino acid distances: BLOSUM45 Joins: balanced Support: SH-like 1000
Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits: 1.00*sqrtN close=default refresh=0.80
ML Model: Le-Gascuel 2008, CAT approximation with 20 rate categories
Non-unique name 'CAA46498.1b' in the alignment
************************** Threading Ended
Threading: 
************************** Threading Ended
FastTree has finished

Completed Building of Tree

==============================================================================

*********************************************
* Tree Building Takes
*      -->  0 wallclock secs ( 0.07 usr +  0.01 sys =  0.08 CPU) to run
*********************************************
*********************************************
* Cazy Pipeline Took in Total
*      --> 617 wallclock secs ( 0.88 usr  0.15 sys + 4288.29 cusr 35.98 csys = 4325.30 CPU) to finish
*********************************************

Finished Cazy pipeline analysis for group: characterized of family GH11
==============================================================================
==============================================================================

Cazy Pipeline Finished

DallasThomas commented 4 years ago

Hello Philipp,

First off, thank you very much for pointing this out. This is a bug we had missed until now, so I am glad you reported it.

Please download the latest update to cazy_extract.pl and replace your version with this one.

Basically, when CAZy lists the same accession ID on two different pages, the script's duplicate screen missed that duplicate, which caused the issue you are seeing now.
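As a rough illustration of this kind of bug, and not the actual Perl code in cazy_extract.pl, the sketch below merges entries from several result pages while keeping one shared set of accessions already seen. The function name, the dictionary layout, and `placeholder_A` are assumptions made for the example; a duplicate screen that only looked within a single page would let the second `CAA46498` through.

```python
def merge_pages(pages):
    """Merge per-page entry lists, keeping the first entry for each accession.

    `pages` is a list of pages; each page is a list of dicts with an
    "accession" key (a hypothetical layout chosen for this sketch).
    """
    seen = set()    # shared across ALL pages, never reset per page
    merged = []
    skipped = []    # (accession, page number) of every duplicate dropped
    for page_number, entries in enumerate(pages, start=1):
        for entry in entries:
            accession = entry["accession"]
            if accession in seen:
                skipped.append((accession, page_number))
                continue
            seen.add(accession)
            merged.append(entry)
    return merged, skipped


# Hypothetical pages: the same accession appears on page 1 and page 2.
pages = [
    [{"accession": "CAA46498"}, {"accession": "placeholder_A"}],
    [{"accession": "CAA46498"}],
]
merged, skipped = merge_pages(pages)
print([entry["accession"] for entry in merged])
print(skipped)
```

Recording the skipped accessions along with their page numbers makes it easy to log which families are affected.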

This would of course have been detected sooner in one of our earlier versions. However, due to name length restrictions, we now rename the headers to something unique right after the extraction, and the names are not reverted until after MUSCLE has run.
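A small sketch of why temporary renaming can hide the problem until tree building. The scheme shown (zero-padded counters, width 10) and the function name are assumptions for illustration; the actual naming used by the pipeline is not shown in this thread.

```python
def assign_temp_ids(names, width=10):
    """Give every record a short, unique placeholder ID.

    Returns the placeholder IDs in input order and a lookup table
    for restoring the original names afterwards.
    """
    temp_ids = [f"{index:0{width}d}" for index in range(len(names))]
    lookup = dict(zip(temp_ids, names))
    return temp_ids, lookup


# "seqB" is a made-up name; the duplicated name is taken from the log.
names = ["CAA46498.1b", "seqB", "CAA46498.1b"]
temp_ids, lookup = assign_temp_ids(names)

# The placeholders are all distinct, so the alignment step sees no clash.
print(len(set(temp_ids)) == len(temp_ids))

# Restoring the original names brings the duplicate back.
restored = [lookup[temp_id] for temp_id in temp_ids]
print(len(set(restored)) < len(restored))
```

Because the placeholders are unique by construction, any duplicate check has to run on the original names, either before renaming or after they are restored.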

Please test out this copy and let me know what you find. I will keep this issue open until you are satisfied things are working.

Thanks, Dallas

reslp commented 4 years ago

Hi Dallas,

Thank you for the fast reply. Your fix seems to work fine. Thank you also for your explanation. I will do some additional tests but so far it runs smoothly. In case I come across the problem again I will reopen this thread.

Many thanks again!

All the best, Philipp