ctmrbio / BACTpipe

BACTpipe: An assembly and annotation pipeline for bacterial genomics
https://bactpipe.readthedocs.org
MIT License
20 stars 7 forks source link

Replace mash screen with BBTool's sendsketch.sh? #54

Closed boulund closed 3 years ago

boulund commented 6 years ago

I missed that Brian Bushnell last year added a tool to the BBTools suite that does something extremely similar to what mash screen does: sendsketch.sh.

It sends a sketch of an input file (or pair of input files, compressed or not, as per usual BBTool manners) to JGI's sketch-server to compare against reference sketches of nt, refseq (default), silva, or img. It is very fast. Here's an example using the first 1000 or so reads from a Helicobacter pylori sample:

(base) [fredrik.boulund@ctmr-nas bactpipe_test]$ sendsketch.sh in=sample_R1.fastq.gz                                                          
Adding /home/ctmr/anaconda3/opt/bbmap-37.90/resources/blacklist_refseq_species_300.sketch to blacklist.                                       
Loaded 1 sketch in      0.716 seconds.                                                                                                        

Query: HWI-M03284:45:000000000-AWVDC:1:1101:13827:1954 1:N:0:GAATTCGTGTACTGAC   DB: RefSeq      SketchLen: 8587 Seqs: 26873     Bases: 8088773   gSize: 2134365  Quality: 0.9074 File: sample_R1.fastq.gz                                                                                      
WKID    KID     ANI     Complt  Contam  Matches Unique  noHit   TaxID   gSize   gSeqs   taxName                                               
36.19%  30.08%  96.37%  54.43%  34.52%  2752    0       2227    102617  2560243 63      Helicobacter pylori SS1                               
7.66%   5.46%   91.08%  47.88%  65.86%  469     3       2463    382638  1530316 2       Helicobacter acinonychis str. Sheeba                  
0.95%   0.77%   84.42%  36.38%  70.55%  66      0       2463    1163745 1730296 2       Helicobacter cetorum MIT 99-5656                      
0.79%   0.55%   83.87%  42.33%  70.77%  47      0       2463    104628  1475704 42      Helicobacter suis                                     
3.03%   0.03%   88.06%  100.00% 71.28%  3       0       2463    1204178 26419   1       Helicobacter phage KHP40                              
0.31%   0.07%   81.06%  6.77%   70.61%  10      0       937     543736  9071270 292     Rhodococcus opacus PD630                              
0.12%   0.12%   78.34%  30.08%  71.20%  10      0       2463    1578720 2046283 204     Helicobacter ailurogastricus                          
0.12%   0.09%   78.37%  37.89%  71.22%  8       0       2463    936155  1631038 1       Helicobacter felis ATCC 49179                         
0.11%   0.09%   78.10%  22.32%  70.79%  8       0       2087    35817   2772762 392     Helicobacter heilmannii                               
0.09%   0.07%   77.36%  35.40%  71.25%  6       0       2463    1002805 1729718 147     Helicobacter bizzozeronii CCUG 35545                  
0.07%   0.06%   76.61%  32.44%  71.26%  5       0       2463    537972  1901982 44      Helicobacter pullorum MIT 98-5489                     
0.06%   0.05%   76.58%  40.04%  71.27%  4       0       2463    679897  1539411 1       Helicobacter mustelae 12198                           
0.06%   0.05%   76.43%  37.98%  71.27%  4       0       2463    537970  1610516 24      Helicobacter canadensis MIT 98-5491                   
0.06%   0.05%   76.33%  36.61%  71.27%  4       0       2463    556267  1653215 21      Helicobacter winghamensis ATCC BAA-430                
0.06%   0.05%   76.16%  34.44%  71.27%  4       0       2463    1449345 1781940 29      Helicobacter rodentium ATCC 700285                    
0.05%   0.05%   76.06%  33.23%  71.27%  4       0       2463    1476199 1851996 61      Helicobacter sp. 12S02634-8                           
0.05%   0.05%   75.71%  29.32%  71.27%  4       0       2463    1325130 2093688 49      Helicobacter fennelliae MRY12-0050                    
0.05%   0.05%   75.72%  27.56%  71.39%  4       0       2398    211     2212464 224     Helicobacter sp. CLO-3                                
0.05%   0.03%   75.99%  43.25%  71.28%  3       0       2463    1408442 1419428 11      Helicobacter pametensis ATCC 51478                    
0.04%   0.03%   75.39%  34.70%  71.28%  3       0       2463    1905759 1770278 45      Helicobacter sp. 13S00477-4                           

Total Time:     2.198 seconds.                                                                                                                

As you can see, it works very well! Of course, it would require some tweaking of assess_mash_screen.py, to parse sendsketch.sh output instead, and some additional testing like the testing and validation we performed for our mash screen evaluations, but it shouldn't be too much work to be honest.

We could consider eventually replacing mash screen with sendsketch.sh, thus removing the entire mash dependency, and removing the need to download a 700MB+ file with sketches of RefSeq genomes.

edit: Here's Brian Bushnell's "announcement" of sendsketsh.sh on BioStars: https://www.biostars.org/p/234837/

boulund commented 5 years ago

I recently re-read the SeqAnswers-thread on BBMap's sketching tools and saw that Brian suggested potentially running tadpole.sh in front of sendsketch.sh to get rid of spurious/error kmers. This is fairly quick, not very memory intensive, and I think it can clean up the output from sendsketch.sh, making it a more realistic candidate to replace the memory-heavy mash screen solution we have now. It removes the need for users to provide a mash screen database file and it is a lot faster.

http://seqanswers.com/forums/showthread.php?t=74019

thorellk commented 3 years ago

Implemented in v 3.0