Closed boulund closed 3 years ago
I recently re-read the SeqAnswers thread on BBMap's sketching tools and saw that Brian suggested running tadpole.sh before sendsketch.sh to get rid of spurious/error kmers. This is fairly quick and not very memory intensive, and I think it can clean up the output from sendsketch.sh, making it a more realistic candidate to replace the memory-heavy mash screen solution we have now. Using sendsketch.sh removes the need for users to provide a mash screen database file, and it is a lot faster.
Implemented in v 3.0
I missed that Brian Bushnell last year added a tool to the BBTools suite that does something extremely similar to what mash screen does: sendsketch.sh. It sends a sketch of an input file (or pair of input files, compressed or not, as usual for BBTools) to JGI's sketch server to compare against reference sketches of nt, refseq (default), silva, or img. It is very fast. Here's an example using the first 1000 or so reads from a Helicobacter pylori sample:

As you can see, it works very well! Of course, it would require some tweaking of assess_mash_screen.py to parse sendsketch.sh output instead, and some additional testing like the testing and validation we performed for our mash screen evaluations, but it shouldn't be too much work, to be honest.

We could consider eventually replacing mash screen with sendsketch.sh, thus removing the entire mash dependency and the need to download a 700MB+ file with sketches of RefSeq genomes.

edit: Here's Brian Bushnell's "announcement" of sendsketch.sh on BioStars: https://www.biostars.org/p/234837/
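As a rough idea of the parsing change mentioned above for assess_mash_screen.py, here is a minimal sketch. It assumes that sendsketch.sh prints a tab-separated table whose header line starts with `WKID`, followed by one row per hit and ended by a blank line. That layout is an assumption that should be checked against real output before relying on it, and the function name is hypothetical.

```python
def parse_sketch_table(text, header_prefix="WKID"):
    """Parse the hit table from sendsketch.sh-style output into dicts.

    Assumes (unverified) a tab-separated table whose header line starts with
    `header_prefix`; rows are read until the first blank line or end of text.
    """
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if line.startswith(header_prefix):
            break
    else:
        # No header line found: nothing to parse.
        return []

    header = lines[index].split("\t")
    hits = []
    for line in lines[index + 1:]:
        if not line.strip():
            break  # a blank line ends the table
        # Map each column name to the value in the same position.
        hits.append(dict(zip(header, line.split("\t"))))
    return hits
```

Returning plain dicts keyed by column name keeps the downstream assessment logic independent of column order, which may make it easier to adapt if the output format turns out to differ.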