Add 2000605_rce_diamond.pdf

ababaian / serratus

Ultra-deep search for novel viruses

http://serratus.io

GNU General Public License v3.0


Add 2000605_rce_diamond.pdf #150

Closed rcedgar closed 4 years ago

rcedgar commented 4 years ago

Benchmark of diamond vs. bowtie2 for Cov+ and Cov- datasets with holdout test of novel genus detection.

taltman commented 4 years ago

@rcedgar Can you document how you constructed the test files? Are the test files staged somewhere on S3 for others to test? Thanks!

rcedgar commented 4 years ago

Which test files? The FASTQs were fastq-dump'd. The reference databases are part of my private collection of stuff, not ideal but I don't have the time to document & sync everything up to S3 unless there is a good reason.

taltman commented 4 years ago

No fancy documentation needed, but could you just sync them up to a notebook on serratus-public? I had some tools that I wanted to benchmark as well, and it would be helpful if I ran them on the same test datasets as you. Thanks!

rcedgar commented 4 years ago

Datasets are given in the PDF: Cov+ = SRR11454614 and Cov- = ERR3568641. I now see that there are too many zeroes in the PDF filename, but I don't see how to rename it :frown: Which tools do you think might be competitive with diamond? I think it is very unlikely that there is an alternative, and if there is, we need to make a decision very quickly. I have the benchmark infrastructure set up, so it would make more sense for me to run it if there is a promising candidate.
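For anyone who wants to pull the same two runs, here is a minimal sketch of how the fastq-dump calls might be assembled. The accessions are the ones named in the comment above; the helper name `fastq_dump_command` is hypothetical, and streaming to stdout with `-Z` is an assumption, not necessarily how the original FASTQs were produced.

```python
import shlex

# Accessions as named in the comment above.
DATASETS = {"Cov+": "SRR11454614", "Cov-": "ERR3568641"}


def fastq_dump_command(accession):
    """Return an argv list that would stream one run's reads as FASTQ to stdout.

    Assumption: -Z (write to stdout) is used here for illustration only;
    the thread does not say which fastq-dump options were actually used.
    """
    return ["fastq-dump", "-Z", accession]


if __name__ == "__main__":
    # Print the commands instead of running them, so nothing is downloaded.
    for label, accession in DATASETS.items():
        print(label, shlex.join(fastq_dump_command(accession)))
```

The script only prints the command lines; running them requires the SRA Toolkit to be installed.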

ababaian commented 4 years ago

This is surprisingly faster than I thought, and I think we can slot this in fairly easily and do a bit of optimization to get it working. A few notes:

Did you run bowtie2 in --very-sensitive-local as we do in production?
The test genome/proteome files need to be uploaded to an S3 notebook folder and referenced in the notebook where they're available.

rcedgar commented 4 years ago

bowtie2 was run with --very-sensitive-local --no-unal --no-head -U. Input was supplied by cat'ing decompressed FASTQ, to isolate the compute cost of the alignment step itself.

I'm pushing back on "must document everything I did so that it is 100% reproducible". I disagree: not every experiment I do needs to be reproducible. That is overkill and slows me down; there needs to be a specific reason. It is more important that I push forward with testing so we can make a decision about what type of database to use (pol-only? how to include other families? etc.) and develop a diamond-based classifier.
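As a rough illustration of the setup described in this comment, the sketch below assembles a bowtie2 command line from the quoted flags and pipes decompressed FASTQ into it. The index prefix `cov_index`, the file name `reads.fastq`, and the helper names are hypothetical, and passing `-` to `-U` so that bowtie2 reads from stdin is an assumption about how the cat'd input was fed in.

```python
import shlex

# Flags quoted in the comment above.
BOWTIE2_FLAGS = ["--very-sensitive-local", "--no-unal", "--no-head"]


def bowtie2_command(index_prefix, reads="-"):
    """Return an argv list for an unpaired bowtie2 run.

    Assumptions: index_prefix is a placeholder for the (unpublished)
    reference index, and reads="-" means "read FASTQ from stdin".
    """
    return ["bowtie2", *BOWTIE2_FLAGS, "-x", index_prefix, "-U", reads]


def cat_pipeline(fastq_path, index_prefix):
    """Return a shell pipeline string: cat the FASTQ into bowtie2 via stdin."""
    cat_part = shlex.join(["cat", fastq_path])
    return cat_part + " | " + shlex.join(bowtie2_command(index_prefix))


if __name__ == "__main__":
    # Hypothetical file and index names, printed rather than executed.
    print(cat_pipeline("reads.fastq", "cov_index"))
```

Feeding the reads through a pipe keeps decompression out of the timed step, which matches the stated goal of isolating the alignment cost.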

ababaian commented 4 years ago

I'm not saying make everything 100% reproducible, just upload the protein files since Tomer wants to take a look at them and try other stuff. Not too big of an ask.

ababaian commented 4 years ago

@Tomer if you know of some other software, you should discuss it upfront so we don't end up going too far down the road with Diamond if there's a viable alternative.