Closed kternus closed 6 years ago
Hi @kternus, yes this is correct. We plan to stick with these versions of the tools for the 1.0 release because we've verified that they work. Let's keep this issue open as a reminder to put that in the documentation.
Sounds good, @brooksph. Did you want to include khmer in the assembly workflow to remove any single-end reads that don't have pairs before running through metaSPAdes?
Here's a version with that:
It looks like we're not using MetaQUAST, and we are using MultiQC:
@kternus following the read filtering step unpaired reads are put into a separate file so we don't need to use khmer for that. We're using metaquast to generate assembly statistics and then compiling a visualizing the reports with multiQC. If we remove the khmer step and the reference to single or paired-end reads from your diagram we're good to go.
Thanks, @brooksph! I'll update the figure accordingly. We could also add more detail about the different file outputs from the read filtering workflow in the documentation/figure, if that would be helpful.
I didn't see metaquast.py in the code, so I thought we might not be using it. Glad we are!
@kternus the reference to quast is in the assembly workflow pull request https://github.com/dahak-metagenomics/dahak/pull/92/files. There's a lot there but you find it in the workflows/assembly/Snakefile. Thanks again for putting the workflows together.
Yeah, I think the part I got confused by was why we're using quast.py:
Instead of metaquast.py:
And I was also wondering if these metaquast steps were happening within our workflow?
"If you run MetaQUAST without providing reference genomes, the tool will try to identify genome content of the metagenome. MetaQUAST uses BLASTN for aligning contigs to SILVA 16S rRNA database, i.e. FASTA file containing small subunit ribosomal RNA sequences. For each assembly, 50 reference genomes with top scores are chosen. Maximum number of references to download can be specified with --max-ref-number. Reference genomes for the chosen genomes are downloaded from the NCBI database to
-http://quast.bioinf.spbau.ru/manual.html#sec2.5
Yes, I agree that it is not clear. It's my understanding that we want to enable the end user to calculate assembly statistics and quast is fine for that, whether the input is a genome or metagenome. The metaquast approach introduces additional filtering steps and I'm not sure that's we want; however, it may be exactly what we want to do but it is not something I've thought a lot about. Maybe we should benchmark the two approaches. Do you have any experience using Metaquast?
It looks like the primary function of MetaQUAST (compared to normal QUAST) is to align the contigs to multiple reference genomes to generate all of the reference-dependent statistics that QUAST produces. I started making a list of those reference-dependent statistics below, and we'll probably want to make a note of this in our documentation.
For the Dahak 1.0 release, I think we should recommend that the reference-dependent statistics in the QUAST and MultiQC reports not be used for interpretation purposes. I doubt we'll be able to remove those statistics entirely from the reports, but we can at least make a note in the documentation about what assembly statistics we recommend Dahak 1.0 users look at or don't look at in these reports.
I think it was very nice of the MetaQUAST developers to offer an option to identify a set of appropriate reference genomes for their users, since it is unlikely that everyone analyzing metagenome assembly statistics knows exactly what organisms are in their sample. In future releases of Dahak (maybe 2.0?), I think we could use the species confidently identified by our taxonomic classification workflow as the set of references that MetaQUAST uses for its reference alignment-based statistics. I think that will improve its accuracy, compared to the 16S workflow that the MetaQUAST developers are offering. That's my two cents. :)
@stephenturner is voting for figures with simple lines, so here's an updated version of the assembly workflow figure:
I also updated the comparison figure to include assembled contigs as a possible input type, which is something @brooksph requested:
Much more professional looking IMHO. Better data:ink ratio, less clutter. Nice work.
Hey @charlesreid1 and @brooksph , is this an accurate figure for the assembly workflow? This figure assumes that we’re using the MetaQUAST capability within QUAST and not using PANDAseq, and those assumptions may be wrong. I wasn't sure!
Here are some additional notes that I found helpful, and we might want to include them in the dahak assembly workflow documentation (if this info isn't there already):