dahak-metagenomics / dahak

benchmarking and containerization of tools for analysis of complex non-clinical metagenomes.
https://dahak-metagenomics.github.io/dahak
BSD 3-Clause "New" or "Revised" License
21 stars 4 forks source link

Assembly workflow questions #97

Closed kternus closed 6 years ago

kternus commented 6 years ago

Hey @charlesreid1 and @brooksph , is this an accurate figure for the assembly workflow? This figure assumes that we’re using the MetaQUAST capability within QUAST and not using PANDAseq, and those assumptions may be wrong. I wasn't sure!

assembly

Here are some additional notes that I found helpful, and we might want to include them in the dahak assembly workflow documentation (if this info isn't there already):

brooksph commented 6 years ago

Hi @kternus, yes this is correct. We plan to stick with these versions of the tools for the 1.0 release because we've verified that they work. Let's keep this issue open as a reminder to put that in the documentation.

kternus commented 6 years ago

Sounds good, @brooksph. Did you want to include khmer in the assembly workflow to remove any single-end reads that don't have pairs before running through metaSPAdes?

Here's a version with that:

_mondavi_illos_assembly_2018-06-27_1318

kternus commented 6 years ago

It looks like we're not using MetaQUAST, and we are using MultiQC:

assembly_v4

_mondavi_illos_assembly_2018-06-29_1120

brooksph commented 6 years ago

@kternus following the read filtering step unpaired reads are put into a separate file so we don't need to use khmer for that. We're using metaquast to generate assembly statistics and then compiling a visualizing the reports with multiQC. If we remove the khmer step and the reference to single or paired-end reads from your diagram we're good to go.

kternus commented 6 years ago

Thanks, @brooksph! I'll update the figure accordingly. We could also add more detail about the different file outputs from the read filtering workflow in the documentation/figure, if that would be helpful.

I didn't see metaquast.py in the code, so I thought we might not be using it. Glad we are!

kternus commented 6 years ago

assembly

brooksph commented 6 years ago

@kternus the reference to quast is in the assembly workflow pull request https://github.com/dahak-metagenomics/dahak/pull/92/files. There's a lot there but you find it in the workflows/assembly/Snakefile. Thanks again for putting the workflows together.

kternus commented 6 years ago

Yeah, I think the part I got confused by was why we're using quast.py:

Instead of metaquast.py:

And I was also wondering if these metaquast steps were happening within our workflow?

"If you run MetaQUAST without providing reference genomes, the tool will try to identify genome content of the metagenome. MetaQUAST uses BLASTN for aligning contigs to SILVA 16S rRNA database, i.e. FASTA file containing small subunit ribosomal RNA sequences. For each assembly, 50 reference genomes with top scores are chosen. Maximum number of references to download can be specified with --max-ref-number. Reference genomes for the chosen genomes are downloaded from the NCBI database to /quast_downloaded_references/. After that, MetaQUAST runs quast.py on all of them and removes reference genomes with low genome fraction (less than 10%) and proceeds the usual MetaQUAST analysis with the remaining references."
-http://quast.bioinf.spbau.ru/manual.html#sec2.5

brooksph commented 6 years ago

Yes, I agree that it is not clear. It's my understanding that we want to enable the end user to calculate assembly statistics and quast is fine for that, whether the input is a genome or metagenome. The metaquast approach introduces additional filtering steps and I'm not sure that's we want; however, it may be exactly what we want to do but it is not something I've thought a lot about. Maybe we should benchmark the two approaches. Do you have any experience using Metaquast?

kternus commented 6 years ago

It looks like the primary function of MetaQUAST (compared to normal QUAST) is to align the contigs to multiple reference genomes to generate all of the reference-dependent statistics that QUAST produces. I started making a list of those reference-dependent statistics below, and we'll probably want to make a note of this in our documentation.

For the Dahak 1.0 release, I think we should recommend that the reference-dependent statistics in the QUAST and MultiQC reports not be used for interpretation purposes. I doubt we'll be able to remove those statistics entirely from the reports, but we can at least make a note in the documentation about what assembly statistics we recommend Dahak 1.0 users look at or don't look at in these reports.

I think it was very nice of the MetaQUAST developers to offer an option to identify a set of appropriate reference genomes for their users, since it is unlikely that everyone analyzing metagenome assembly statistics knows exactly what organisms are in their sample. In future releases of Dahak (maybe 2.0?), I think we could use the species confidently identified by our taxonomic classification workflow as the set of references that MetaQUAST uses for its reference alignment-based statistics. I think that will improve its accuracy, compared to the 16S workflow that the MetaQUAST developers are offering. That's my two cents. :)

kternus commented 6 years ago

@stephenturner is voting for figures with simple lines, so here's an updated version of the assembly workflow figure:

assembly

I also updated the comparison figure to include assembled contigs as a possible input type, which is something @brooksph requested:

comparison

stephenturner commented 6 years ago

Much more professional looking IMHO. Better data:ink ratio, less clutter. Nice work.