Assembly workflow questions

kternus commented 6 years ago

Hey @charlesreid1 and @brooksph , is this an accurate figure for the assembly workflow? This figure assumes that we’re using the MetaQUAST capability within QUAST and not using PANDAseq, and those assumptions may be wrong. I wasn't sure!

assembly

Here are some additional notes that I found helpful, and we might want to include them in the dahak assembly workflow documentation (if this info isn't there already):

When running SPAdes and MEGAHIT, --12 indicates a file with interleaved paired-end reads.
Running metaSPAdes requires paired-end short reads, with or without a mix of Pac Bio or ONT long reads. Analysis of short, single-end reads is not currently supported by metaSPAdes.
It looks like dahak is using older versions of SPAdes, MEGAHIT, QUAST, and other tools. I'm not sure how significant that is for all of the tools, but it looks like SPAdes made improvements to their metagenome assembly capability (including the hybrid option) between versions 3.10.1 and 3.12.0. So if we’re sticking with the older version of SPAdes, then we’ll probably want to include a statement in the dahak 1.0 documentation about how future versions of dahak will integrate an updated version of SPAdes and more thoroughly test the hybrid assembly capability.

brooksph commented 6 years ago

Hi @kternus, yes this is correct. We plan to stick with these versions of the tools for the 1.0 release because we've verified that they work. Let's keep this issue open as a reminder to put that in the documentation.

kternus commented 6 years ago

Sounds good, @brooksph. Did you want to include khmer in the assembly workflow to remove any single-end reads that don't have pairs before running through metaSPAdes?

Here's a version with that:

_mondavi_illos_assembly_2018-06-27_1318

kternus commented 6 years ago

It looks like we're not using MetaQUAST, and we are using MultiQC:

assembly_v4

_mondavi_illos_assembly_2018-06-29_1120

brooksph commented 6 years ago

@kternus following the read filtering step unpaired reads are put into a separate file so we don't need to use khmer for that. We're using metaquast to generate assembly statistics and then compiling a visualizing the reports with multiQC. If we remove the khmer step and the reference to single or paired-end reads from your diagram we're good to go.

kternus commented 6 years ago

Thanks, @brooksph! I'll update the figure accordingly. We could also add more detail about the different file outputs from the read filtering workflow in the documentation/figure, if that would be helpful.

I didn't see metaquast.py in the code, so I thought we might not be using it. Glad we are!

kternus commented 6 years ago

assembly

brooksph commented 6 years ago

@kternus the reference to quast is in the assembly workflow pull request https://github.com/dahak-metagenomics/dahak/pull/92/files. There's a lot there but you find it in the workflows/assembly/Snakefile. Thanks again for putting the workflows together.

kternus commented 6 years ago

Yeah, I think the part I got confused by was why we're using quast.py:

docker run -v /${PWD}:/data -it quay.io/biocontainers/quast:4.5--boost1.61_1 \
quast.py /data/${file} -o /data/${base}.megahit_quast_report

Instead of metaquast.py:

docker run -v /${PWD}:/data -it quay.io/biocontainers/quast:4.5--boost1.61_1 \
metaquast.py /data/${file} -o /data/${base}.megahit_quast_report

And I was also wondering if these metaquast steps were happening within our workflow?

"If you run MetaQUAST without providing reference genomes, the tool will try to identify genome content of the metagenome. MetaQUAST uses BLASTN for aligning contigs to SILVA 16S rRNA database, i.e. FASTA file containing small subunit ribosomal RNA sequences. For each assembly, 50 reference genomes with top scores are chosen. Maximum number of references to download can be specified with --max-ref-number. Reference genomes for the chosen genomes are downloaded from the NCBI database to /quast_downloaded_references/. After that, MetaQUAST runs quast.py on all of them and removes reference genomes with low genome fraction (less than 10%) and proceeds the usual MetaQUAST analysis with the remaining references."
-http://quast.bioinf.spbau.ru/manual.html#sec2.5

brooksph commented 6 years ago

Yes, I agree that it is not clear. It's my understanding that we want to enable the end user to calculate assembly statistics and quast is fine for that, whether the input is a genome or metagenome. The metaquast approach introduces additional filtering steps and I'm not sure that's we want; however, it may be exactly what we want to do but it is not something I've thought a lot about. Maybe we should benchmark the two approaches. Do you have any experience using Metaquast?

kternus commented 6 years ago

It looks like the primary function of MetaQUAST (compared to normal QUAST) is to align the contigs to multiple reference genomes to generate all of the reference-dependent statistics that QUAST produces. I started making a list of those reference-dependent statistics below, and we'll probably want to make a note of this in our documentation.

For the Dahak 1.0 release, I think we should recommend that the reference-dependent statistics in the QUAST and MultiQC reports not be used for interpretation purposes. I doubt we'll be able to remove those statistics entirely from the reports, but we can at least make a note in the documentation about what assembly statistics we recommend Dahak 1.0 users look at or don't look at in these reports.

I think it was very nice of the MetaQUAST developers to offer an option to identify a set of appropriate reference genomes for their users, since it is unlikely that everyone analyzing metagenome assembly statistics knows exactly what organisms are in their sample. In future releases of Dahak (maybe 2.0?), I think we could use the species confidently identified by our taxonomic classification workflow as the set of references that MetaQUAST uses for its reference alignment-based statistics. I think that will improve its accuracy, compared to the 16S workflow that the MetaQUAST developers are offering. That's my two cents. :)

Reference length = total number of bases in the reference genome
Reference GC (%) = percentage of G and C nucleotides in the reference genome
NG50 = length for which the collection of all contigs of that length or longer covers at least half the reference genome. (This metric is computed only if the reference genome is provided.)
# misassemblies = the number of positions in the contigs (breakpoints) that satisfy one of the following criteria: the left flanking sequence aligns over 1 kbp away from the right flanking sequence on the reference; flanking sequences overlap on more than 1 kbp; flanking sequences align to different strands or different chromosomes; flanking sequences align on different reference genomes (MetaQUAST only).
other metrics relating to misassemblies
metrics relating to unaligned contigs
Genome fraction (%) = percentage of aligned bases in the reference genome. A base in the reference genome is aligned if there is at least one contig with at least one alignment to this base.
Duplication ratio = total number of aligned bases in the assembly divided by the total number of aligned bases in the reference genome
# mismatches per 100 kbp = the average number of mismatches per 100000 aligned bases. True SNPs and sequencing errors are not distinguished and are counted equally.
# indels per 100 kbp = the average number of indels per 100000 aligned bases. Several consecutive single nucleotide indels are counted as one indel.
Total aligned length = total number of aligned bases in the assembly. A value is usually smaller than a value of total length because some of the contigs may be unaligned or partially unaligned.
Largest alignment = length of the largest continuous alignment in the assembly. A value can be smaller than a value of largest contig if the largest contig is misassembled or partially unaligned.
NA50, NGA50, NA75, NGA75, LA50, LA75, LGA50, LGA75 ("A" stands for "aligned") = similar to the corresponding metrics without "A", but in this case aligned blocks instead of contigs are considered. Aligned blocks are obtained by breaking contigs at misassembly events and removing all unaligned bases.

kternus commented 6 years ago

@stephenturner is voting for figures with simple lines, so here's an updated version of the assembly workflow figure:

assembly

I also updated the comparison figure to include assembled contigs as a possible input type, which is something @brooksph requested:

comparison

stephenturner commented 6 years ago

Much more professional looking IMHO. Better data:ink ratio, less clutter. Nice work.

dahak-metagenomics / dahak

Assembly workflow questions #97