ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

The Assembly Benchmark #130

Closed ababaian closed 3 years ago

ababaian commented 4 years ago

These are the SRA accessions which form our 'representative' set. Based on the assemblies / contigs generated by each method, we will decide on an implementation to use on June 3rd. This means we should have assemblies for these done by June 1st at the latest, so that we can compare results.

Delivery

Each completed benchmark should include as much of this information as possible. All data should be uploaded to s3://serratus-public/notebook/200526_assembly/<your_initials>/

The Dataset

To download the raw fq files and the cov3m-aligned bam files (output of Serratus):

fq data: aws s3 cp --recursive s3://serratus-public/notebook/200526_assembly/fq/ ./
bam data: aws s3 cp --recursive s3://serratus-public/notebook/200526_assembly/bam/ ./ (includes the cov3m reference sequence)

High Coverage Porcine Epidemic Virus in Sus Scrofa

Low Coverage Infectious Bronchitis Virus in Sus Scrofa

Experimental SARS/MERS infection of Vero Cell Line

Frank - Unidentified Bat Alphacoronavirus

Ginger - Unidentified Feline Coronavirus

rcedgar commented 4 years ago

Excellent! Please deposit fastq, bam and sam if possible, thanks.

rcedgar commented 4 years ago

Suggest designating an S3 directory structure for contigs, something like contigs/method/SRA.fa e.g. contigs/rce/SRR7287110.fa so we can all find & review.
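A minimal local sketch of that layout, with the accession files and the final upload command purely illustrative (the paths mirror the contigs/method/SRA.fa suggestion):

```shell
# Build the proposed contigs/<method>/<SRA>.fa tree locally.
mkdir -p contigs/rce contigs/megahit

# Stand-in assemblies for two accessions (contents are placeholders).
printf '>SRR7287110_contig1\nACGT\n' > contigs/rce/SRR7287110.fa
printf '>SRR10829953_contig1\nACGT\n' > contigs/megahit/SRR10829953.fa

# Everyone could then sync the whole tree in one command, e.g.:
# aws s3 cp --recursive contigs/ s3://serratus-public/notebook/200526_assembly/contigs/

ls contigs/*/*.fa
```

One directory level per method keeps the per-accession filenames identical across methods, so reviewing the same SRA across assemblers is a simple glob.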

rchikhi commented 4 years ago

Just so we're not all trying the same things: I'm running metaviralSpades on all those datasets.

rcedgar commented 4 years ago

This is probably clear to all, but I wanted to emphasize just in case: The goal is to implement a fully automated containerized workflow. IMO it's fine to cut corners to do the benchmark test if that makes things easier, but manual curation steps, e.g. to separate host from virus, are out of bounds unless it's clear how to automate them. There should be a clear and short path from the method used to make the contigs to something we can run in the cloud.

taltman commented 4 years ago

@rchikhi Rats, you took mine! I guess I'll try to marshal some forces to do some metagenomic assembly voodoo instead. @ababaian, is there a low-coverage CoV hit that comes from an experiment with multiple samples that you might recommend?

ababaian commented 4 years ago

The sus scrofa IBV runs SRR10951656/SRR10951660 are part of such a series; I took two examples, but you can try the entire BioProject for this.

taltman commented 4 years ago

@ababaian Thanks! That looks perfect.

@rchikhi Are you planning on using Kraken2 or some other tool for trying to pare down the number of host reads? Or do you plan to assemble them all, and then identify contigs of interest? Or both? :-)

rchikhi commented 4 years ago

I'm doing an assembly of all reads, and then was planning to use CheckV (or others) to identify relevant contigs. This might not be the best strategy, but I'll just explore this route.

asl commented 4 years ago

So, in terms of SPAdes, we believe that just SPAdes should work more or less straight out of the box. Probably single cell mode should be used to ensure that no assumptions about uniform coverage are made (though this mode also adds some code to clean up some MDA artifacts).

The reality is that the proper solution should be somewhere in the middle of metaSPAdes and metaviralSPAdes :)

rchikhi commented 4 years ago

I actually ended up running much more than MetaViralSPAdes. Here is a benchmark of MetaViralSPAdes, Megahit, and Minia (with several parameter settings). An up-to-date version will be kept at: https://github.com/ababaian/serratus/wiki/Assembly-benchmark-results-for-8-coronavirus-candidates-datasets

Assembly benchmark results for 8 coronavirus candidates datasets

See https://github.com/ababaian/serratus/issues/130 for the motivation and description of the datasets.

Viral contigs are available in s3://serratus-public/notebook/200526_assembly/RC/contigs.

A collection of scripts that were used to produce this benchmark is in s3://serratus-public/notebook/200526_assembly/RC/scripts/.

Benchmark setup

Reads were given as-is to each assembler (no quality/adapter trimming). I ran the assemblers with default parameters and gave the contigs to CheckV. I then reported any contig hit to a genome in checkv_genbank.tsv that matches the regexp [Cc]orona.

MetaViralSPAdes did not detect anything in its regular output, so I used K*/before_chromosome_removal.fasta (the unfiltered final assembly), and I couldn't run it on single-end reads (SRR1168901.fastq.gz, SRR10951660.fastq.gz, SRR10951656.fastq.gz).
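The reporting step above can be reproduced as a toy sketch: filter CheckV genome hits by the [Cc]orona regexp. The file below is fabricated and its columns are illustrative, not the real checkv_genbank.tsv layout:

```shell
# Fabricated stand-in for checkv_genbank.tsv (accession <TAB> genome name).
printf 'GCA_900205315.1\tPorcine coronavirus\n'  >  checkv_genbank.tsv
printf 'GCA_000001405.1\tHomo sapiens\n'         >> checkv_genbank.tsv
printf 'GCA_000862965.1\tCoronavirus HKU15\n'    >> checkv_genbank.tsv

# Keep only hits whose genome name matches the regexp used in the benchmark.
grep '[Cc]orona' checkv_genbank.tsv > corona_hits.tsv
wc -l < corona_hits.tsv
```

With the three fabricated rows above, two survive the filter; the host hit is dropped.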

Results

SRR10829953 : ~180K reads to KT323979.1

| Method | Detected virus | Contig length | CheckV estim. completeness | CheckV AA avg ID% |
| --- | --- | --- | --- | --- |
| Megahit | GCA_900205315.1 | 28530 | 101.4 | 97.55 |
| Megahit | GCA_000913415.1 | 583 | 2.0 | 62.7 |
| Minia-pipeline | GCA_900205315.1 | 24266 | 86.3 | 97.72 |
| Minia-pipeline | GCA_900205315.1 | 2650 | 9.5 | 96.26 |
| Minia k71 | GCA_900205315.1 | 21295 | 75.7 | 97.88 |
| Minia k71 | GCA_900205315.1 | 2553 | 9.1 | 98.2 |
| Minia k31 | GCA_900205315.1 | 24280 | 86.4 | 97.72 |
| Minia k31 | GCA_900205315.1 | 1040 | 3.7 | 98.4 |
| Minia k31 | GCA_900205315.1 | 1003 | 3.6 | 93.0 |
| MetaViralSPAdes | GCA_900205315.1 | 18984 | 67.5 | 97.51 |
| MetaViralSPAdes | GCA_900205315.1 | 8808 | 31.4 | 96.9 |

SRR10829957 : ~195K reads to KP728470.1

| Method | Detected virus | Contig length | CheckV estim. completeness | CheckV AA avg ID% |
| --- | --- | --- | --- | --- |
| Megahit | GCA_900205315.1 | 1548 | 5.5 | 95.4 |
| Megahit | GCA_900205315.1 | 26185 | 93.1 | 97.65 |
| Minia-pipeline | GCA_900205315.1 | 24272 | 86.3 | 97.72 |
| Minia-pipeline | GCA_900205315.1 | 3650 | 13.0 | 96.42 |
| Minia k71 | GCA_900205315.1 | 24243 | 86.2 | 97.72 |
| Minia k71 | GCA_900205315.1 | 2363 | 8.4 | 98.2 |
| Minia k71 | GCA_900205315.1 | 805 | 2.9 | 98.5 |
| Minia k31 | GCA_900205315.1 | 17109 | 60.9 | 98.19 |
| Minia k31 | GCA_900205315.1 | 5041 | 17.9 | 95.0 |
| Minia k31 | GCA_900205315.1 | 1451 | 5.2 | 98.1 |
| MetaViralSPAdes | GCA_900205315.1 | 28060 | 99.8 | 97.55 |

SRR10951656 : ~4x coverage to MH878976.1

| Method | Detected virus | Contig length | CheckV estim. completeness | CheckV AA avg ID% |
| --- | --- | --- | --- | --- |
| Megahit | GCA_000880055.1 | 460 | 1.7 | 93.8 |
| Megahit | GCA_000880055.1 | 569 | 2.1 | 95.2 |
| Megahit | GCA_000880055.1 | 790 | 2.9 | 87.4 |
| Minia-pipeline | GCA_000880055.1 | 206 | 0.7 | 84.4 |
| Minia k31 | GCA_000880055.1 | 203 | 0.7 | 92.2 |
| Minia k31 | GCA_000862965.1 | 343 | 1.2 | 96.5 |
| Minia k31 | GCA_000880055.1 | 255 | 0.9 | 85.0 |

SRR10951660 : ~1x coverage to MH878976.1

| Method | Detected virus | Contig length | CheckV estim. completeness | CheckV AA avg ID% |
| --- | --- | --- | --- | --- |
| Megahit | GCA_000862965.1 | 346 | 1.3 | 96.5 |
| Megahit | GCA_000880055.1 | 634 | 2.3 | 95.7 |

SRR1194066 : ~16K read coverage to KF600647.1

| Method | Detected virus | Contig length | CheckV estim. completeness | CheckV AA avg ID% |
| --- | --- | --- | --- | --- |
| Megahit | GCA_000901155.1 | 7601 | 25.2 | 99.18 |
| Minia-pipeline | GCA_000901155.1 | 6973 | 23.2 | 99.1 |
| Minia k71 | GCA_000901155.1 | 2502 | 8.3 | 100.0 |
| Minia k71 | GCA_000901155.1 | 760 | 2.5 | 100.0 |
| Minia k71 | GCA_000901155.1 | 285 | 1.0 | 100.0 |
| Minia k31 | GCA_000901155.1 | 5691 | 19.0 | 100.0 |
| MetaViralSPAdes | GCA_000901155.1 | 3180 | 10.6 | 100.0 |
| MetaViralSPAdes | GCA_000901155.1 | 2952 | 9.8 | 100.0 |

ERR2756788 : Frank, ~8K mapped coverage, closest hit is fragment EU769558.1

| Method | Detected virus | Contig length | CheckV estim. completeness | CheckV AA avg ID% |
| --- | --- | --- | --- | --- |
| Megahit | GCA_000872845.1 | 1998 | 6.3 | 32.2 |
| Megahit | GCA_003972065.1 | 29219 | 105.2 | 55.71 |
| Minia-pipeline | GCA_001503155.1 | 26412 | 95.1 | 56.18 |
| Minia-pipeline | GCA_003972065.1 | 2778 | 9.7 | 47.6 |
| Minia-pipeline | GCA_000872845.1 | 1801 | 5.6 | 32.2 |
| Minia k71 | GCA_000899495.1 | 7104 | 25.3 | 48.61 |
| Minia k71 | GCA_000899495.1 | 5044 | 17.7 | 80.3 |
| Minia k31 | GCA_003972065.1 | 28908 | 104.1 | 55.71 |
| Minia k31 | GCA_000872845.1 | 798 | 2.5 | 32.2 |
| MetaViralSPAdes | GCA_000899495.1 | 26406 | 95.1 | 55.75 |
| MetaViralSPAdes | GCA_003972065.1 | 1337 | 4.7 | 46.9 |
| MetaViralSPAdes | GCA_000872845.1 | 1185 | 3.7 | 32.2 |

SRR7287110 : Ginger, ~46k mapped coverage to various feline CoVs, closest hit is MN165107.1

| Method | Detected virus | Contig length | CheckV estim. completeness | CheckV AA avg ID% |
| --- | --- | --- | --- | --- |
| Megahit | GCA_000856025.1 | 26905 | 95.1 | 85.3 |
| Megahit | GCA_000856025.1 | 1995 | 6.9 | 90.89 |
| Megahit | GCA_000856025.1 | 2598 | 9.0 | 91.14 |
| Megahit | GCA_000870985.1 | 3972 | 14.4 | 37.0 |
| Minia-pipeline | GCA_000856025.1 | 11269 | 39.5 | 79.61 |
| Minia-pipeline | GCA_000856025.1 | 6024 | 20.9 | 91.23 |
| Minia-pipeline | GCA_000856025.1 | 1058 | 3.7 | 68.1 |
| Minia-pipeline | GCA_000856025.1 | 2288 | 7.9 | 90.61 |
| Minia-pipeline | GCA_000856025.1 | 2291 | 7.9 | 91.14 |
| Minia k71 | GCA_000856025.1 | 10075 | 35.0 | 78.89 |
| Minia k71 | GCA_000856025.1 | 1000 | 3.4 | 89.4 |
| Minia k71 | GCA_000856025.1 | 843 | 3.0 | 90.8 |
| Minia k71 | GCA_000856025.1 | 846 | 3.0 | 90.8 |
| Minia k71 | GCA_000856025.1 | 666 | 2.3 | 97.8 |
| Minia k71 | GCA_000856025.1 | 666 | 2.3 | 94.6 |
| Minia k31 | GCA_000856025.1 | 4902 | 17.2 | 96.1 |
| Minia k31 | GCA_000856025.1 | 1891 | 6.5 | 92.4 |
| Minia k31 | GCA_000856025.1 | 1368 | 4.8 | 87.08 |
| Minia k31 | GCA_000856025.1 | 832 | 3.0 | 95.5 |
| MetaViralSPAdes | GCA_000856025.1 | 307 | 1.1 | 68.1 |
| MetaViralSPAdes | GCA_000856025.1 | 307 | 1.1 | 72.1 |
| MetaViralSPAdes | GCA_001504755.1 | 2879 | 10.4 | 36.0 |
| MetaViralSPAdes | GCA_000856025.1 | 1064 | 3.6 | 91.4 |
| MetaViralSPAdes | GCA_000856025.1 | 2201 | 7.6 | 91.8 |
| MetaViralSPAdes | GCA_000856025.1 | 1217 | 4.2 | 92.1 |
| MetaViralSPAdes | GCA_000856025.1 | 899 | 3.1 | 88.4 |
| MetaViralSPAdes | GCA_000856025.1 | 8785 | 30.5 | 95.9 |
| MetaViralSPAdes | GCA_000856025.1 | 858 | 3.0 | 95.5 |
| MetaViralSPAdes | GCA_000856025.1 | 9000 | 31.4 | 91.3 |

SRR1168901

Nothing was found, apparently.

Performance

On 4 threads, dataset SRR10829957 (4 GB compressed)

| Method | Time | Memory (GB) |
| --- | --- | --- |
| Megahit | 2h59m | 6.8 |
| Minia-pipeline | 2h | 3.8 |
| Minia k61 | 24m | 1.7 |
| Minia k31 | 34m | 1.7 |
| MetaViralSPAdes | 18h21m | 21 |

Some thoughts

There are 3 types of datasets:

1. (easy-medium instances) those on which the virus assembles into 1-3 contigs with any assembler, typically one large contig covering >90% of the genome
2. (hard instances) those where the coverage is too low and reference-based assembly will be needed
3. (assembler-critical instances) those on which only some assemblers do well with default parameters

Clearly, types 1 and 2 are "easy" in the sense that we can run any assembler and get pretty much the same result quality. I was wondering how many datasets were of type 3, which makes the choice of method harder; it seems only SRR7287110 is. Of course I am biased, but if all datasets were of types 1 and 2, then Minia k31 would be an energy-saving way to triage datasets between type 1 and type 2.
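The triage idea could be sketched as follows: after a cheap Minia k31 run, classify the dataset by its longest contig. Everything here is illustrative, the FASTA is fabricated and the 20 kbp cutoff is an assumption, not a value from the thread:

```shell
# Fabricated "Minia k31" output: one short contig and one 25 kbp contig.
printf '>c1\nACGTACGT\n>c2\n%s\n' "$(head -c 25000 /dev/zero | tr '\0' 'A')" > minia_k31.fa

# Longest contig length (sequence may span multiple lines in general).
longest=$(awk '/^>/{if(len>max)max=len; len=0; next}
               {len+=length($0)}
               END{if(len>max)max=len; print max}' minia_k31.fa)

# Hypothetical cutoff: a near-complete CoV contig suggests type 1,
# otherwise route the dataset to reference-based / co-assembly (type 2).
if [ "$longest" -gt 20000 ]; then
  echo "type 1: longest contig $longest bp"
else
  echo "type 2: longest contig $longest bp"
fi
```

The point of the cheap pass is only routing: type-1 datasets are done, and everything else gets the more expensive treatment.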

rcedgar commented 4 years ago

PRICE: elapsed times are ~1hr with 4 threads. Longest contigs:

Frank ERR2756788: 8504
Ginger SRR7287110: 15691
Pedv1 SRR10829953: 23559
Pedv2 SRR10829957: 27949
Lowcvg SRR10951656: 238 (total of 628 contigs)

Fully automated, no attempt to tune, no metadata needed except for paired vs. unpaired and read length.

taltman commented 4 years ago

I'm checking some odd annotation results in our co-assembly. @ababaian , can you help me understand which version (i.e., generated on which date) of which flavor of reference DB (e.g., cov0, cov2, cov3m, etc.) was used for each of the runs referenced in the benchmark above?

ababaian commented 4 years ago

A little bit outdated by this point, but this is all from the zoonotic reservoir dataset, which began on 200505. A small subset of that data is aligned to cov2r (Pan-Coronavirus only), and the majority is aligned to cov2m, which is flom1, a subset of vertebrate viruses. Essentially, as we were learning about this dataset we went through major changes of our reference genome; it will all have to be repeated soon anyway. I can do this on the weekend.

taltman commented 4 years ago

For completeness, here are the results from the metagenomic co-assembly test:

https://github.com/ababaian/serratus/wiki/Metagenomic-Co-Assembly-Proof-of-Concept

rcedgar commented 4 years ago

Can you add "Conclusions"? Is this something you would recommend including in production, or not? Seems like a good idea, but maybe didn't work so well in practice?

asl commented 4 years ago

I believe the key point of the co-assembly approach there was the removal of host reads. Otherwise the size of the data would be enormous.

rcedgar commented 4 years ago

Maybe I misunderstood, I thought co-assembly was combining reads from different SRAs which probably have the same virus. Naively, this would make sense when coverage is low. I thought they used checkv as the final step to separate out the host reads?

taltman commented 4 years ago

@rcedgar I don't know why you say it doesn't work so well in practice. We recovered a full-length IBV CoV genome from the "low abundance" IBV study, whereas other assemblers recovered only a fraction of that, 790 bp at most (see @rchikhi's results above). I think it worked very well, as I described on the call.

Yes, it takes ~4x as long as the Minia assembler, so I don't recommend it for everything. I recommend it for all studies where there are multiple samples and the Minia assembler approach fails to recover a full-length CoV genome. And the Hallam Lab has graciously volunteered a significant amount of compute to do that for the Serratus Project, so AWS cost budgeting is not an issue. I just need a list of the SRAs where we didn't recover a full sequence. There's no reason not to do this.

rcedgar commented 4 years ago

Sorry for the misunderstanding -- it was a question, not a statement, I was hoping you could summarize the conclusions. "I recommend it for all studies where there are multiple samples and the Minia assembler approach fails to recover a full-length CoV genome" is exactly what I was looking for, thanks!

taltman commented 4 years ago

Correction: @rchikhi 's results show that 790 bp were recovered, not 5kbp. Corrected above.

taltman commented 4 years ago

Sorry if I misread your comment. I'll make sure to mirror the conclusions that I've written here to the Wiki report. Please leave this open until that documentation is complete, feel free to reassign to me.

taltman commented 4 years ago

@asl is correct: we filter reads up-front to remove host reads using BMTagger. We will probably move to Kraken in the future. We used CheckV at the end to check the completeness of the recovered contigs.
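A toy stand-in for that up-front host filter, assuming the classifier (BMTagger or Kraken) has already emitted the list of host read IDs; the FASTQ, the IDs, and the file names below are all fabricated:

```shell
# Fabricated 3-read FASTQ (4 lines per record) and a host-ID list that a
# classifier such as BMTagger/Kraken2 would normally produce.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n@r3\nGGGG\n+\nIIII\n' > sample.fastq
printf 'r2\n' > host_ids.txt   # read classified as host

# Drop every 4-line record whose ID appears in host_ids.txt.
awk 'NR==FNR{host[$1]=1; next}
     FNR%4==1{id=substr($1,2); keep=!(id in host)}
     keep' host_ids.txt sample.fastq > filtered.fastq

grep -c '^@r' filtered.fastq
```

Only the non-host reads reach the assembler, which is what keeps the co-assembly input size manageable.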

rcedgar commented 4 years ago

Got it thanks. And general apologies to all for not RTFM'ing everything, I'm skimming a lot in an attempt to maximize my own productivity.

asl commented 4 years ago

For the sake of completeness, SPAdes recovered 25 kbp of contigs from a single run (though the max contig was only 2 kbp there), and we obtained 27 kbp from 2 SRAs. Here is the QUAST report:

Assembly                    corona_pe2
# contigs (>= 0 bp)         13
# contigs (>= 1000 bp)      10
# contigs (>= 5000 bp)      0
# contigs (>= 10000 bp)     0
# contigs (>= 25000 bp)     0
# contigs (>= 50000 bp)     0
Total length (>= 0 bp)      27648
Total length (>= 1000 bp)   26277
Total length (>= 5000 bp)   0
Total length (>= 10000 bp)  0
Total length (>= 25000 bp)  0
Total length (>= 50000 bp)  0
# contigs                   11
Largest contig              4836
Total length                26987
GC (%)                      38.31
N50                         2824
N75                         1657
L50                         4
L75                         7
# N's per 100 kbp           0.00
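For reference, the N50 that QUAST reports (length of the contig at which the cumulative length of the sorted contigs passes half the total) can be recomputed from contig lengths alone. The lengths below are illustrative, not the actual assembly:

```shell
# Fabricated contig lengths, one per line.
printf '%s\n' 4836 2824 2900 1657 3000 2000 2500 1500 1200 1000 900 > lengths.txt

total=$(awk '{s+=$1} END{print s}' lengths.txt)

# Walk the lengths in decreasing order until the cumulative sum reaches
# half of the total; that contig's length is the N50.
n50=$(sort -nr lengths.txt | awk -v half=$((total / 2)) '{c+=$1; if (c>=half) {print $1; exit}}')

echo "N50 = $n50"
```

L50 and L75 follow the same recipe, counting contigs instead of reporting a length.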

taltman commented 4 years ago

Thanks @asl. Did the metaviralSPAdes tool declare all of those contigs to be viral in origin?

asl commented 4 years ago

This was not a metaviralSPAdes run as it is unsuitable for RNA viruses. But I will check with viralVerify.

asl commented 4 years ago

@taltman viralVerify classified the majority of these contigs as viral. Also, the full assembly shows many other viral contigs as well; I'm cross-checking with CheckV.

asl commented 4 years ago

Yes, there is, for example:

NODE_2_length_15157_cov_8.893458

which is likely the same as (judging from the length and # of genes)

k119_5808       MH996950.1      11176   Avian avulavirus 1      15205   15147   15138   99.908  99      27237   0.0

as reported in co-assembly results

rchikhi commented 4 years ago

I've added coronaSPAdes results to https://github.com/ababaian/serratus/wiki/Assembly-benchmark-results-for-8-coronavirus-candidates-datasets

rchikhi commented 4 years ago

commentary: looks really good, especially on Ginger; the best method so far.

For some reason, in combination with CheckV it didn't return anything on the very-low-coverage datasets, but in some cases gene_clusters.fasta contained a few hundred bases of sequence. Maybe a lack of sensitivity from CheckV.

asl commented 4 years ago

@rchikhi The assemblies are fragmented on low-coverage datasets, so we don't have strong matches for HMM-based approaches. And CheckV has some thresholds as well (I believe it won't report anything useful for contigs < 4 kbp).

rchikhi commented 4 years ago

Sounds fair! Well, CheckV did manage to get hits on small (200-300 bp) Megahit/Minia contigs. But this is a minor issue anyway, one that will be resolved by co-assembly.