Closed ababaian closed 3 years ago
Excellent! Please deposit fastq, bam and sam if possible, thanks.
Suggest designating an S3 directory structure for contigs, something like contigs/method/SRA.fa e.g. contigs/rce/SRR7287110.fa so we can all find & review.
Just so we're not all trying the same things: I'm running metaviralSpades on all those datasets.
This is probably clear to all, but I wanted to emphasize just in case: The goal is to implement a fully automated containerized workflow. IMO it's fine to cut corners to do the benchmark test if that makes things easier, but manual curation steps, e.g. to separate host from virus, are out of bounds unless it's clear how to automate them. There should be a clear and short path from the method used to make the contigs to something we can run in the cloud.
@rchikhi Rats, you took mine! I guess I'll try to marshal some forces to try to do some metagenomic assembly voo-doo instead. @ababaian , is there a low-coverage CoV hit that comes from an experiment with multiple samples, that you might recommend?
The sus scrofa
series with IBV SRR10951656/SRR10951660 are part of that series, I took two examples but you can try that entire BioProject for this.
@ababaian Thanks! That looks perfect.
@rchikhi Are you planning on using Kraken2 or some other tool for trying to pare down the number of host reads? Or do you plan to assemble them all, and then identify contigs of interest? Or both? :-)
I'm doing an assembly of all, and then was planning to use CheckV (or others) to identify relevant contigs. This might not be the best strategy, but I'll just explore this route..
So, in terms of SPAdes, we believe that just SPAdes should work more or less straight out of the box. Probably single cell mode should be used to ensure that no assumptions about uniform coverage are made (though this mode also adds some code to clean up some MDA artifacts).
The reality is that the proper solution should be somewhere in the middle of metaSPAdes and metaviralSPAdes :)
I actually ended up running much more than MetaViralSPAdes. Here is a benchmark of MetaViralSpades, Megahit, Minia (with several parameters). An up-to-date version will be there: https://github.com/ababaian/serratus/wiki/Assembly-benchmark-results-for-8-coronavirus-candidates-datasets
See https://github.com/ababaian/serratus/issues/130 for the motivation and description of the datasets.
Viral contigs are available in s3://serratus-public/notebook/200526_assembly/RC/contigs
.
A collection of scripts that were used to produce this benchmark is in s3://serratus-public/notebook/200526_assembly/RC/scripts/
.
Reads were given as-is to each assembler (no quality/adapter trim). I ran the assemblers with default parameters and gave the contigs to CheckV. I then reported any contig hit for the genomes in checkv_genbank.tsv
that match the regexp [Cc]orona
.
MetaviralSpades did not detect anything in its regular output so I used K*/before_chromosome_removal.fasta
(the unfiltered final assembly), and I couldn't run it on single-end reads (SRR1168901.fastq.gz, SRR10951660.fastq.gz, SRR10951656.fastq.gz
).
SRR10829953
: ~180K reads to KT323979.1
Method | Detected virus | Contig length | CheckV estim. completeness | CheckV AA avg ID% |
---|---|---|---|---|
Megahit | GCA_900205315.1 | 28530 | 101.4 | 97.55 |
Megahit | GCA_000913415.1 | 583 | 2.0 | 62.7 |
Minia-pipeline | GCA_900205315.1 | 24266 | 86.3 | 97.72 |
Minia-pipeline | GCA_900205315.1 | 2650 | 9.5 | 96.26 |
Minia k71 | GCA_900205315.1 | 21295 | 75.7 | 97.88 |
Minia k71 | GCA_900205315.1 | 2553 | 9.1 | 98.2 |
Minia k31 | GCA_900205315.1 | 24280 | 86.4 | 97.72 |
Minia k31 | GCA_900205315.1 | 1040 | 3.7 | 98.4 |
Minia k31 | GCA_900205315.1 | 1003 | 3.6 | 93.0 |
MetaViralSPAdes | GCA_900205315.1 | 18984 | 67.5 | 97.51 |
MetaViralSPAdes | GCA_900205315.1 | 8808 | 31.4 | 96.9 |
SRR10829957
: ~195K reads to KP728470.1
Method | Detected virus | Contig length | CheckV estim. completeness | CheckV AA avg ID% |
---|---|---|---|---|
Megahit | GCA_900205315.1 | 1548 | 5.5 | 95.4 |
Megahit | GCA_900205315.1 | 26185 | 93.1 | 97.65 |
Minia-pipeline | GCA_900205315.1 | 24272 | 86.3 | 97.72 |
Minia-pipeline | GCA_900205315.1 | 3650 | 13.0 | 96.42 |
Minia k71 | GCA_900205315.1 | 24243 | 86.2 | 97.72 |
Minia k71 | GCA_900205315.1 | 2363 | 8.4 | 98.2 |
Minia k71 | GCA_900205315.1 | 805 | 2.9 | 98.5 |
Minia k31 | GCA_900205315.1 | 17109 | 60.9 | 98.19 |
Minia k31 | GCA_900205315.1 | 5041 | 17.9 | 95.0 |
Minia k31 | GCA_900205315.1 | 1451 | 5.2 | 98.1 |
MetaViralSPades | GCA_900205315.1 | 28060 | 99.8 | 97.55 |
SRR10951656
: ~4x coverage to MH878976.1
Method | Detected virus | Contig length | CheckV estim. completeness | CheckV AA avg ID% |
---|---|---|---|---|
Megahit | GCA_000880055.1 | 460 | 1.7 | 93.8 |
Megahit | GCA_000880055.1 | 569 | 2.1 | 95.2 |
Megahit | GCA_000880055.1 | 790 | 2.9 | 87.4 |
Minia-pipeline | GCA_000880055.1 | 206 | 0.7 | 84.4 |
Minia k31 | GCA_000880055.1 | 203 | 0.7 | 92.2 |
Minia k31 | GCA_000862965.1 | 343 | 1.2 | 96.5 |
Minia k31 | GCA_000880055.1 | 255 | 0.9 | 85.0 |
SRR10951660
: ~1x coverage to MH878976.1
Method | Detected virus | Contig length | CheckV estim. completeness | CheckV AA avg ID% |
---|---|---|---|---|
Megahit | GCA_000862965.1 | 346 | 1.3 | 96.5 |
Megahit | GCA_000880055.1 | 634 | 2.3 | 95.7 |
SRR1194066
: ~16K read coverage to KF600647.1
Method | Detected virus | Contig length | CheckV estim. completeness | CheckV AA avg ID% |
---|---|---|---|---|
Megahit | GCA_000901155.1 | 7601 | 25.2 | 99.18 |
Minia-pipeline | GCA_000901155.1 | 6973 | 23.2 | 99.1 |
Minia k71 | GCA_000901155.1 | 2502 | 8.3 | 100.0 |
Minia k71 | GCA_000901155.1 | 760 | 2.5 | 100.0 |
Minia k71 | GCA_000901155.1 | 285 | 1.0 | 100.0 |
Minia k31 | GCA_000901155.1 | 5691 | 19.0 | 100.0 |
MetaViralSPAdes | GCA_000901155.1 | 3180 | 10.6 | 100.0 |
MetaViralSPAdes | GCA_000901155.1 | 2952 | 9.8 | 100.0 |
ERR2756788
: Frank, ~8K mapped coverage, closest hit is fragment EU769558.1
Method | Detected virus | Contig length | CheckV estim. completeness | CheckV AA avg ID% |
---|---|---|---|---|
Megahit | GCA_000872845.1 | 1998 | 6.3 | 32.2 |
Megahit | GCA_003972065.1 | 29219 | 105.2 | 55.71 |
Minia-pipeline | GCA_001503155.1 | 26412 | 95.1 | 56.18 |
Minia-pipeline | GCA_003972065.1 | 2778 | 9.7 | 47.6 |
Minia-pipeline | GCA_000872845.1 | 1801 | 5.6 | 32.2 |
Minia k71 | GCA_000899495.1 | 7104 | 25.3 | 48.61 |
Minia k71 | GCA_000899495.1 | 5044 | 17.7 | 80.3 |
Minia k31 | GCA_003972065.1 | 28908 | 104.1 | 55.71 |
Minia k31 | GCA_000872845.1 | 798 | 2.5 | 32.2 |
MetaViralSPAdes | GCA_000899495.1 | 26406 | 95.1 | 55.75 |
MetaViralSPAdes | GCA_003972065.1 | 1337 | 4.7 | 46.9 |
MetaViralSPAdes | GCA_000872845.1 | 1185 | 3.7 | 32.2 |
SRR7287110
: Ginger, ~46k mapped coverage to various Feline Cov, cloest hit is MN165107.1
Method | Detected virus | Contig length | CheckV estim. completeness | CheckV AA avg ID% |
---|---|---|---|---|
Megahit | GCA_000856025.1 | 26905 | 95.1 | 85.3 |
Megahit | GCA_000856025.1 | 1995 | 6.9 | 90.89 |
Megahit | GCA_000856025.1 | 2598 | 9.0 | 91.14 |
Megahit | GCA_000870985.1 | 3972 | 14.4 | 37.0 |
Minia-pipeline | GCA_000856025.1 | 11269 | 39.5 | 79.61 |
Minia-pipeline | GCA_000856025.1 | 6024 | 20.9 | 91.23 |
Minia-pipeline | GCA_000856025.1 | 1058 | 3.7 | 68.1 |
Minia-pipeline | GCA_000856025.1 | 2288 | 7.9 | 90.61 |
Minia-pipeline | GCA_000856025.1 | 2291 | 7.9 | 91.14 |
Minia k71 | GCA_000856025.1 | 10075 | 35.0 | 78.89 |
Minia k71 | GCA_000856025.1 | 1000 | 3.4 | 89.4 |
Minia k71 | GCA_000856025.1 | 843 | 3.0 | 90.8 |
Minia k71 | GCA_000856025.1 | 846 | 3.0 | 90.8 |
Minia k71 | GCA_000856025.1 | 666 | 2.3 | 97.8 |
Minia k71 | GCA_000856025.1 | 666 | 2.3 | 94.6 |
Minia k31 | GCA_000856025.1 | 4902 | 17.2 | 96.1 |
Minia k31 | GCA_000856025.1 | 1891 | 6.5 | 92.4 |
Minia k31 | GCA_000856025.1 | 1368 | 4.8 | 87.08 |
Minia k31 | GCA_000856025.1 | 832 | 3.0 | 95.5 |
MetaViralSPAdes | GCA_000856025.1 | 307 | 1.1 | 68.1 |
MetaViralSPAdes | GCA_000856025.1 | 307 | 1.1 | 72.1 |
MetaViralSPAdes | GCA_001504755.1 | 2879 | 10.4 | 36.0 |
MetaViralSPAdes | GCA_000856025.1 | 1064 | 3.6 | 91.4 |
MetaViralSPAdes | GCA_000856025.1 | 2201 | 7.6 | 91.8 |
MetaViralSPAdes | GCA_000856025.1 | 1217 | 4.2 | 92.1 |
MetaViralSPAdes | GCA_000856025.1 | 899 | 3.1 | 88.4 |
MetaViralSPAdes | GCA_000856025.1 | 8785 | 30.5 | 95.9 |
MetaViralSPAdes | GCA_000856025.1 | 858 | 3.0 | 95.5 |
MetaViralSPAdes | GCA_000856025.1 | 9000 | 31.4 | 91.3 |
SRR1168901
Nothing was found, apparently.
On 4 threads, dataset SRR10829957
(4 GB compressed)
Method | Time | Memory (GB) |
---|---|---|
Megahit | 2h59m | 6.8 |
Minia-pipeline | 2h | 3.8 |
Minia k61 | 24m | 1.7 |
Minia k31 | 34m | 1.7 |
MetaviralSpades | 18h21m | 21 |
There are 3 types of datasets: 1) (easy-medium instances) those on which the virus assembles into 1-3 contigs with any assembler, typically one large contig covering >90% of the genome 2) (hard instances) and those where the coverage is too low and reference-based assembly will be needed. 3) (assembler-critical instances) on those, only some assemblers do well with default parameters
Clearly types 1 and 2 are "easy" in the sense that we can run any assembler and get pretty much the same result quality. I was wondering how many datasets were of type 3, which makes the choice of method harder. It seems only SRR7287110
is. Of course I am biased, but if all datasets were of type 1 and 2, then Minia k31 is an energy-saving way to triage datasets between those of type 1 and those of type 2.
PRICE: elapsed times are ~1hr with 4 threads. Longest contigs: Frank ERR2756788 8504 Ginger SRR7287110 15691 Pedv1 SRR10829953 23559 Pedv2 SRR10829957 27949 Lowcvg SRR10951656 238 (total of 628 contigs) Fully automated, no attempt to tune, no metadata needed except for paired vs. unpaired and read length
I'm checking some odd annotation results in our co-assembly. @ababaian , can you help me understand which version (i.e., generated on which date) of which flavor of reference DB (e.g., cov0, cov2, cov3m, etc.) was used for each of the runs referenced in the benchmark above?
A little bit outdated by this point but this is all from the zoonotic reservoir dataset which began on 200505. A small sub-set of that data is aligned to cov2r (only Pan-Coronavirus) and the majority of the data is aligned to cov2m
which is flom1, or a sub-set of vertebrate viruses. Essentially as we were learning about this data-set we were going through major changes of our reference genome, it will all have to be repeated soon anyways, I can do this on the weekend.
For completeness, here are the results from the metagenomic co-assembly test:
https://github.com/ababaian/serratus/wiki/Metagenomic-Co-Assembly-Proof-of-Concept
Can you add "Conclusions"? Is this something you would recommend including in production, or not? Seems like a good idea, but maybe didn't work so well in practice?
I believe the key point in co-assembly approach there was the removal of host reads. Otherwise the size of the data would be enormous
Maybe I misunderstood, I thought co-assembly was combining reads from different SRAs which probably have the same virus. Naively, this would make sense when coverage is low. I thought they used checkv as the final step to separate out the host reads?
@rcedgar I don't know why you say it doesn't work so well in practice. We recovered a full-length IBV CoV genome from the "low abundance" IBV study, whereas using other assemblers a fraction of that at 790 bp max was recovered (see @rchikhi 's results above) . I think it worked very well, as I described on the call.
Yes, it takes ~4x as long as the Minia assembler, so I don't recommend it for everything. I recommend it for all studies where there are multiple samples and the Minia assembler approach fails to recover a full-length CoV genome. And the Hallam Lab has graciously volunteered a significant amount of compute to do that for the Serratus Project, so AWS cost budgeting is not an issue. I just need a list of the SRAs where we didn't recover a full sequence. There's no reason not to do this.
Sorry for the misunderstanding -- it was a question, not a statement, I was hoping you could summarize the conclusions. "I recommend it for all studies where there are multiple samples and the Minia assembler approach fails to recover a full-length CoV genome" is exactly what I was looking for, thanks!
Correction: @rchikhi 's results show that 790 bp were recovered, not 5kbp. Corrected above.
Sorry if I misread your comment. I'll make sure to mirror the conclusions that I've written here to the Wiki report. Please leave this open until that documentation is complete, feel free to reassign to me.
@asl Is correct, we filter reads up-front to remove host reads using BMTagger. We will probably move to Kraken in the future. We used CheckV at the end to check for completeness of the recovered contigs.
Got it thanks. And general apologies to all for not RTFM'ing everything, I'm skimming a lot in an attempt to maximize my own productivity.
For the sake of completeness, SPAdes recovered 25 kbp of contigs from single run (though, max contig was only 2 kbp there) and we obtained 27 kbp from 2 SRAs. Here is the QUAST report:
Assembly corona_pe2
# contigs (>= 0 bp) 13
# contigs (>= 1000 bp) 10
# contigs (>= 5000 bp) 0
# contigs (>= 10000 bp) 0
# contigs (>= 25000 bp) 0
# contigs (>= 50000 bp) 0
Total length (>= 0 bp) 27648
Total length (>= 1000 bp) 26277
Total length (>= 5000 bp) 0
Total length (>= 10000 bp) 0
Total length (>= 25000 bp) 0
Total length (>= 50000 bp) 0
# contigs 11
Largest contig 4836
Total length 26987
GC (%) 38.31
N50 2824
N75 1657
L50 4
L75 7
# N's per 100 kbp 0.00
Thanks @asl. Did the metaviralSPAdes tools declare all of those contigs to be viral in origin?
This was not a metaviralSPAdes run as it is unsuitable for RNA viruses. But I will check with viralVerify.
@taltman viralVerify classified majority of these contigs as viral. Also, in the full assembly it shows many other viral contigs as well, I'm cross-checking with checkv.
Yes, there is, for example:
NODE_2_length_15157_cov_8.893458
which is likely the same as (judging from the length and # of genes)
k119_5808 MH996950.1 11176 Avian avulavirus 1 15205 15147 15138 99.908 99 27237 0.0
as reported in co-assembly results
I've added coronaSPAdes results to https://github.com/ababaian/serratus/wiki/Assembly-benchmark-results-for-8-coronavirus-candidates-datasets
commentary: looks real good, especially on Ginger, the best method so far.
For some reason, in combination with CheckV it didn't return anything on the very-low-coverage datasets, but in some cases the gene_clusters.fasta
contained a few hundreds of sequences bases. Maybe lack of sensitivity from CheckV.
@rchikhi The assemblies are fragmented on low coverage datasets. So we don't have strong matches for HMM-based approaches. And checkv had some thresholds as well (I believe it won't report anything useful to contigs < 4 kbp)
sounds fair! well, CheckV did manage to get hits on small (200-300bp) Megahit/minia contigs. But this is anyway a minor issue, that will be resolved by coassembly.
These are the SRA accessions which are our 'representative' set. Based on the assemblies / contigs generated by each method we will decide on an implementation to use on June 3rd. Which means we should have assemblies for these done by June 1st the latest such that we can compare results.
Delivery
Completion of each benchmark should contain as much of this information as possible. All data should be uploaded to ~
s3://serratus-public/notebook/200526_assembly/<your_initials>/
~/notebook.md
/.ipynb
/.make
: A Rough script/procedure for exactly how the assembly was made (QC, software/versions etc...)time
on an 8-core ~64 GB RAM machine (be sure to start count from post-unzip)~/contigs/
Viral contig sequence output~/metrics/
Any assembly metrics associated with outputThe Dataset
To download the raw fq files and the
cov3m
aligned bam files (output of Serratus).fq data:
aws s3 cp --recursive s3://serratus-public/notebook/200526_assembly/fq/ ./
bam data:aws s3 cp --recursive s3://serratus-public/notebook/200526_assembly/bam/ ./
(includes cov3m reference sequence)High Coverage Porcine Epidemic Virus in Sus Scrofa
SRR10829953
: ~180K reads to KT323979.1SRR10829957
: ~195K reads to KP728470.1Low Coverage Infectious Bronchitis Virus in Sus Scrofa
SRR10951656
: ~4x coverage to MH878976.1SRR10951660
: ~1x coverage to MH878976.1Experimental SARS/MERS infection of Vero Cell Line
SRR1194066
: ~16K read coverage to KF600647.1SRR1168901
:~2.6K reads low coverage to KF600647.1Frank - Unidentified Bat Alphacoronavirus
ERR2756788
: ~8K mapped coverage, closest hit is fragment EU769558.1Ginger - Unidentified Feline Coronavirus
SRR7287110
: ~46k mapped coverage to various Feline Cov, cloest hit is MN165107.1I will download all of these accessions and host them in the S3 bucket so we can access them quickly and not have to go through the SRA each time. Give me an hour or two.