ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

State of Assembly #86

Closed ababaian closed 4 years ago

ababaian commented 4 years ago

So here's the general outline of the two approaches and the drawbacks of each for us to discuss at Friday's Assembly meeting. I think the approach will be for us to try variations, see how they perform objectively and we can decide the winning strategy in the end.

This is only meant as a starting point for discussion such that we can separate out the ways this can be done.

Outline Diagram

serratus_assembly_discuss

1) Reference or Targeted Assembly

Using the serratus output BAM files, which contain the set of reads that map to the pan-genome along with their unmapped pairs, we assemble known CoV hits into contigs.

a) Consensus Sequence Method.

This can be done by re-aligning all "hits" to their closest known complete genome as a scaffold. Any regions which are not covered are converted to NNNNs, and variants in the reads (VCF?) are applied to the reference sequence. The VCF and assembled FASTA files are delivered as output.
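A minimal sketch of the consensus idea, assuming per-base read depth and a set of SNP calls have already been extracted from the re-alignment. The function name `build_consensus`, the `min_depth` parameter and the input layout are hypothetical, and indels are ignored to keep it short.

```python
def build_consensus(reference, depth, snps, min_depth=1):
    """Apply SNPs to a reference and mask uncovered positions with N.

    reference: reference genome as a string
    depth:     per-base read depth, one integer per reference base
    snps:      dict mapping 0-based position -> alternate base
    min_depth: positions with lower depth are masked (assumed cutoff)
    """
    if len(reference) != len(depth):
        raise ValueError("depth must have one value per reference base")
    consensus = []
    for pos, ref_base in enumerate(reference):
        if depth[pos] < min_depth:
            consensus.append("N")  # no read support at this position
        else:
            # Use the called variant if there is one, else keep the reference base.
            consensus.append(snps.get(pos, ref_base))
    return "".join(consensus)
```

For example, `build_consensus("ACGTACGT", [3, 3, 0, 0, 5, 5, 5, 1], {1: "T", 5: "G"})` gives `"ATNNAGGT"`: two SNPs applied and the two uncovered bases masked.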

b) Assembly Method.

De novo assembly is performed using all the reads in the serratus output BAM, ignoring the alignment. These contigs are then BLASTed against all known sequences. The output is the assembled contigs and the BLAST hits for each contig.
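As a rough illustration of the "BLAST hits for each contig" output, here is a sketch that reads BLAST tabular output (`-outfmt 6` with the default twelve columns) and keeps the highest-bitscore hit per contig. The function name `best_hits` is made up for this example, and nothing here is taken from the serratus code.

```python
import csv

# Default column order of BLAST tabular output (-outfmt 6).
BLAST6_FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle):
    """Return {contig_id: row} holding the highest-bitscore hit per contig.

    handle: an open text file (or any iterable of lines) in outfmt 6.
    Each row is a dict of strings keyed by BLAST6_FIELDS.
    """
    best = {}
    reader = csv.DictReader(handle, fieldnames=BLAST6_FIELDS, delimiter="\t")
    for row in reader:
        score = float(row["bitscore"])
        current = best.get(row["qseqid"])
        if current is None or score > float(current["bitscore"]):
            best[row["qseqid"]] = row
    return best
```

Contigs with no hit simply do not appear in the returned dictionary, so they can be reported separately.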

2) De novo assembly

The serratus output BAM files are used to prioritize "hit" libraries; you then return to the initial SRA files (all reads) and perform de novo assembly. This will create many non-viral (host or other organism) contigs. All output contigs will need to be classified as either CoV (/viral) or non-viral. We can retain all contigs since they are built already, but CoV contigs are prioritized. The output is CoV contigs, other contigs, and BLAST hit information for all contigs.
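The classification step could look something like the sketch below, assuming each contig's best hit has already been reduced to a `(subject_id, evalue)` pair. The name `classify_contigs`, the `cov_subjects` set and the `1e-5` cutoff are illustrative assumptions, not an agreed threshold.

```python
def classify_contigs(contig_ids, hits, cov_subjects, max_evalue=1e-5):
    """Split contigs into (cov, other) lists.

    contig_ids:   iterable of all contig names from the assembly
    hits:         dict contig_id -> (subject_id, evalue) for the best hit;
                  contigs without any hit are simply absent
    cov_subjects: set of subject IDs known to be coronavirus sequences
    max_evalue:   assumed significance cutoff for calling a contig CoV
    """
    cov, other = [], []
    for contig_id in contig_ids:
        hit = hits.get(contig_id)
        if hit is not None and hit[0] in cov_subjects and hit[1] <= max_evalue:
            cov.append(contig_id)
        else:
            # Host, other organisms, and contigs with no hit are all retained.
            other.append(contig_id)
    return cov, other
```

Both lists are kept, matching the idea that non-CoV contigs are retained but de-prioritized.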


Delivery / Implementation

Regardless of implementation, the final output will be a single-container workflow (and thus can be ported to any HPC). The most likely strategy for implementation in AWS will be to use either 1) Fargate deployment of a container or 2) AWS Lambda/Batch. In either case we will create an automated "Launch", such that any time the core Serratus workflow identifies a potential hit, we can automate the start of a fully independent Assembly pipeline. All output data will go into the same CoV Database under its own folder (i.e. s3://serratus.io/data/contig/SRRXXX.fa , s3://serratus.io/data/assembly/SRRXXXX.blasthit, ...)
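To make the folder layout concrete, here is a small sketch that builds the two example output locations for a given run accession. The helper name `output_uris` and the dictionary keys are hypothetical; only the two path patterns come from the description above.

```python
import re


def output_uris(accession, bucket="serratus.io"):
    """Return the output locations for one SRA run accession.

    Only the contig and BLAST-hit paths named in the issue are covered;
    any other outputs would need their own entries.
    """
    # SRA run accessions start with SRR, ERR or DRR followed by digits.
    if not re.fullmatch(r"[SED]RR\d+", accession):
        raise ValueError(f"not a run accession: {accession!r}")
    return {
        "contigs": f"s3://{bucket}/data/contig/{accession}.fa",
        "blast_hits": f"s3://{bucket}/data/assembly/{accession}.blasthit",
    }
```

Keeping the path logic in one place means the "Launch" step and the assembly container would agree on where results land.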

Cost/CPU use is not a factor. We anticipate that we will need to perform this at high quality only a few thousand times.

Related issues

rcedgar commented 4 years ago

image

JustinChu commented 4 years ago

Mockup of a potential pipeline for a targeted approach that tries to

Serratus Potential Targeted Assembly pipeline

This is just a rough mockup of a targeted de novo assembly pipeline that goes back to the original reads to find more reads for assembly.

A full de novo assembly of the genome is not only costly, but may actually decrease the quality of the desired assembled contigs by interfering with them (altered global/local coverage assumptions, false connections, etc.). There is also a higher chance of chimeric contigs the more host DNA is present during assembly. However, assembling the whole dataset may be the most straightforward approach to begin with, as there are many well-written metagenomic or transcriptomic assemblers that may work well on most datasets.

There does seem to be some host DNA in the BAM files I've been assembling (SRR1082995X, mostly Sus scrofa), and negative filters might be important, especially when scaffolding. I've found the metadata of the datasets to be helpful, but we could consider running a metagenomic classifier to find the most obvious host sequence to filter out.

The scaffolding using closest genome is not necessary but should be considered if a de novo approach is used.

ababaian commented 4 years ago

There's some non-CoV sequence (such as host genes) in the pan-genome; we haven't completed a full "blacklist" yet. There's also plasmid RNA in a bunch of libraries that maps to other sequences. It's an iterative process to clean this up.

ekg commented 4 years ago

There's also plasmid RNA in a bunch of libraries too that map to other sequences.

I've also seen this, especially in Illumina data.

sjackman commented 4 years ago

Mash Screen seems relevant.

Mash Screen: high-throughput sequence containment estimation for genome discovery
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1841-x
See the paragraph "Novel virus assembly".

Mash Screen is implemented in C++ and is integrated into the existing Mash codebase as of v2.0.
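To illustrate what "containment" means here, a toy sketch follows: the fraction of a reference's k-mers that appear anywhere in a set of reads. This is an exact, in-memory version written only for explanation. Mash Screen itself works on MinHash sketches rather than full k-mer sets, and this sketch ignores reverse complements. The function names and `k=21` default are assumptions.

```python
def kmers(seq, k):
    """Return the set of all k-mers in seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(reference, reads, k=21):
    """Fraction of distinct reference k-mers observed in at least one read."""
    ref_kmers = kmers(reference, k)
    if not ref_kmers:
        return 0.0
    found = set()
    for read in reads:
        # Only k-mers shared with the reference matter for containment.
        found |= kmers(read, k) & ref_kmers
    return len(found) / len(ref_kmers)
```

A value near 1.0 suggests the reference (or something very close to it) is present in the reads; a low but non-zero value could point to a diverged relative.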

https://github.com/marbl/Mash

sjackman commented 4 years ago

My suggestion…

  1. Screen all SRA RNA-Seq datasets to identify datasets that contain known coronavirus k-mers @rchikhi REINDEER does this? https://covid19seqsearch.pasteur.cloud
  2. Assemble those matching datasets de novo using an off-the-shelf RNA-Seq assembler intended for human RNA-Seq assembly
  3. Identify contigs that are likely coronavirus by mapping all of the contigs to a reference of known complete coronavirus sequences (perhaps using minimap2, Diamond, or some other aligner)
  4. If the contig is full length, done! If not, perhaps reference-based scaffolding and possibly gap filling
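For step 4, one way to decide between "done" and "needs scaffolding" is to measure how much of the closest reference is covered by contig alignments. The sketch below merges alignment intervals on the reference and compares the covered fraction to a cutoff. The function names and the 0.95 cutoff are assumptions for illustration only.

```python
def covered_fraction(intervals, ref_length):
    """Fraction of the reference covered by the union of the intervals.

    intervals: iterable of half-open (start, end) reference coordinates
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Gap before this interval: close the previous merged block.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the current merged block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / ref_length


def next_step(intervals, ref_length, full_length_fraction=0.95):
    """Return "done" if coverage meets the assumed cutoff, else "scaffold"."""
    if covered_fraction(intervals, ref_length) >= full_length_fraction:
        return "done"
    return "scaffold"
```

For a single contig, pass just its one alignment interval; for several contigs, pass all of them to see whether scaffolding could in principle span the genome.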

rcedgar commented 4 years ago

@sjackman thanks for the comments!

  1. We are using bowtie2 to align to known Cov nt sequences. Much more sensitive, this is working well.
  2. Makes sense, which one would you recommend?

  3. We're already using bowtie2, which is more sensitive than minimap2. This will miss more diverged Covs, hence my suggestion to use translated blast. Diamond looks like a good blast alternative if it turns out we need to optimize compute time.

  4. Right, but how exactly?
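For readers unfamiliar with the translated-BLAST suggestion above: a translated search compares the six-frame protein translation of a nucleotide sequence against protein references, which tolerates more nucleotide divergence than a nucleotide-to-nucleotide alignment. The sketch below only shows the six-frame translation using the standard genetic code; real tools do this internally, and the function names here are made up for illustration.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; codons with ambiguous bases become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Return translations keyed by frame: +1, +2, +3, -1, -2, -3."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames
```

For example, `six_frames("ATGGCCTAA")["+1"]` is `"MA*"`, where `*` marks the stop codon.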

sjackman commented 4 years ago

  2. Makes sense, which one would you recommend?

RNA-Seq assembly is not my area of expertise, so I'll defer to someone else here. rnaSPAdes (http://cab.spbu.ru/software/rnaspades/) comes to mind. The RNA-Seq assembler from my previous lab is RNA-Bloom (https://github.com/bcgsc/RNA-Bloom)

  3. This will miss more diverged Covs, hence my suggestion to use translated blast.

For translated BLAST you may consider DIAMOND BLASTX (http://www.diamondsearch.org)

  4. Right, but how exactly?

I haven't done a lot of reference guided scaffolding. The tool from my previous lab for this purpose is ntJoin (https://github.com/bcgsc/ntjoin). The paper compares to Ragout and Ragoo. https://doi.org/10.1093/bioinformatics/btaa253

ekg commented 4 years ago

I would suggest mmseqs2 for protein to translated nucleotide searches.


taltman commented 4 years ago

@sjackman The SPAdes team specifically recommended against rnaSPAdes for this scenario: https://github.com/ablab/spades/issues/516

taltman commented 4 years ago

@sjackman I think we don't need to reinvent the wheel for your point #3. I think metaviralSPAdes and CheckV have already laid out some of the groundwork. We just might need to tweak it for CoVs:
https://www.biorxiv.org/content/10.1101/2020.05.06.081778v1
https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa490/5837667

ababaian commented 4 years ago

The Benchmark Data for Testing ( #130 ) is now available. I think we've hit the limit of theorycrafting and it's time to see how each of these approaches function when the rubber hits the road.

rcedgar commented 4 years ago

Superseded by Assembler Benchmark #130? Can we close this?