ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

List of assemblers -- please contribute! #71

Closed rcedgar closed 4 years ago

rcedgar commented 4 years ago

I started a list of assemblers that might be useful here:

doc/assemblers.md

Please edit / send me updates if you have ideas.

hussius commented 4 years ago

Since I don't seem to have push access to this repo, I'll add some suggestions as a comment below this one. Just to give a bit of context - my comments and recommendations stem from a project about assembling RNA and DNA virus genomes from mixed samples (host + virus) that I was involved in around 2015-2016. In terms of the tools mentioned in the review paper which is linked from the doc/assemblers.md document, I've tried more than half. (Specifically, CLC, IDBA-UD, Megahit, MetaVelvet, Mira, SPAdes, Velvet, Vicuna; I've also used Ray Meta but only for bacterial metagenome assembly, where it worked well). The ones I ended up using the most were SPAdes (which is also highlighted in the paper as a consistently well-performing tool, although they used the SPAdes meta variety whereas I just used SPAdes) and IDBA-UD (which the review paper authors also seem to like). I also quite liked Megahit, which has low memory requirements, but did not end up using that simply because it was not appreciably better in terms of assembly quality than SPAdes and IDBA-UD, which were already in our pipeline.

All just my two cents of course!

hussius commented 4 years ago

SPAdes
Type: De-novo genomic / transcriptomic / metagenomic (different varieties exist: rnaSPAdes, SPAdes meta, etc.)
Code: https://github.com/ablab/spades
Paper: doi: 10.1089/cmb.2012.0021
Comments: Well-supported and generally robust assembler. SPAdes meta was highlighted in the review article at the top of the document ("Choice of assembly software has a critical impact on virome characterisation") as performing "consistently well".

Megahit
Type: De-novo genomic / metagenomic
Code: https://github.com/voutcn/megahit
Paper: doi: 10.1093/bioinformatics/btv033
Comments: Very memory-efficient.

IDBA
Type: De-novo metagenomic
Code: https://github.com/loneknightpy/idba
Paper: doi: 10.1093/bioinformatics/bts174
Comments: Anecdotally (i.e. in my own experience) works well for viral genome assembly. Also positively reviewed in the review paper above.

hussius commented 4 years ago

@rcedgar Do you know what hardware you will be trying to run the assembly on? (If AWS instances, what size RAM etc.)

rcedgar commented 4 years ago

@hussius We can use any instance type we need, smaller and cheaper are preferred but Amazon are donating credits and we have a lot of flexibility. If you have expertise here, would you be up for doing some testing ASAP? The key question is whether an existing de novo assembler can handle a dataset under these conditions:

(+) Host reads are a large majority, virus is a small fraction.

(+) Host is a species where we don't have a finished genome, an obscure bat or whatever.

(+) Virus is only recognized from a fragment, so we cannot use known virus genomes as a positive filter. (Edit added by RCE).

Even if the assembler generates virus contigs, how do we filter out the host? If the virus is closely related to a known strain, this may be straightforward, but otherwise it could be challenging.

charlescongxu commented 4 years ago

The host species is typically annotated, no? You can map to a similar enough host species since divergence with viruses will be large.
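A toy sketch of the idea above, assuming exact k-mer sharing as a stand-in for real read mapping (a production workflow would use a read aligner against the related host genome instead). The function names, `k`, and `max_host_fraction` are hypothetical and chosen only for illustration. Reads that share most of their k-mers with the host sequence, on either strand, are dropped.

```python
# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """Return the set of all k-mers in a DNA string (uppercased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def filter_host_reads(reads, host_seq, k=15, max_host_fraction=0.5):
    """Keep reads whose fraction of host-matching k-mers is below the threshold.

    Assumption: exact k-mer matching approximates mapping to a related host.
    """
    # Index both strands of the host so read orientation does not matter.
    host_kmers = kmers(host_seq, k) | kmers(reverse_complement(host_seq), k)
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: cannot be judged, so drop it
        host_fraction = len(read_kmers & host_kmers) / len(read_kmers)
        if host_fraction < max_host_fraction:
            kept.append(read)
    return kept
```

With a divergent host reference, a smaller `k` tolerates more mismatches at the cost of more chance matches; where to set that trade-off is not something this sketch answers.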

rcedgar commented 4 years ago

Show us how! @ababaian and @rcedgar are overextended and don't have time to try this ourselves right now.

I will offer a US$250 Amazon gift certificate to the first person to implement an open-source method that creates de novo contigs from GroupA: SRR10951654-655 and GroupB: SRR10829951-958. Contigs must be validated against a close known virus reference. Virus reference genomes must not be used as a positive filter before assembly; the key challenge is how to assemble when only a small fragment is recognized.

Offer expires in a week -- contigs and method documentation must be posted before 12pm Pacific time Sunday May 17th.

RCE -- edit to clarify -- You must start from the full SRA dataset, not the BAM file generated by Serratus. In the datasets above, there is a close relative of a known virus, so the BAM files probably include almost all virus reads and very little host. This is easy. The situation we don't know how to handle is where we see reads hitting a short fragment, say one CDS, but not an entire genome. In that case, most of the virus reads will be missing from the Serratus BAM file. Host filtering is allowed (the above datasets have pig hosts, which are a good model for this situation), but filtering by the virus reference is not allowed.

hussius commented 4 years ago

I'll give it a try! The condition "(+) Host reads are a large majority, virus is a small fraction." is typical and was possible to overcome in my old project. But we'll see how the assemblers work on your suggested datasets!

charlescongxu commented 4 years ago

Let me know if you need help!

ababaian commented 4 years ago

Hello all interested parties,

I think we're starting to reach a critical mass of people regarding how to process assembly for Serratus. I'd like to propose we all get on a technical group call and we can begin to address how we want to tackle this and how best to divide the work among us.

I'd like to propose Friday morning (PST) which is Friday evening in Europe. Please submit this dudle poll with your availability and we'll convene with a clear plan of attack then.

ababaian commented 4 years ago

We will be meeting Friday 9AM PDT on Skype. Please DM me your Skype details on Slack if we haven't already had a chance to chat.

cmorganl commented 4 years ago

MATAM
Type: Reference-guided, metagenomic
Code: https://github.com/bonsai-team/matam
Paper: https://doi.org/10.1093/bioinformatics/btx644
Comments: Given the amount of data we're working with, and that the coronavirus genome is substantially larger than the ~1500 nucleotides of a 16S rRNA gene, I'm not sure how it will scale. But I'm very interested in testing it when the datasets become available. I'd also like to hear feedback if someone has already tried it.

taltman commented 4 years ago

Hi @rcedgar, are you caught up on incorporating these suggestions into the documentation? Or is there something more that we need here? Thanks!

rcedgar commented 4 years ago

@cmorganl "I'm very interested in testing it when the datasets become available." See #89

AndreaGuarracino commented 4 years ago

Shasta
Type: De novo assembly from Oxford Nanopore reads
Link to code: https://github.com/chanzuckerberg/shasta
Link to paper: https://www.nature.com/articles/s41587-020-0503-6
Comments: It works well, but it needs parameter tuning to do so.

hussius commented 4 years ago

@rcedgar I have written up a methods description for making de novo contigs from your two SRA accession groups here.

Briefly, for the high-coverage case (GroupB), Megahit was able to create a single contig (link) covering the presumably intended target genome. For the low-coverage case, the assembly is more fragmented (I think 37 contigs) so while the contigs (link) do span most of the presumable target genome, there are quite a few gaps.

ababaian commented 4 years ago

@hussius Can you scaffold that onto a genome for the low-coverage set and create a 'genome' with NNNN in between?

rcedgar commented 4 years ago

@hussius Congratulations! Just made it in time, or maybe a bit late... Send an email to robert@drive5.com and let me know which amazon you prefer (.com, .de, .es, .mx...) -- if that actually matters, not sure.

Would be great to see an alignment of your contigs, or even better scaffold, against my assembly (see #89).

taltman commented 4 years ago

https://sanger-pathogens.github.io/iva/

hussius commented 4 years ago

@ababaian Sure. If you are only talking about inserting NNNs between contigs that should be a fairly straightforward scripting task, but if you mean a more "serious" scaffolding, I'd happily take suggestions on good tools to use. I tried something called Medusa on these contigs and it was able to get the number of contigs down to three, but it also lost some sequence so the end product was a ~25 kb assembly.
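A minimal sketch of the simple variant described above, assuming the contigs have already been ordered and oriented by some other means (for example, by aligning them to a related reference). The function name, the gap length of 100 and the FASTA record name are hypothetical; the gap length says nothing about the true distance between contigs.

```python
def join_contigs_with_gaps(contigs, gap_length=100, line_width=60, name="scaffold_1"):
    """Concatenate ordered, oriented contigs with runs of N and return FASTA text.

    Assumption: the order and orientation of `contigs` are already correct.
    """
    gap = "N" * gap_length
    # Place one fixed-length run of N between each pair of adjacent contigs.
    sequence = gap.join(contig.upper() for contig in contigs)
    lines = [f">{name}"]
    # Wrap the sequence to a fixed line width, as is conventional for FASTA.
    lines.extend(sequence[i:i + line_width] for i in range(0, len(sequence), line_width))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Two toy contigs joined by a 4-base gap.
    print(join_contigs_with_gaps(["ACGT", "ggcc"], gap_length=4, line_width=6), end="")
```

This does no overlap detection or reordering, which is what a dedicated scaffolder adds.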

hussius commented 4 years ago

@ababaian OK, I've posted a tentative scaffolded genome for the low-coverage samples here. I would have uploaded into the serratus-public AWS S3, but it wouldn't let me create a new directory (bucket) with aws s3 mb and I wasn't sure where to put it if I couldn't make my own directory. I'm sure this scaffolded genome can be improved but I will leave it as it is for now. The best BLAST hit: (screenshot from 2020-05-17 17-21-15)

rcedgar commented 4 years ago

@hussius can you comment on suitability of your assembly approach for HPC, i.e. putting in a container and running it in the cloud?

hussius commented 4 years ago

@rcedgar The problem I see is that automating the host sequence removal could be hard, especially if the host doesn't have a good reference genome. Apart from that, the workflow is fairly lightweight and should be reasonably easy to containerize.

hussius commented 4 years ago

@rcedgar Here is an alignment between my assembly for the low-coverage pig virus vs. a FASTA file combining your reference based assemblies for SRR10951654 (called "Genome1" in the file) and SRR10951655 (called "Genome2" in the file): C2X4M578114-Alignment.txt I don't know if this alignment format is convenient; if not, you can BLAST your assemblies against my tentative scaffolded assembly. (My assembly is called "Chr0_RaGOO" because I used a program called RaGOO for scaffolding)

rcedgar commented 4 years ago

@hussius "automating the host sequence removal could be hard". Yes, exactly. Maybe I should have disallowed host filtering and potentially saved myself $250, but it seemed we were making very little progress on assembly so I thought solving an easier problem could get some momentum going. So, how do we tackle host filtering in general?

ababaian commented 4 years ago

> @ababaian OK, I've posted a tentative scaffolded genome for the low-coverage samples here. I would have uploaded into the serratus-public AWS S3, but it wouldn't let me create a new directory (bucket) with aws s3 mb and I wasn't sure where to put it if I couldn't make my own directory. I'm sure this scaffolded genome can be improved but I will leave it as it is for now.

How is this scaffolded? Did you use another reference genome as the backbone?

taltman commented 4 years ago

I've created a stub on the Serratus Assembly Wiki to capture this list of assemblers:

https://github.com/ababaian/serratus/wiki/Serratus-Assembly#list-of-assemblers-to-consider

Please help migrate this great content over there!

rcedgar commented 4 years ago

Sounds like duplicated effort to me. The wiki can link to the issue. Let's be pragmatic and not expend unnecessary effort conforming to a system, we need to focus on the "real" work as much as possible.

taltman commented 4 years ago

@rcedgar Regarding host removal, I am testing the use of Kraken2 for removing reads that are predicted to be prokaryotic or eukaryotic in origin. The hope is that will reduce the overhead tremendously. I'll post results on Slack.
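A rough sketch of the read-selection step, assuming Kraken2's default tab-separated per-read output (classification flag `C`/`U`, read ID, taxonomy ID, then further columns) with numeric taxonomy IDs. The function name and the `taxids_to_remove` set are hypothetical. Kraken2 assigns each read to a specific taxon rather than to a domain, so the caller would need to expand "prokaryotic or eukaryotic" into the full set of descendant taxonomy IDs beforehand; that expansion is not shown here.

```python
def read_ids_to_keep(kraken_lines, taxids_to_remove):
    """Return IDs of reads that are unclassified or not assigned to a removed taxon.

    Assumption: `kraken_lines` are lines of Kraken2's default per-read output,
    and `taxids_to_remove` is a set of taxonomy IDs given as strings.
    """
    kept = []
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        status, read_id, taxid = fields[0], fields[1], fields[2]
        # Unclassified reads are kept: a novel virus may well end up here.
        if status == "U" or taxid not in taxids_to_remove:
            kept.append(read_id)
    return kept


if __name__ == "__main__":
    example = [
        "C\tread1\t9823\t150\t9823:116",    # assigned to pig (Sus scrofa)
        "U\tread2\t0\t150\t0:116",          # unclassified
        "C\tread3\t11118\t150\t11118:116",  # assigned to Coronaviridae
    ]
    print(read_ids_to_keep(example, {"9823"}))
```

The kept IDs could then be used to subset the original FASTQ files before assembly.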

victorlin commented 4 years ago

List of assemblers has been copied from assemblers.md to the relevant wiki page.

ababaian commented 4 years ago

You're a gentleman and a scholar Victor.

sjackman commented 4 years ago

IVA mentioned above by @taltman is missing. It's in Homebrew.
https://github.com/ababaian/serratus/issues/71#issuecomment-629756417
https://github.com/sanger-pathogens/iva
https://github.com/brewsci/homebrew-bio/blob/master/Formula/iva.rb

sjackman commented 4 years ago

I've found Unicycler very effective for small genomes, especially when you want a usable GFA file for visualization with Bandage. Both tools are in Homebrew.
https://github.com/rrwick/Unicycler
https://github.com/brewsci/homebrew-bio/blob/master/Formula/unicycler.rb

sjackman commented 4 years ago

Shovill
https://github.com/tseemann/shovill
https://github.com/brewsci/homebrew-bio/blob/develop/Formula/shovill.rb
Used by: "Isolation and rapid sharing of the 2019 novel coronavirus (SARS-CoV-2) from the first patient diagnosed with COVID-19 in Australia"
https://onlinelibrary.wiley.com/doi/full/10.5694/mja2.50569
https://onlinelibrary.wiley.com/action/downloadSupplement?doi=10.5694%2Fmja2.50569&file=mja250569-sup-0001-Supinfo.pdf

taltman commented 4 years ago

@sjackman Thanks! We're planning on having a call to discuss our assembly plan, would you be available to join? Here's the link:

https://dudle.inf.tu-dresden.de/serratus001/

sjackman commented 4 years ago

Sorry I didn't see this question until now, and it looks like the meeting has already happened. I'm happy to chat more here on GitHub.

ababaian commented 4 years ago

@sjackman We're meeting tomorrow 9AM :P

sjackman commented 4 years ago

I can make that. Please share the Zoom (or whatever) link with me.

ababaian commented 4 years ago

@sjackman can you email me your skype id

sjackman commented 4 years ago

Reposted from https://github.com/ababaian/serratus/issues/86#issuecomment-632367060

Mash Screen seems relevant.
"Mash Screen: high-throughput sequence containment estimation for genome discovery"
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1841-x
See the paragraph "Novel virus assembly".

Mash Screen is implemented in C++ and is integrated into the existing Mash codebase as of v2.0.

https://github.com/marbl/Mash
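To illustrate what "containment" means here, a simplified sketch using exact k-mer sets. This is not how Mash Screen is implemented (Mash works on MinHash sketches rather than full k-mer sets), and the function names and `k` are hypothetical. It considers the forward strand only and reports the fraction of a reference's distinct k-mers that appear anywhere in a set of reads.

```python
def kmer_set(seq, k):
    """Return the set of all k-mers in a DNA string (uppercased, forward strand)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(reference, reads, k=21):
    """Fraction of the reference's distinct k-mers found in at least one read.

    Assumption: exact k-mer sets stand in for the MinHash sketch Mash uses.
    """
    ref_kmers = kmer_set(reference, k)
    if not ref_kmers:
        return 0.0  # reference shorter than k
    seen = set()
    for read in reads:
        # Only reference k-mers matter; other read content is ignored.
        seen |= kmer_set(read, k) & ref_kmers
    return len(seen) / len(ref_kmers)
```

A high containment score for a reference fragment suggests the dataset holds reads from something close to it, even when the rest of the genome is unknown.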

rchikhi commented 4 years ago

As per issue https://github.com/ababaian/serratus/issues/130, a benchmark of some of the assemblers is available at https://github.com/ababaian/serratus/wiki/Assembly-benchmark-results-for-8-coronavirus-candidates-datasets. Please let me know if you'd like it to be updated with your favorite method.

ababaian commented 4 years ago

Good to close for now?

rcedgar commented 4 years ago

Yes.