ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

Search for recombination events? #168

Open rcedgar opened 4 years ago

rcedgar commented 4 years ago

I believe recombination events can be detected by sliding a window along a genome (or contig) and finding the most similar known genomes for each window. Discontinuities in this list of top hits and their identities indicate a recombination. I think we should implement something like this for all known genomes and for new assemblies. I could possibly implement such a tool if needed. I haven't checked for existing tools that might be able to do this; if someone could look into this and add comments here, that would be great.
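A minimal sketch of the sliding-window idea, for illustration only. It assumes the query and the reference genomes are already rows of one multiple alignment (same length, `-` for gaps), which sidesteps the search step; the function names, window size, and step are hypothetical choices, not part of any existing Serratus tool.

```python
def window_identity(a, b):
    """Fraction of matching columns, skipping columns with a gap in either row."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def top_hits(query, refs, window=200, step=50):
    """Return one (window_start, best_reference_name, identity) tuple per window."""
    hits = []
    last_start = max(len(query) - window, 0)
    for start in range(0, last_start + 1, step):
        q = query[start:start + window]
        # Score every reference over the same alignment columns, keep the best.
        name, ident = max(
            ((n, window_identity(q, r[start:start + window])) for n, r in refs.items()),
            key=lambda t: t[1],
        )
        hits.append((start, name, ident))
    return hits


def breakpoints(hits):
    """Return (window_start, previous_top_hit, new_top_hit) where the top hit changes."""
    return [
        (cur[0], prev[1], cur[1])
        for prev, cur in zip(hits, hits[1:])
        if prev[1] != cur[1]
    ]
```

A change in the top hit is only a candidate; in practice one would also want a minimum identity gap and some smoothing before calling a recombination.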

ababaian commented 4 years ago

I'm 100% on board with doing this analysis. We'll have a nice dataset to do this on.

taltman commented 4 years ago

Kraken2 can do this, pretty sure!

On June 21, 2020 8:46:07 AM PDT, Artem Babaian notifications@github.com wrote:

I'm 100% on board with doing this analysis. We'll have a nice data-set to do this on.


ababaian commented 4 years ago

We'll probably want something closer to Figure 2 in the Pangolin paper.

rcedgar commented 4 years ago

@taltman Can you point me at the relevant documentation? As I understand it, the Kraken2 index doesn't store coordinates of k-mers, only taxonomy ids, so this would need to be a special feature somewhere.

Edit -- oh hang on, I see, you're suggesting we could run Kraken2 separately on each window. That might work.

rcedgar commented 4 years ago

@ababaian As I understand it, that figure was made by the sliding window method followed by manual (i.e. visual) analysis to identify the discontinuities. That's fine for a single genome, but not amenable to high-throughput. We could show one or two examples like that should we be successful in implementing a method.

ababaian commented 4 years ago

Well, we can use the inflection points between two lines to predict recombination windows, right? :) If we can do it by eye, we can teach a computer to do it at high throughput.
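One way to read "inflection points between two lines" is as the windows where two per-window identity curves cross. A rough sketch of that reading follows; the function name and the `min_gap` threshold are hypothetical, and the two input lists are assumed to hold identities for the same windows against two candidate parent genomes.

```python
def crossovers(ident_a, ident_b, min_gap=0.0):
    """Return window indices where the better-matching reference switches.

    Windows where the two identities differ by no more than min_gap are
    treated as too close to call and are skipped.
    """
    points = []
    prev_sign = 0
    for i, (a, b) in enumerate(zip(ident_a, ident_b)):
        diff = a - b
        if abs(diff) <= min_gap:
            continue
        sign = 1 if diff > 0 else -1
        if prev_sign and sign != prev_sign:
            points.append(i)
        prev_sign = sign
    return points
```

For example, `crossovers([0.95, 0.94, 0.80, 0.78], [0.85, 0.86, 0.92, 0.93], min_gap=0.02)` flags window index 2 as the candidate breakpoint.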

rcedgar commented 4 years ago

Yes, exactly -- the question was whether we/I need to write a new tool for this. If someone else would like to tackle this one, great!

taltman commented 4 years ago

Kraken reports the LCA of each k-mer along the length of the read / contig. I can help make a custom DB of coronavirus sequences.

See this:

https://github.com/DerrickWood/kraken2/wiki/Manual

A space-delimited list indicating the LCA mapping of each k-mer in the sequence(s). For example, "562:13 561:4 A:31 0:1 562:3" would indicate that:

the first 13 k-mers mapped to taxonomy ID #562

the next 4 k-mers mapped to taxonomy ID #561

the next 31 k-mers contained an ambiguous nucleotide

the next k-mer was not in the database

the last 3 k-mers mapped to taxonomy ID #562

Note that paired read data will contain a "|:|" token in this list to indicate the end of one read and the beginning of another.
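To make the format quoted above concrete, here is a small sketch that parses such an LCA string into runs with k-mer offsets and lists where the assigned taxon changes. The function names are hypothetical, and it assumes a single unpaired sequence (a contig), so the `|:|` token is rejected rather than handled.

```python
def parse_lca_string(lca):
    """Parse a Kraken2 LCA-mapping string into (kmer_start, kmer_count, taxid) runs."""
    runs = []
    pos = 0
    for token in lca.split():
        if token == "|:|":
            raise ValueError("paired-read separator found; expected a single sequence")
        taxid, count = token.split(":")
        runs.append((pos, int(count), taxid))
        pos += int(count)
    return runs


def taxon_switches(runs):
    """Return (kmer_start, previous_taxid, new_taxid) for each change of taxon.

    Runs labelled "0" (not in the database) or "A" (ambiguous nucleotide)
    are skipped, so a switch is reported only between classified runs.
    """
    switches = []
    prev = None
    for start, _count, taxid in runs:
        if taxid in ("0", "A"):
            continue
        if prev is not None and taxid != prev:
            switches.append((start, prev, taxid))
        prev = taxid
    return switches
```

Note that a switch to an ancestor taxon (an LCA higher in the tree) is not by itself evidence of recombination, so the output would still need filtering.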

On June 21, 2020 9:52:28 AM PDT, Robert Edgar notifications@github.com wrote:

@taltman Can you point me at the relevant documentation? As I understand it, the Kraken2 index doesn't store coordinates of k-mers, only taxonomy ids, so this would need to be a special feature somewhere.


ababaian commented 4 years ago

We're plotting percent identity (%) in a sliding window; doesn't that kind of defeat the purpose of a k-mer?

rcedgar commented 4 years ago

In theory, k-mers could work because k-mer identity correlates quite well with alignment identity. With Kraken2 specifically, I doubt it will work because they index only a small subset of the k-mers.
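As a rough illustration of using k-mers as a stand-in for alignment identity, the sketch below scores a window against a reference by the Jaccard similarity of their full k-mer sets. This is an assumed, simplified measure chosen for illustration (it uses every k-mer, unlike a subsampled index), and the function names and default `k` are hypothetical.

```python
def kmers(seq, k=8):
    """Return the set of all overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_similarity(a, b, k=8):
    """Jaccard similarity of the k-mer sets of a and b (0.0 if either is empty)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)
```

A single substitution removes up to `k` shared k-mers, so this score drops faster than percent identity as sequences diverge; it is useful for ranking references per window rather than as an identity estimate.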

rcedgar commented 4 years ago

Gideon Mordecai suggests RDP4 for recombination? https://academic.oup.com/ve/article/1/1/vev003/2568683

Rob Lanfear suggests PhiPack and 3SEQ are both good for recombination detection too. Different approaches, both powerful in their own ways.

https://www.maths.otago.ac.nz/~dbryant/software/phimanual.pdf

https://mol.ax/content/media/2018/02/3seq_manual.20180209.pdf

szhan commented 4 years ago

@ababaian mentioned that no one has been tackling this, so I'mma make an attempt if you guys don't mind!