dib-lab / charcoal

Remove contaminated contigs from genomes using k-mers and taxonomies.

single query contamination - more brainstorming #151

Open ctb opened 4 years ago

ctb commented 4 years ago

ref #132, #125

some ideas for display!

  1. response curve idea - show a graph of how much you have to remove to get rid of contamination. could fit that to a sigmoid, and use the AUC kept to pick out what tax rank to clean at.

image

@bluegenes:

This seems like a handy assessment of “how contaminated is your dataset”

  2. contig line up - each line represents a hash that is shared between database sequence (left) and query sequences (right). sequences would be ordered by size, based on intuition that short contigs are more likely to be binned incorrectly.

image

@bluegenes:

Not sure I understand this one completely. Would it only be for presumed contam? Worried it might be a mess? But could end up being a neat way to ID common contaminants?

  3. dot plot idea - query genome sequence coordinates, ordered by contig size, on top; database match(es) on left, ordered again by contig size. each ~dot represents shared hashes, or shared hashes within blocks of sequence, or something. Upper left suggests legit matches, lower right suggests not so much.

image

@bluegenes:

I think this one could be good! If it's a heatmap instead of dots, the color could vary with the number (or fraction) of matched hashes

  4. we could also prepare a summary of our database genomes that lets us retrieve hashes by contigs, or contig-specific taxonomy, or something. would bulk up required downloads tho.
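
A rough sketch of the response curve idea (item 1), to make it concrete. Everything here is an assumption for illustration: the input is a hypothetical list of `(contig_length_bp, contaminant_hash_count)` pairs, contigs are removed most-contaminated-first, and the AUC is a plain trapezoid sum rather than a sigmoid fit. One curve per tax rank could then be compared by area.

```python
def response_curve(contigs):
    """contigs: list of (length_bp, contaminant_hashes) pairs (hypothetical input).

    Returns (fraction_of_bp_removed, fraction_of_contamination_left) points,
    removing contigs with the highest contaminant-hash density first.
    """
    contigs = [(length, bad) for length, bad in contigs if length > 0]
    total_bp = sum(length for length, _ in contigs)
    total_bad = sum(bad for _, bad in contigs)
    if total_bp == 0 or total_bad == 0:
        return [(0.0, 0.0), (1.0, 0.0)]  # nothing to clean

    ordered = sorted(contigs, key=lambda c: c[1] / c[0], reverse=True)
    points = [(0.0, 1.0)]
    removed_bp = 0
    removed_bad = 0
    for length, bad in ordered:
        removed_bp += length
        removed_bad += bad
        points.append((removed_bp / total_bp, 1 - removed_bad / total_bad))
    return points


def auc(points):
    """Trapezoid-rule area under a list of (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


if __name__ == "__main__":
    curve = response_curve([(1000, 0), (100, 50), (200, 50)])
    for x, y in curve:
        print(f"removed {x:.3f} of bp, {y:.3f} of contamination left")
    print(f"AUC = {auc(curve):.4f}")
```

A small area means the contamination sits in a small fraction of the sequence; a large area means you have to throw away a lot to get clean.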

ctb commented 3 years ago

j'accuse!

the face of contamination...

image

ctb commented 3 years ago

OK, I can produce these diagrams easily for any two genomes, now, using mashmap.

image
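
For reference, a minimal sketch of turning mashmap-style output into dot plot segments with contigs laid out largest-first, as in the dot plot idea above. This is not the code used for the diagram. It assumes the ten whitespace-separated columns described in the MashMap README (query name, length, start, end, strand, target name, length, start, end, identity); other versions may emit a different layout, and the function names are made up for illustration. Only contigs that appear in at least one mapping get placed.

```python
def parse_mashmap_line(line):
    """Parse one mapping line, assuming the ten-column layout noted above."""
    f = line.split()
    return {
        "query": f[0], "qlen": int(f[1]), "qstart": int(f[2]), "qend": int(f[3]),
        "strand": f[4],
        "target": f[5], "tlen": int(f[6]), "tstart": int(f[7]), "tend": int(f[8]),
        "identity": float(f[9]),
    }


def offsets_by_size(lengths):
    """Lay contigs end to end, largest first; return {name: start offset}."""
    offsets = {}
    pos = 0
    for name, length in sorted(lengths.items(), key=lambda kv: (-kv[1], kv[0])):
        offsets[name] = pos
        pos += length
    return offsets


def dotplot_segments(lines):
    """Return ((x0, y0), (x1, y1), identity) segments in concatenated coordinates."""
    hits = [parse_mashmap_line(line) for line in lines if line.strip()]
    qoff = offsets_by_size({h["query"]: h["qlen"] for h in hits})
    toff = offsets_by_size({h["target"]: h["tlen"] for h in hits})
    segments = []
    for h in hits:
        x0 = qoff[h["query"]] + h["qstart"]
        x1 = qoff[h["query"]] + h["qend"]
        y0 = toff[h["target"]] + h["tstart"]
        y1 = toff[h["target"]] + h["tend"]
        if h["strand"] == "-":
            y0, y1 = y1, y0  # draw reverse-strand hits as descending lines
        segments.append(((x0, y0), (x1, y1), h["identity"]))
    return segments
```

Each segment can be handed to any plotting library as a line from the first point to the second, colored by identity.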

I'm thinking about -

ctb commented 3 years ago

Thought about this some more. The dotplot and other plots will work for really egregious types of contamination, but they're not super general - they handle certain cases well, like large contigs that are identical, but they don't handle more subtle cases well, like contamination spread throughout.

I suspect that dotplots and/or slope graphs will be part of a good summary, along with the response curve plot (top of this issue) and probably just a straight-up alignment / copy-pasteable output of the likely contaminated sequence(s). I'm looking at mummer for that.

Oh, and some estimates of ANI would be good, too. For each contaminant, "the query genome shares X bp at Y ANI with subject genome".
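
Something like this could produce that sentence from a list of alignment blocks. It is only a sketch: the input tuples `(subject_name, aligned_bp, percent_identity)` are a hypothetical format, and a length-weighted mean of block identities is a crude stand-in for a real ANI estimate (overlapping blocks would be double counted).

```python
from collections import defaultdict


def summarize_shared(alignments):
    """alignments: iterable of (subject_name, aligned_bp, percent_identity).

    Returns {subject: (total_aligned_bp, length_weighted_identity)}.
    """
    total_bp = defaultdict(int)
    weighted = defaultdict(float)
    for subject, bp, identity in alignments:
        if bp <= 0:
            continue  # skip empty blocks
        total_bp[subject] += bp
        weighted[subject] += bp * identity
    return {s: (total_bp[s], weighted[s] / total_bp[s]) for s in total_bp}


def format_summary(summary):
    """One line per subject genome, largest shared length first."""
    lines = []
    for subject, (bp, ani) in sorted(summary.items(), key=lambda kv: -kv[1][0]):
        lines.append(
            f"the query genome shares {bp} bp at {ani:.1f}% ANI with {subject}"
        )
    return lines


if __name__ == "__main__":
    blocks = [("genomeA", 1000, 99.0), ("genomeA", 3000, 95.0), ("genomeB", 500, 90.0)]
    for line in format_summary(summarize_shared(blocks)):
        print(line)
```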

and/or highlight contigs where there is high aligned containment in the other genome, i.e. not just aligned sequences but fully aligned contigs.
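
A minimal sketch of the "fully aligned contigs" check, assuming alignment blocks are available as hypothetical `(contig_name, start, end)` half-open intervals on the query contigs. Overlapping blocks are merged before computing the aligned fraction, and the 0.95 threshold is an arbitrary placeholder.

```python
def merged_length(intervals):
    """Total bp covered by half-open (start, end) intervals, merging overlaps."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered


def fully_aligned_contigs(contig_lengths, blocks, threshold=0.95):
    """Return {contig: aligned_fraction} for contigs at or above the threshold.

    contig_lengths: {name: length_bp}; blocks: list of (name, start, end).
    """
    by_contig = {}
    for name, start, end in blocks:
        by_contig.setdefault(name, []).append((start, end))
    flagged = {}
    for name, intervals in by_contig.items():
        fraction = merged_length(intervals) / contig_lengths[name]
        if fraction >= threshold:
            flagged[name] = fraction
    return flagged
```

With `contig_lengths = {"c1": 1000, "c2": 5000}` and blocks covering all of `c1` but only the first 1000 bp of `c2`, only `c1` is flagged.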

ctb commented 3 years ago

put stacked dot plots code here -- https://github.com/ctb/2020-stacked-dot-plots

ctb commented 3 years ago

image

image