dib-lab / charcoal

Remove contaminated contigs from genomes using k-mers and taxonomies.

evaluate results by using spacegraphcats to look at contig locality #54

Open ctb opened 4 years ago

ctb commented 4 years ago

once we have some pretty clear contaminants nailed down with this tool, we could try doing additional validation by using spacegraphcats. The basic question would be, are potential contaminant contigs graph-distant from the rest of their putative genomic contigs?

the main difference from our more grandiose plans for sgc/charcoal here is that we'd have a strong set of initial hypotheses to work from, and wouldn't need to think about whole-graph analysis at the start.

taylorreiter commented 4 years ago

As a first pass, I took a bin that had two contaminant contigs in it and queried with the whole bin. I hoped that the nbhd for this query would separate out, with the two contaminants appearing distant from the rest of the nbhd. I'm attaching a bandage plot of the bcalm cdbg for the bin query in one nbhd. The BLAST matches are colored. As can be seen from the plot, the contamination does not appear to separate clearly from the rest of the nbhd. I think we'll need more sophisticated ways to get at this question.

[Image: SRR1211157_bin.8.fa.gz dirty bandage]

More details: Charcoal output for the query bin:

genomefile | brieftax | f_major | f_ident | f_removed | n_reason_1 | n_reason_2 | n_reason_3 | refsize | ratio | clean_bp | clean_n | dirty_n | dirty_bp | missed_n | missed_bp | taxguessed | taxprovided | comment
SRR1211157_bin.8.fa.gz | genus g__Gemmiger | 0.92921074 | 0.59143407 | 0.02517407 | 0 | 1 | 1 | 3125000 | 0.67 | 2087657 | 237 | 2 | 53912 | 5 | 15618 | d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Oscillospirales;f__Ruminococcaceae;g__Gemmiger;s__Gemmiger formicilis | |
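
As a quick sanity check on the row above, here is a minimal sketch of how two of the columns appear to relate to the base-pair counts. The formulas are an assumption inferred from the numbers in this one row, not taken from the charcoal source: `f_removed` as dirty bp over clean plus dirty bp, and `ratio` as clean bp over `refsize`.

```python
# Values copied from the charcoal row for SRR1211157_bin.8.fa.gz.
clean_bp = 2087657
dirty_bp = 53912
refsize = 3125000

# ASSUMPTION: fraction of the bin (in bp) flagged as contaminant.
f_removed = dirty_bp / (clean_bp + dirty_bp)

# ASSUMPTION: cleaned bin size relative to the reference size estimate.
ratio = clean_bp / refsize

print(f"f_removed ~ {f_removed:.8f}")
print(f"ratio     ~ {ratio:.2f}")
```

Under those assumed definitions, both values agree with the reported `f_removed` and `ratio` at the precision shown in the table.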

This metagenome bin was from metagenome SRR1211157. I downloaded, adapter trimmed, human-removed, and k-mer trimmed this metagenome in github.com/dib-lab/2020-ibd, and already had a catlas built for it. I queried the metagenome SRR1211157 with the bin SRR1211157_bin.8.fa.gz using a radius of 1 (r1) and k=31. Then, with the cdbg_ids.reads* file output by this query, I used bcalm to make a cdbg (unitigs.fa), and converted this file to gfa format. I plotted the gfa file as a bandage graph and blasted the two contaminant contigs against it. In the plot above, all of the 100% matches for one of the contigs are colored.
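
For anyone wanting to reproduce the unitigs-to-gfa step, here is a rough sketch of the conversion. It is not necessarily the script used above. It assumes the bcalm header layout as I understand it (`>ID LN:i:.. KC:i:.. km:f:.. L:<from_orient>:<to_ID>:<to_orient> ...`), so check it against your own unitigs.fa. `parse_fasta` and `bcalm_to_gfa` are hypothetical helper names.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], []])
        else:
            records[-1][1].append(line)
    return [(header, "".join(parts)) for header, parts in records]


def bcalm_to_gfa(fasta_text, k=31):
    """Turn bcalm-style unitigs into GFA 1 text.

    ASSUMPTION: links are encoded in the header as L:<+|->:<id>:<+|->,
    and adjacent unitigs overlap by k - 1 bases.
    """
    segments, links = [], []
    for header, seq in parse_fasta(fasta_text):
        fields = header.split()
        name = fields[0]
        segments.append(f"S\t{name}\t{seq}")
        for field in fields[1:]:
            if field.startswith("L:"):
                _, from_orient, to_name, to_orient = field.split(":")
                links.append(
                    f"L\t{name}\t{from_orient}\t{to_name}\t{to_orient}\t{k - 1}M"
                )
    return "\n".join(["H\tVN:Z:1.0"] + segments + links) + "\n"


# Toy example with k=3 (two unitigs overlapping by 2 bases).
toy = ">0 LN:i:4 KC:i:2 km:f:1.0 L:+:1:+\nACGT\n>1 LN:i:3 KC:i:1 km:f:1.0 L:-:0:-\nGTA\n"
gfa = bcalm_to_gfa(toy, k=3)
print(gfa)
```

The resulting text can be written to a `.gfa` file and opened in Bandage.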

taylorreiter commented 4 years ago

Here's another png from the bandage viz. I've zoomed out, so it's very difficult to make out the colored portions of the graph that indicate the BLAST matches. However, this image shows that there are disconnected components in this cDBG, and that these do not match the contaminant sequences.

[Image: SRR1211157_bin.8.fa.gz dirty bandage3]

ctb commented 4 years ago

so (brainstorming a bit, it's not clear you can do this easily with sgc just yet) --

what if we did queries with each of the contigs, instead of the whole genome bin, and asked which contigs transitively connect to other contigs? I would hope for/expect contaminant contigs to NOT be in the neighborhood of the rest of the genome, for some radius.
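
One way to sketch the "transitively connect" idea, assuming we already have, for each contig, the set of cDBG node IDs in its neighborhood at some radius. The input dict and the function name `connected_groups` are hypothetical, and this is not an existing sgc feature. Contigs whose neighborhoods share any node are merged with union-find, and the resulting groups show which contigs stay isolated.

```python
from collections import defaultdict


def connected_groups(nbhd_nodes):
    """Group contigs whose neighborhoods share cDBG nodes, transitively.

    nbhd_nodes: dict mapping contig name -> set of cDBG node IDs
    (ASSUMPTION: extracted from per-contig neighborhood queries).
    Returns a list of sets of contig names, largest group first.
    """
    parent = {name: name for name in nbhd_nodes}

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_owner = {}  # node ID -> first contig seen containing it
    for name, nodes in nbhd_nodes.items():
        for node in nodes:
            if node in first_owner:
                parent[find(name)] = find(first_owner[node])
            else:
                first_owner[node] = name

    groups = defaultdict(set)
    for name in nbhd_nodes:
        groups[find(name)].add(name)
    return sorted(groups.values(), key=len, reverse=True)


# Toy data: a, b, c chain together; x shares nothing with them.
toy = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4}, "x": {9}}
groups = connected_groups(toy)
print(groups)
```

In this toy case `x` ends up alone, which is the pattern we would hope to see for a contaminant contig.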

If you wanted to try this in an ad hoc way, you could:

taylorreiter commented 4 years ago

I think I understand but want to double check -- we would expect that, when doing single-contig (or small contig group) queries from the same genome bin, some queries would bring in the same dom sets, and so their nbhds would have overlapping k-mers. However, if a contig were a contaminant, we would expect its nbhd to have lower or no k-mer overlap with the nbhds from non-contaminant contig queries.
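
A minimal sketch of that comparison, assuming the nbhd sequences for each contig query have already been reduced to k-mer sets. The function names, the use of the median, and the 0.05 threshold are all arbitrary choices for illustration; none of this comes from sgc or charcoal. Each nbhd is scored by its median containment in the other nbhds, so a contig that only overlaps one other outlier still gets flagged.

```python
from statistics import median

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq, k=31):
    """Set of canonical k-mers (min of each k-mer and its reverse complement)."""
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) - set("ACGT"):
            continue  # skip k-mers containing N or other ambiguity codes
        rc = kmer.translate(_COMPLEMENT)[::-1]
        kmers.add(min(kmer, rc))
    return kmers


def containment(a, b):
    """Fraction of the k-mers in a that are also in b."""
    return len(a & b) / len(a) if a else 0.0


def flag_low_overlap(nbhd_kmers, threshold=0.05):
    """Return contigs whose nbhd has low median containment in the other nbhds.

    nbhd_kmers: dict mapping contig name -> set of nbhd k-mers (hypothetical input).
    threshold: ASSUMPTION, an arbitrary cutoff to be tuned on real data.
    """
    flagged = []
    for name, kmers in nbhd_kmers.items():
        scores = [containment(kmers, other)
                  for other_name, other in nbhd_kmers.items()
                  if other_name != name]
        if scores and median(scores) < threshold:
            flagged.append(name)
    return flagged


# Toy data (integers stand in for k-mers): c1-c3 overlap, x does not.
toy = {"c1": {1, 2, 3, 4}, "c2": {2, 3, 4, 5}, "c3": {3, 4, 5, 6}, "x": {100, 101}}
flagged = flag_low_overlap(toy)
print(flagged)
```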

ctb commented 4 years ago

yes

taylorreiter commented 4 years ago

cool! I'll put it on my list of things to play with :)