TGAC / KAT

The K-mer Analysis Toolkit (KAT) contains a number of tools that analyse and compare K-mer spectra.
http://www.earlham.ac.uk/kat-tools
GNU General Public License v3.0

Contamination in GCP plots #150

Giacomoggioli closed this issue 2 years ago

Giacomoggioli commented 3 years ago

Hello,

I am working with two animal species that live in symbiosis with bacteria, so our Illumina reads come from both the host and the symbiont. I used kat gcp to produce the attached plots. I would now like to extract only the host's k-mers so that I can estimate the host genome size properly. I understand that "kat filter kmer" and "kat filter seq" should let me do this, but I am not sure which k-mers come from the host and which come from the bacteria. Is there a way to tell by looking at the gcp plots? Finally, which --threshold would you suggest I use with "kat filter seq"?

Best regards,

Giacomo

kat-sp2.pdf kat-sp1.pdf

gonzalogacc commented 3 years ago

Hi Giacomo. One way of identifying the distributions is by the approximate genome size each one appears to sample: count the number of k-mers under each distribution and compare that with the sizes you expect. For this, you need to know roughly your genome sizes. Another option is to make a draft assembly and then use kat sect to project the k-mer counts onto the sequences (contigs should be enough), then BLAST a few sequences to identify which contigs come from which distribution. Hope it helps! Best, Gonza
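
To make the "count the k-mers under each distribution" idea concrete, here is a minimal Python sketch. It is not KAT code and does not parse KAT output: it assumes you already have a k-mer histogram as a mapping from multiplicity to number of distinct k-mers, and that you have picked the coverage window for each distribution by eye from the plot. The function name, the window bounds, and the toy numbers are all made up for illustration.

```python
def estimate_size(hist, lo, hi):
    """Rough size of the genome sampled by one distribution.

    hist   : dict {k-mer multiplicity: number of distinct k-mers}
             (hypothetical input format, not a KAT file format)
    lo, hi : inclusive multiplicity window chosen by eye from the plot

    Returns (distinct, by_coverage):
      distinct    - number of distinct k-mers inside the window
      by_coverage - total k-mer occurrences in the window divided by
                    the multiplicity at the peak of the window
    """
    window = {m: c for m, c in hist.items() if lo <= m <= hi}
    if not window:
        raise ValueError("no k-mers in the requested coverage window")
    peak = max(window, key=window.get)              # modal multiplicity
    distinct = sum(window.values())                 # k-mers under the curve
    total = sum(m * c for m, c in window.items())   # occurrences under the curve
    return distinct, total / peak


# Toy histogram with two well-separated distributions (invented numbers):
# one peaking at 30x and a much smaller one peaking at 120x.
toy_hist = {
    28: 100, 29: 200, 30: 400, 31: 200, 32: 100,
    118: 10, 119: 20, 120: 40, 121: 20, 122: 10,
}

low_cov = estimate_size(toy_hist, 20, 40)
high_cov = estimate_size(toy_hist, 100, 140)
print("low-coverage distribution:", low_cov)
print("high-coverage distribution:", high_cov)
```

Whichever distribution gives a size close to what you expect for a bacterial genome is probably the symbiont, and the other the host. On real data the distributions overlap and repeats or heterozygosity distort both numbers, so treat this as a sanity check only.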