dib-lab / 2021-panmers

Exploring amino acid kmers as a substitute for genes in microbial pangenome analysis
BSD 3-Clause "New" or "Revised" License

flag outlier genomes using # genes vs # k-mers #1

Open bluegenes opened 2 years ago

bluegenes commented 2 years ago

image

Your results show that the correlation between the number of protein k-mers and the number of genes per genome is quite good for most of the species (you said generally 0.9 and usually nearly 1), and that outliers (like those on this plot you showed me) tend to be genomes that were excluded from RefSeq for having many frameshifted proteins. Given this, we could (e.g. as part of charcoal) flag or report genomes that are outliers relative to other genomes of the same species.
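For illustration, here is a minimal sketch of what per-species flagging could look like. It is not code from this repo or from charcoal. It assumes we already have, for each genome of one species, a count of distinct protein k-mers and a count of annotated genes, and it swaps in a simple robust rule (a modified z-score on the genes-per-k-mer ratio, based on the median absolute deviation) rather than the correlation itself. The function name, the input layout, and the 3.5 cutoff are all assumptions.

```python
from statistics import median


def flag_outliers(genomes, threshold=3.5):
    """Flag genomes whose genes-per-k-mer ratio is unusual for the species.

    genomes: list of (genome_id, n_kmers, n_genes) tuples for ONE species
             (hypothetical layout, assumed for this sketch).
    threshold: cutoff on the modified z-score (3.5 is a common rule of
               thumb, not a value taken from this project).
    """
    ratios = [n_genes / n_kmers for _, n_kmers, n_genes in genomes]
    med = median(ratios)
    # Median absolute deviation: a spread estimate that the outliers
    # themselves do not inflate the way a standard deviation would.
    mad = median(abs(r - med) for r in ratios)
    if mad == 0:
        # All ratios (or at least half of them) are identical, so there is
        # no spread to scale by; flag nothing rather than divide by zero.
        return []
    flagged = []
    for (genome_id, _, _), r in zip(genomes, ratios):
        modified_z = 0.6745 * (r - med) / mad
        if abs(modified_z) > threshold:
            flagged.append(genome_id)
    return flagged


# Made-up numbers, purely to show the call; not real genome data.
example = [
    ("g1", 100_000, 2_000),
    ("g2", 110_000, 2_200),
    ("g3", 95_000, 1_910),
    ("g4", 105_000, 2_090),
    ("g5", 100_000, 2_900),  # many more genes than its k-mer count suggests
]
suspects = flag_outliers(example)
```

A ratio-based rule like this only makes sense if the relationship passes close to the origin within a species; if it does not, residuals from a fitted line would be the better quantity to score.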

taylorreiter commented 2 years ago

Right, so we probably wouldn't be able to flag outliers without also annotating CDS within a genome, or without being given a GFF file that does that, but we could potentially provide a range for the expected number of genes, as you mentioned during our call. So far, I've only checked these correlations on a per-species basis. I think for this idea to work, we would need a single regression line for "all" species (and a confidence interval around it). I can try that with the data I have to see whether most species follow the same pattern.
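As a rough sketch of the single-regression idea, assuming the per-genome counts from all species are pooled into two lists: fit one least-squares line of genes on protein k-mers, then report an expected range of gene counts for a new genome. Two choices here are mine rather than anything settled in this thread. First, the range is a prediction interval (for an individual genome) rather than a confidence interval for the line itself. Second, it uses a normal quantile instead of Student's t, which is only reasonable when many genomes are pooled. All names and numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean
from typing import NamedTuple


class Fit(NamedTuple):
    slope: float
    intercept: float
    resid_sd: float  # residual standard deviation
    x_mean: float
    sxx: float       # sum of squared deviations of x from its mean
    n: int


def fit_line(n_kmers, n_genes):
    """Ordinary least squares of gene count on protein k-mer count."""
    n = len(n_kmers)
    mx, my = mean(n_kmers), mean(n_genes)
    sxx = sum((x - mx) ** 2 for x in n_kmers)
    sxy = sum((x - mx) * (y - my) for x, y in zip(n_kmers, n_genes))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2
                 for x, y in zip(n_kmers, n_genes))
    return Fit(slope, intercept, sqrt(ss_res / (n - 2)), mx, sxx, n)


def expected_gene_range(fit, kmers, level=0.95):
    """Approximate prediction interval for the gene count of one genome."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # normal approximation
    centre = fit.intercept + fit.slope * kmers
    half_width = z * fit.resid_sd * sqrt(
        1 + 1 / fit.n + (kmers - fit.x_mean) ** 2 / fit.sxx
    )
    return centre - half_width, centre + half_width


# Made-up pooled counts, purely to show the calls; not real data.
kmers = [100_000, 200_000, 300_000, 400_000, 500_000]
genes = [2_040, 4_060, 6_050, 8_040, 10_060]
fit = fit_line(kmers, genes)
low, high = expected_gene_range(fit, 300_000)
is_outlier = not (low <= 9_000 <= high)  # a genome reporting 9,000 genes
```

Whether one pooled line is adequate is exactly the open question above; if slopes differ a lot between species, the pooled residual spread will be wide and the range will flag very little.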