alexcritschristoph / soil_popgen

Reproducible scripts and notebooks for 2019 paper on population genetics in metagenomes
GNU General Public License v3.0
13 stars 0 forks

How to do pan-genome decontamination #7

Open haihao999 opened 2 years ago

haihao999 commented 2 years ago

Hi, I've already run the other steps, but I haven't done the pan-genome decontamination because I'm using Roary for the first time. Could you explain how to do it in detail? Thanks!

alexcritschristoph commented 2 years ago

Hello, pan-genome decontamination can be done in many different ways. First, dereplicate your genomes and look at each cluster of genomes with very similar ANI, say >95%. Then run Roary on each cluster to group the genes from each genome into protein clusters. The trick is then to parse the Roary output to identify contigs in any given genome that mostly carry genes not found in the other genomes. Previously I used this rule: "Contigs with at least 50% of their protein clusters found in less than 25% of each genome set were then discarded as potential contamination (generally fewer than 20 contigs, often small, were removed per genome)."

I don't have a script on hand to do this, but it should not be hard to write in Python, and would indeed be a good exercise!

Pseudocode:

for each gene cluster, count the number of genomes it appears in

for each contig:
    for each gene cluster with a gene on that contig:
        check whether that gene cluster appears in fewer than 25% of genomes
    if at least 50% of the gene clusters on this contig appear in fewer than 25% of genomes:
        mark that contig as contamination