eead-csic-compbio / get_homologues

GET_HOMOLOGUES: a versatile software package for pan-genome analysis

Why does the pangenome size differ so much? #77

Closed TommyH-Tran closed 3 years ago

TommyH-Tran commented 3 years ago

The pangenome size was calculated with the COGS and OMCL algorithms by running ./get_homologues.pl with -t 0 (attachment: venn_t0_Cacc.pdf).

The estimated pangenome size was then calculated by sampling (attachment: Cacc_pan_genome_algOMCL tab_pan-01).

I was wondering why there is such a large difference between the two results. The pangenome size from the Venn diagram is approximately 2,000 GCs larger than the size estimated by sampling. Is there a way to reduce this difference? I have seen the same discrepancy consistently across multiple sets of species, and it is hard to explain when I present the data.

brunocontrerasmoreira commented 3 years ago

Please see http://eead-csic-compbio.github.io/get_homologues/manual/manual.html#SECTION00063000000000000000 :

">The number of clusters produced with -C 75 -S 70 does not match the pangenome size estimated with option -c

The reason for these discrepancies is that they are fundamentally different analyses. While the default runmode simply groups sequences, trying to place orthologues and inparalogues in the same clusters, a genome composition analysis performs a simulation in order to estimate how many novel sequences are added by genomes sampled in random order. In terms of code, two key global variables set in lib/marfil_homology.pm, lines 130-131, control how a gene is compared to previously processed pangenome genes in order to call it novel:

$MIN_PERSEQID_HOM = 0;
$MIN_COVERAGE_HOM = 20;

The first variable is set by default to 0, meaning that there is no %identity limitation to call homologues. The second is set to 20, which means that any sequence matching a previous gene with $\mathrm{coverage} \ge 20\%$ will be considered homologous and thus won't be considered new. As you can see, these are very stringent values for calling a gene novel: almost any match to a previous gene is enough to keep a sequence out of the pangenome.

Now, in your settings, you might want to change these values. For instance, Tettelin and colleagues in their landmark work used values of 50 and 50 (PubMed=16172379), which means that protein sequences with $\mathrm{coverage} \ge 50\%$ and $\mathrm{identity} \ge 50\%$ to previous genes will be called homologues, and therefore won't be added to the growing pangenome. In other words, you should tweak these variables to your particular settings. "
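To make the effect of the two thresholds concrete, here is a minimal Python sketch of the idea described in the manual excerpt. It is not GET_HOMOLOGUES code: the function names, the toy genomes and the identity/coverage values are all made up for illustration, and the real simulation samples genomes in random order rather than enumerating every order as done here.

```python
from itertools import permutations

def is_homologous(identity, coverage, min_identity, min_coverage):
    """A hit counts as homologous only if it passes both thresholds
    (the roles played by $MIN_PERSEQID_HOM and $MIN_COVERAGE_HOM)."""
    return identity >= min_identity and coverage >= min_coverage

def pangenome_size(genomes, hits, order, min_identity, min_coverage):
    """Add genomes in the given order and count the genes called novel."""
    pan = []  # genes accumulated in the growing pangenome
    for genome in order:
        for gene in genomes[genome]:
            novel = True
            for prev in pan:
                # Hits are stored once per pair, so look up both orientations.
                hit = hits.get((gene, prev)) or hits.get((prev, gene))
                if hit and is_homologous(*hit, min_identity, min_coverage):
                    novel = False
                    break
            if novel:
                pan.append(gene)
    return len(pan)

def mean_pangenome_size(genomes, hits, min_identity, min_coverage):
    """Average the pangenome size over every possible genome order."""
    sizes = [
        pangenome_size(genomes, hits, order, min_identity, min_coverage)
        for order in permutations(genomes)
    ]
    return sum(sizes) / len(sizes)

# Hypothetical toy data: three genomes with two genes each.
genomes = {"A": ["a1", "a2"], "B": ["b1", "b2"], "C": ["c1", "c2"]}

# Hypothetical pairwise hits as (identity %, coverage %).
hits = {
    ("a1", "b1"): (95, 90), ("a1", "c1"): (92, 88), ("b1", "c1"): (93, 91),
    ("a2", "b2"): (35, 25), ("a2", "c2"): (32, 30), ("b2", "c2"): (30, 40),
}

print(mean_pangenome_size(genomes, hits, 0, 20))   # default-like: 2.0
print(mean_pangenome_size(genomes, hits, 50, 50))  # Tettelin-like: 4.0
```

In this toy case the weak matches among a2, b2 and c2 are enough to merge them under the 0/20 settings, so the pangenome stops at 2 genes, whereas the 50/50 settings treat them as three novel genes and the pangenome grows to 4. Raising the thresholds would therefore be expected to push the sampled estimate upwards, towards the cluster count.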

TommyH-Tran commented 3 years ago

Thank you, this was helpful to know. I will try out the other stringencies.