bacpop / ggCaller

Bifrost graph gene caller.
MIT License

interpret the gene_presence_absence.csv #26

Closed abcdtree closed 6 months ago

abcdtree commented 6 months ago

Hi Sam,

In the gene_presence_absence.csv table, if I understand it correctly, each genome I input has a value such as 5_refound_-1718 or 5_0_24879 when the gene in that row is present, and an empty cell when it is not.

I guess the 5 at the front is the genome index, but what do refound and 0 mean? And what is the number at the end?

Thanks,

Josh

samhorsfield96 commented 6 months ago

Hi Josh,

The naming system is an artefact from Panaroo. For the gene name X_Y_Z:

'refound' refers to a gene prediction that was missed on the first pass but was recalled by Panaroo's gene refining algorithm. These genes also have a negative gene index to distinguish them from non-refound genes.
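
As a rough illustration of the naming scheme, an ID can be split on underscores into its three fields. This is only a sketch and not part of ggCaller or Panaroo: `parse_gene_id`, `GeneID` and the field names are hypothetical. It relies only on what is said in this thread, namely that the last field is a gene index (negative for refound genes) and that the middle field is either `refound` or another value such as `0`. That the first field is the genome index is Josh's guess from the question.

```python
from typing import NamedTuple


class GeneID(NamedTuple):
    first: str       # X: guessed in the question to be the genome index
    middle: str      # Y: "refound" or another value such as "0"
    gene_index: int  # Z: gene index, negative for refound genes
    refound: bool    # True when the middle field is "refound"


def parse_gene_id(gene_id: str) -> GeneID:
    # Split into at most three parts so the minus sign of Z is kept intact.
    x, y, z = gene_id.split("_", 2)
    return GeneID(x, y, int(z), y == "refound")


refound_gene = parse_gene_id("5_refound_-1718")
regular_gene = parse_gene_id("5_0_24879")
```

Checking `refound` (or the sign of `gene_index`) is then enough to separate refound genes from those called on the first pass.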

In the gene_presence_absence.csv table, each cell describes a specific gene found in a given genome (each column) which belongs to a given gene cluster (each row). Indeed, if this cell is empty, a gene belonging to that cluster is not found in that genome.
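
The row/column reading can be sketched in a few lines of Python. This is an illustration under assumptions, not ggCaller's own code: the column names `Gene`, `genome_A` and `genome_B`, the tiny example table and the helper `presence_by_cluster` are all hypothetical, and a real gene_presence_absence.csv may contain additional columns.

```python
import csv
import io


def presence_by_cluster(csv_text, cluster_column, genome_columns):
    """Map each cluster (row) to {genome: True/False}; empty cell means absent."""
    reader = csv.DictReader(io.StringIO(csv_text))
    result = {}
    for row in reader:
        result[row[cluster_column]] = {
            genome: bool((row.get(genome) or "").strip())
            for genome in genome_columns
        }
    return result


# Hypothetical two-genome table; the layout is an assumption for illustration.
example = (
    "Gene,genome_A,genome_B\n"
    "cluster_1,5_0_24879,\n"
    "cluster_2,5_refound_-1718,6_0_12\n"
)
table = presence_by_cluster(example, "Gene", ["genome_A", "genome_B"])
```

Here `table["cluster_1"]["genome_B"]` is `False` because that cell is empty, while both genomes have a gene in `cluster_2`.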

Hope this helps!