SionBayliss / PIRATE

A toolbox for pangenome analysis and threshold evaluation.
GNU General Public License v3.0

Why and how does PIRATE create the pangenome.gfa file? #59

Closed limin321 closed 3 years ago

limin321 commented 3 years ago

Hi,

I ran a PIRATE pangenome analysis on a group of bacterial strains. One of the output files, pangenome.gfa, lists the unique connections between gene families. In pangenome.gfa, I can see that g005352 is connected to all of the following gene families:

L g004900 + g005352 + 0M
L g004979 + g005352 + 0M
L g005352 - g006024 + 0M
L g005352 - g007263 + 0M
L g005352 - g014386 + 0M
L g005352 + g004800 + 0M
L g015712 + g005352 + 0M
L g025775 + g005352 + 0M
L g026645 + g005352 + 0M
L g027770 + g005352 + 0M
L g034584 + g005352 + 0M
L g042506 + g005352 + 0M
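
For reference, in the GFA 1 format each `L` (link) line has six fields: record type, first segment, its orientation, second segment, its orientation, and an overlap given as a CIGAR string (`0M` denotes a zero-length overlap). Below is a minimal Python sketch for listing the neighbours of one gene family from such lines. The function name `parse_gfa_links` is hypothetical, and the sketch assumes whitespace- or tab-separated fields as in the excerpt quoted in this thread; it is not part of PIRATE.

```python
from collections import defaultdict


def parse_gfa_links(lines):
    """Map each segment to a set of (neighbour, own_orientation, neighbour_orientation).

    Hypothetical helper: assumes GFA 1 link lines with six
    whitespace-separated fields. Non-link lines are skipped.
    """
    neighbours = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) < 6 or fields[0] != "L":
            continue
        _, src, src_orient, dst, dst_orient, _overlap = fields[:6]
        # Record the link from both ends so either gene can be looked up.
        neighbours[src].add((dst, src_orient, dst_orient))
        neighbours[dst].add((src, dst_orient, src_orient))
    return neighbours


# Three of the link lines quoted in this thread.
example = [
    "L g004900 + g005352 + 0M",
    "L g005352 - g006024 + 0M",
    "L g005352 + g004800 + 0M",
]
links = parse_gfa_links(example)
print(sorted(name for name, _, _ in links["g005352"]))
# → ['g004800', 'g004900', 'g006024']
```

To run this on a real file, the lines of pangenome.gfa could be passed in place of `example`, for instance via `open("pangenome.gfa")`.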

Because I am also interested in gene association, I ran coinfinder to see which genes are associated with each other. Interestingly, coinfinder reported the following two genes as associated, and PIRATE also shows them as uniquely connected: g005352_000002__g005352, g004800_000002__g004800.

My question is: how should I interpret the connections between genes in the .gfa file? Do they indicate an association between different genes?

I understand that reads overlap during genome assembly. However, my PIRATE inputs are either complete genomes or contigs, supplied as GFF files from Prokka annotation. How should I interpret these unique connections between genes, and the overlap between genes?

I also posted this question on the GFA GitHub page 2 weeks ago and haven't received an answer yet.

Thank you so much.

Really appreciate that. Best, Limin

SionBayliss commented 3 years ago

Hi Limin,

The script for PIRATE is simple. It considers an edge to be present when two genes are found next to each other on a contig. If the two genes are always found together on all contigs, there will be only one edge between them. If a gene is found next to multiple genes, there will be multiple edges. The script also considers direction, which is indicated in the .gfa file by a loop from one end of the gene to another.
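
The adjacency rule described above can be sketched in a few lines of Python. This is an illustration only, not PIRATE's own script: the function names, the input layout (each contig as an ordered list of gene family and strand pairs), and the way a link and its reverse complement are collapsed into one edge are all assumptions made for the example.

```python
FLIP = {"+": "-", "-": "+"}


def adjacency_links(contigs):
    """Collect one link per unique pair of neighbouring gene families.

    contigs: list of contigs, each an ordered list of (gene_family, strand).
    Assumption: a link and its reverse complement describe the same
    adjacency, so only the lexicographically smaller form is stored.
    """
    links = set()
    for contig in contigs:
        # Pair every gene with the gene that follows it on the contig.
        for (a, a_strand), (b, b_strand) in zip(contig, contig[1:]):
            forward = (a, a_strand, b, b_strand)
            reverse = (b, FLIP[b_strand], a, FLIP[a_strand])
            links.add(min(forward, reverse))
    return sorted(links)


def to_gfa_lines(links):
    """Format links as GFA 1 link lines with a zero-length overlap (0M)."""
    return ["\t".join(("L", a, a_s, b, b_s, "0M")) for a, a_s, b, b_s in links]


# Hypothetical gene order on two contigs. The second contig carries the
# g005352/g004800 neighbourhood in the opposite orientation.
contigs = [
    [("g004900", "+"), ("g005352", "+"), ("g004800", "+")],
    [("g004800", "-"), ("g005352", "-"), ("g004979", "-")],
]
links = adjacency_links(contigs)
gfa_lines = to_gfa_lines(links)
print("\n".join(gfa_lines))
```

In this toy input, g005352 and g004800 are neighbours on both contigs but contribute a single edge, while g005352 gains one edge for each distinct neighbour. An edge therefore records physical adjacency on a contig; on its own it says nothing about statistical association between genes.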

All the best, Sion