SionBayliss / PIRATE

A toolbox for pangenome analysis and threshold evaluation.
GNU General Public License v3.0

pangenome_alignment.fasta file explanation #35

Closed by cizydorczyk 4 years ago

cizydorczyk commented 4 years ago

Hello,

I just wanted to clarify what the pangenome alignment file contains. From my understanding, it contains all genes in the pangenome, concatenated together, with a sequence for each sample in the PIRATE run. Genes are in the same order as in the ordered tsv file. Is this correct?

If so, are the genes concatenated simply end-to-end, or is some spacer used? And how are missing genes handled: with "N"s or with "-"? I understand that dosage > 1 genes are replaced by "?", but does this apply to all copies of the gene, or only to the copies after the first (however that is defined)?

Thank you! Conrad

SionBayliss commented 4 years ago

Hi Conrad,

The genes are concatenated end to start (forward orientation) with no spacer. By default, they are printed in the order in which they appear in the *.ordered.tsv file. The GFF denotes the position of each gene. Ambiguous positions, such as multiple potential copies of the same gene, are replaced with a dash ("-") to differentiate them from missing genes or gaps caused by alignment, which are denoted with an N. (Note: the "?" is likely due to an older version of PIRATE.)

There will only ever be one sequence per genome for a gene family. The option --multi-include will include the longest version of multi-copy genes instead of replacing them with a dash.

There are some other options in the create_pangenome_alignment script that may be useful.

All the best, Sion
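
For readers who want to pull a single gene back out of the concatenated alignment, the layout described in the reply above can be sketched in Python. This is only an illustration and not part of PIRATE: it assumes the accompanying GFF uses standard tab-separated GFF3 columns (1-based, inclusive start and end in columns 4 and 5) with an `ID=` attribute per gene, and the function names, gene IDs, and toy sequences are all hypothetical. The classification step also assumes that a multi-copy or missing gene fills its whole block with a single character, which you should check against your own output.

```python
from typing import Dict, Tuple


def parse_gff_coords(gff_text: str) -> Dict[str, Tuple[int, int]]:
    """Map each feature ID to (start, end), 1-based inclusive.

    Assumes GFF3-style lines: 9 tab-separated columns, ID=... in column 9.
    """
    coords: Dict[str, Tuple[int, int]] = {}
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        cols = line.split("\t")
        if len(cols) < 9:
            continue  # not a feature line
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        gene_id = attrs.get("ID")
        if gene_id is not None:
            coords[gene_id] = (int(cols[3]), int(cols[4]))
    return coords


def slice_gene(alignment: Dict[str, str],
               coords: Dict[str, Tuple[int, int]],
               gene_id: str) -> Dict[str, str]:
    """Return the block of one gene for every genome in the alignment."""
    start, end = coords[gene_id]
    # Genes are concatenated with no spacer, so a plain slice is enough.
    return {genome: seq[start - 1:end] for genome, seq in alignment.items()}


def classify_block(block: str) -> str:
    """Label a gene block using the conventions described in the thread."""
    chars = set(block.upper())
    if chars == {"-"}:
        return "ambiguous"  # e.g. multiple copies of the same gene
    if chars == {"N"}:
        return "missing"    # gene absent in this genome
    return "present"        # real sequence, possibly with N alignment gaps


# Toy data (hypothetical): two genes of length 6 and 4, three genomes.
toy_gff = (
    "##gff-version 3\n"
    "pangenome\texample\tgene\t1\t6\t.\t+\t.\tID=g0001\n"
    "pangenome\texample\tgene\t7\t10\t.\t+\t.\tID=g0002\n"
)
toy_alignment = {
    "genomeA": "ATGAAA" + "CCGT",
    "genomeB": "------" + "CCGT",
    "genomeC": "ATGNAA" + "NNNN",
}

toy_coords = parse_gff_coords(toy_gff)
for gene in ("g0001", "g0002"):
    blocks = slice_gene(toy_alignment, toy_coords, gene)
    print(gene, {g: classify_block(b) for g, b in blocks.items()})
```

In practice the alignment dictionary would be filled from pangenome_alignment.fasta with any FASTA reader, and the GFF text read from the file written alongside it.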

cizydorczyk commented 4 years ago

Great, thank you for the clarification!