SionBayliss / PIRATE

A toolbox for pangenome analysis and threshold evaluation.
GNU General Public License v3.0

gene_families and pangenome gff file do not match #47

Closed by maguileraf 4 years ago

maguileraf commented 4 years ago

Hi,

I am looking at my gene_families.tsv file, which has ~11,000 entries, but the annotation file (pangenome_alignment.gff) has only ~7,500. Why are so many gene families missing?

Thanks in advance, Marcela

SionBayliss commented 4 years ago

The pangenome alignment leaves out certain potentially problematic sequences, namely those with a high copy number or a large number of truncations. Any gene family with an average copy number (gene dosage) of >1.25 will not be included in the alignment. This threshold can be changed with the --dosage option of the alignment scripts:

align_feature_sequences.pl --dosage 1.25 -i PIRATE.*.tsv -g ./modified_gffs/ -o ./feature_sequences/ -p threads;

create_pangenome_alignment.pl --dosage 1.25 -i PIRATE.*.tsv -f ./feature_sequences/ -o pangenome_alignment.fasta -g pangenome_alignment.gff;
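To get a rough idea of how many families the dosage cutoff removes, you could count them directly from the gene families table. The sketch below is only an illustration, not PIRATE's own code: it assumes the table is tab-separated and has a per-family average dosage column, here called average_dose (check the header of your own file), and it runs on a small made-up table. It only covers the dosage filter, so it will not account for families dropped for other reasons such as truncations.

```python
import csv
import io


def split_by_dosage(tsv_text, dosage_column="average_dose", max_dosage=1.25):
    """Split rows of a tab-separated table into (kept, excluded).

    A row is excluded when its dosage is strictly greater than max_dosage,
    mirroring the ">1.25" rule described in the thread. The column name is
    an assumption and may differ in your file.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept, excluded = [], []
    for row in reader:
        if float(row[dosage_column]) > max_dosage:
            excluded.append(row)
        else:
            kept.append(row)
    return kept, excluded


# Hypothetical three-family table, for illustration only.
example = (
    "gene_family\taverage_dose\n"
    "g0001\t1.00\n"
    "g0002\t1.25\n"
    "g0003\t2.40\n"
)

kept, excluded = split_by_dosage(example)
print(len(kept), "kept;", len(excluded), "excluded")
```

For a real run you would pass the contents of your own gene families file instead of the example string, and adjust dosage_column if the header uses a different name.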

In the case of multi-copy genes PIRATE will pick the longest representative sequence to include per genome.
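The "longest representative per genome" idea can be sketched as follows. This is a minimal illustration rather than PIRATE's implementation: the input layout (a dict from genome name to a list of sequences) and the genome names are hypothetical, and how PIRATE breaks ties between equally long copies is not stated in this thread (here the first one found wins).

```python
def longest_per_genome(copies):
    """Pick the longest sequence for each genome.

    copies: dict mapping genome name -> list of sequence strings
            (hypothetical layout, for illustration only).
    Genomes with no copies are skipped.
    """
    return {
        genome: max(seqs, key=len)  # on ties, max() keeps the first one
        for genome, seqs in copies.items()
        if seqs
    }


# Made-up example: genomeA carries two copies, genomeB carries one.
copies = {
    "genomeA": ["ATGAAA", "ATGAAATTTGGG"],
    "genomeB": ["ATGCCC"],
}

chosen = longest_per_genome(copies)
print(chosen)
```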

All the best, Sion

maguileraf commented 4 years ago

Thank you for the explanation. It makes more sense now.

SionBayliss commented 4 years ago

Glad I could help!