SionBayliss / PIRATE

A toolbox for pangenome analysis and threshold evaluation.
GNU General Public License v3.0

Pan genome from genomes of contigs #43

Closed milnus closed 4 years ago

milnus commented 4 years ago

I am trying to construct a pan genome using genomes that consist only of contigs. However, PIRATE gives me the following error:

- running mcl on pan_sequences at 50    
 - 0 clusters at 50 % - completed in 0 secs
 - running mcl on pan_sequences at 60    
 - ERROR: pangenome_construction.pl failed - error logged at /path/to/output/folder/fail_test.txt

fail_text.txt:

BLAST options error: File /path/to/output/folder/pangenome_iterations/pan_sequences.representative.fasta is empty
 - ERROR: no clusters in /path/to/output/folder/pangenome_iterations/pan_sequences.mcl_50.clusters

When I add a complete genome to the pool of genomes from which the pan genome is constructed, everything works fine. Is this a feature or a bug, and do you have an explanation for why this happens?

SionBayliss commented 4 years ago

Hi Magnus,

PIRATE runs on either complete or draft genomes. It looks like it isn't finding any CDS features in the GFF files. How were your genomes annotated and what input did you provide to PIRATE?
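One way to check whether the GFF files contain CDS features at all is to count them before running PIRATE. The following is a minimal sketch, not part of PIRATE: the function names are hypothetical, and it assumes standard tab-separated GFF3 with the feature type in the third column and, as in Prokka output, a `##FASTA` marker before the appended sequences.

```python
from pathlib import Path


def count_cds_features(gff_path):
    """Count rows whose feature type (column 3) is CDS in a GFF3 file."""
    count = 0
    with open(gff_path) as handle:
        for line in handle:
            # Assumption: sequences follow a ##FASTA marker, so stop reading there.
            if line.startswith("##FASTA"):
                break
            # Skip comments, directives and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3 and fields[2] == "CDS":
                count += 1
    return count


def summarise_directory(gff_dir):
    """Return {file name: CDS count} for every *.gff file in a directory."""
    return {
        path.name: count_cds_features(path)
        for path in sorted(Path(gff_dir).glob("*.gff"))
    }
```

Calling `summarise_directory("/path/to/gffs")` (a placeholder path) returns one count per file; a count of zero for any file would point to an annotation problem rather than a clustering problem.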

S

milnus commented 4 years ago

Hi Sion, thank you for the answer. I use GFF3 files produced by Prokka.

The stdout tells me that something is recognised ("- Loci file contains 17390 loci from 10 genomes."), and it matches the expected number of loci.

It can also run the initial CD-HIT clustering, which I would guess wouldn't be possible without any CDS features. However, I can see that all the loci are recognised as core, which leaves nothing for the MCL clustering. That might be what is wrong. By adding a complete genome I add non-core genes, which makes everything run smoothly.

SionBayliss commented 4 years ago

Hi Magnus,

I haven't encountered that situation before! All of your genomes must be >98% identical. If you provide PIRATE with the option --pan-opt "--cd-low 100" it should change the lowest CD-HIT threshold to 100% identity. Any variation in your genomes will then be reflected in the outputs. You should also change the step sizes with -s to include some higher BLAST thresholds (add 98,99,100 to the default range; see the help, -h).
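The suggestion above can be sketched as a small helper that merges the extra thresholds into a step list and assembles the command line. This is an illustration only: the helper names are hypothetical, the default step list shown is an assumption that should be checked against the output of -h, and -i / -o are assumed to be the input and output options. Only -s and --pan-opt "--cd-low 100" come from this thread.

```python
import shlex

# Assumption: PIRATE's default thresholds; confirm with the help output (-h).
ASSUMED_DEFAULT_STEPS = [50, 60, 70, 80, 90, 95, 98]


def merge_steps(default_steps, extra_steps):
    """Combine two threshold lists into one sorted list without duplicates."""
    return sorted(set(default_steps) | set(extra_steps))


def build_pirate_command(gff_dir, out_dir, steps, cd_low=100):
    """Assemble a shell-quoted PIRATE command line as a single string."""
    step_arg = ",".join(str(step) for step in steps)
    args = [
        "PIRATE",
        "-i", gff_dir,   # assumed input option (directory of GFF files)
        "-o", out_dir,   # assumed output option
        "-s", step_arg,  # thresholds, as suggested in the thread
        "--pan-opt", f"--cd-low {cd_low}",  # lowest CD-HIT threshold
    ]
    return " ".join(shlex.quote(arg) for arg in args)


steps = merge_steps(ASSUMED_DEFAULT_STEPS, [98, 99, 100])
print(build_pirate_command("/path/to/gffs", "/path/to/output/folder", steps))
```

The function only builds the command string; it does not run PIRATE, so the result can be inspected before being pasted into a shell.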

Very odd! You definitely caught my scripts out there :) S