SionBayliss / PIRATE

A toolbox for pangenome analysis and threshold evaluation.
GNU General Public License v3.0
88 stars 29 forks source link

0 core loci #62

Closed haruosuz closed 3 years ago

haruosuz commented 3 years ago

Dear developer,

I tried to identify core loci of plasmids belonging to IncP-1 groups (ACCESSION: BN000925 U67194 AB231906 AM157767 GQ983559 AM261282 AY950444 KC170283 KP324830 KX377410)

Running the gene screen method of LS-BSR (https://github.com/jasonsahl/LS-BSR) with ls_bsr.py -g AB231906.pep -b tblastn detected core loci for the 10 related plasmids.

However, running PIRATE with lower amino acid % identity thresholds (--steps 20,30,40,50,60,70,80,90,95,98) printed 0 core loci (0%).

I would be most grateful if you could suggest which parameters might work best for this purpose.

https://github.com/bactopia/bactopia/issues/100#issuecomment-733707307

SionBayliss commented 3 years ago

Dear Haruo,

I speculate that one of three things maybe happening:

1/ Plasmids contain many potential paralogs which will be split by PIRATE on default setting. The paralog splitting is fairly rudimentary and may over-splitting your core genes into multiple accessory genes. This could be checked by seeing how many gene families were split during the paralog splitting step (in the log file). Paralog splitting can also be switched off from the command line using --para-off.

2/ The methods for generating clusters from BLAST outputs via MCL are inherently different between the pipelines. LS-BSR uses Bit Score Ratio which has been shown to rescue diverged paralogs and maybe more sensitive than PIRATEs method for highly divergent genes (I have not compared them).

3/ I am unsure if LS-BSR filters BLAST HSPs at all. You could switch off filtering in PIRATE by selecting a minimum --steps size of 0 (e.g. --steps 0,10,20...). This will run MCL on all significant BLAST hits with an e-value <1E-6. You can adjust the e-value filter using the --pan-opt "--evalue *" if you require a more relaxed filter threshold.

I hope this helps in discovering the cause of the discrepancies.

All the best, Sion

haruosuz commented 3 years ago

Thank you for your advice.

I wonder if the option "-k" is equal to "--pan-opt"? https://github.com/SionBayliss/PIRATE#advanced-examples

Running PIRATE using --steps 0,10,20,30,50,70,90,95 --pan-opt "--evalue 0.1" --para-off still printed 0 core loci (0%). PIRATE.log

SionBayliss commented 3 years ago

Hi Haruo,

I think you have confused some of the outputs. The 'core loci' during pangenome construction (0) are those identified as >98% nucleotide identity by CD-HIT before clustering using BLASTP/MCL. This isn't core after the final clustering.

The pertinent part of the output is this part:

%isolates #clusters >1 allele fission/fusion multicopy 0-10% 83 3 0 0 10-25% 32 3 0 0 25-50% 15 5 0 0 50-75% 4 1 0 0 75-90% 5 1 0 0 90-95% 0 0 0 0 95-100% 29 19 0 0

This shows that 29 genes are found in >95% of all samples i.e. these are core.

The details on what these genes are will be found in PIRATE.gene_families.tsv (the number of genomes column).

All the best, Sion