SionBayliss / PIRATE

A toolbox for pangenome analysis and threshold evaluation.
GNU General Public License v3.0
90 stars 29 forks source link

Confused with terminology/output #67

Closed FilipeMatteoli closed 3 years ago

FilipeMatteoli commented 3 years ago

Thanks a lot for the tool, seems a great advance for pangenome studies. I was able to run PIRATE on my dataset but I am a bit confused about the output interpretation. I ran a pangenome estimation using the following command: PIRATE -i '/media/filipe/genomes_Rhza1/pangenome/pirate_out/gff' -o '/media/filipe/genomes_Rhza1/pangenome/pirate_out' -s "50,60,70,80,90,95,96,97,98" -t 12 -r

First the pangenome_summary.txt shows that I have 6280 gene families. I confirmed this value by the number of lines in the PIRATE.gene_families.ordered.tsv. The problem is that this number is far below the 12k pangenome size I had obtained earlier using Roary at 95% id threshold.

I proceeded to check my number of clusters per threshold graph, it starts around 6000 (at 50% threshold) and goes over 12000 at 98% threshold. By reading the tool paper I understood that these cluster values mean that as cd-hit tries to cluster the proteins using higher thresholds they split into distinct clusters, thus increasing the number of clusters.

After I assumed that I understood it correctly, I tried to understand your conclusion on the S. aureus pangenome in your paper with no avail. You stated that "The pangenome of S. aureus comprised 4,250 gene families", while the graph on 3C shows more than 7k clusters. This brought me back to the beginning where I am not sure whether clusters and gene families are the same.

Back to my data (I discussed your paper just to highlight that I had the same doubt both using my data or trying to understand yours). My main doubt is which identity threshold was used by PIRATE to determine that my dataset has 6280 gene families? Was it the lowest? How should I gain a better insight on which threshold is more accurate for my dataset based on plots? Should I explore the threshold column in PIRATE.gene_families.ordered.tsv to check how many families are clustering at a given threshold and then rerun lowering my highest cutoff?

Really appreciate any help. Best,

SionBayliss commented 3 years ago

Hi Filipe,

PIRATE clusters down to 98% (default) using CD-HIT, then performs all-vs-all BLAST/DIAMOND followed by hierarchical MCL clustering at different amino acid similarity cutoffs. The lowest % identity MCL clusters are used as the gene_families/clusters in the output of PIRATE and the highest %id value for that cluster at which all of the isolates are clustered together is reported as the 'threshold' value in the output file (e.g. all 4 sequences in geneA cluster together at 50,60,70,80,90% identity but split into two clusters of 2 sequences at 95% therefore 90% is reported as the threshold value in gene_families.tsv).

As PIRATE uses a much lower default %id value than roary it will almost always report smaller number of genes (gene_families). At higher thresholds (98%) these will be more in line with default roary. PIRATE reports these higher %id clustering as 'alleles' of a gene family.

PIRATE may report some genes with very high copy number per isolate, which may represent related gene families which are poorly resolved by sequence similarity and the rudimentary paralog splitting algorithm used in PIRATE. If this is the case there are a few things I can suggest to improve the resolution of the clustering.

As for the selection of thresholds, it will depend on if you have a clonal or diverse collection of isolates. If you have a reasonably clonal collection I would increase the default % identity threshold values as lower thresholds are providing little useful information and may lead to clustering of closely related genes (Note: you can also increase the CD-HIT clustering cutoff to provide finer scale allelic info for clonal collections). For diverse collections lower thresholds would be more suitable, but comparison to tools like roary, which are designed for relatively clonal collections/species, may not be productive.

All the best, Sion