Open whottel opened 5 months ago
Roary is no longer maintained. (https://github.com/sanger-pathogens/Roary). I ran into many issues with Roary that several other people had also run into, such as having samples randomly being dropped from the analysis.
When it came time to update Grandeur, I knew I needed to replace Roary with something currently maintained. The options I chose to experiment with were:
Admittedly, I didn't have a lot of time for this comparison. So in addition to my sample set (a cluster of seven Serratia samples), I relied on the literature. The two most influential papers were file:///Users/eriny/Downloads/s13059-020-02090-4.pdf and https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000690.
All of these tools tout themselves as being a roary replacement, but they are all slightly different from roary and from each other.
Here's a figure that may or may not be helpful: https://link.springer.com/article/10.1186/s13059-020-02090-4/figures/2
In summary, for core genome analysis, panaroo does identify more pairwise SNPs that Roary.
The default arguments for panaroo in Grandeur are --clean-mode strict --remove-invalid-genes
, which is supposedly the best way to remove possible contamination. As observed in the file that you sent to me, the total number of genes accepted for each isolate in the analysis is smaller (i.e. 2017-30-24 goes from 3,775 genes identified by roary to 3,702 genes identified by panaroo). Panaroo is more selective as to what is actually a gene. Genes are then matched to other isolates through other methods. Roary used blast, and panaroo uses cd-hit. It looks like panaroo is using genes in the core genome that roary does not match together. Your Version v3.6.20231219 (roary) has a core genome of 3,187 genes, while version 4+ has 3,255 genes. I imagine these genes hold most of your additional SNPs.
This unfortunately, is going to be a similar story with both ppanggolin and pirate as well.
Panaroo was used for Grandeur due to its results being basically the same in my sample set and ease of use. I had issues running pirate with singularity (perl issues ftw!), and ppanggolin is much more complicated to use.
I think there is a way to adjust panaroo to run more like roary (https://gthlab.au/panaroo/#/gettingstarted/params), but the sample set that I was using doesn't look like a good fit for this. Are the samples in your comparison on the SRA? It looks like it would be perfect for fine tuning some arguments.
Hi Erin,
Thanks for providing those details. Most of these in my example were sequenced by CDC and have not been uploaded to NCBI. However, I did reach out and there are fine with me sharing the files with you. I will send a shared drive link.
I have good news!
Yay?
These results were from using the contigs.
ppanggolin all --fastas list_of_fasta.txt --cpu 20 --outdir out
ppanggolin msa -p out/pangenome.h5 --cpu 20 --outdir out2 --phylo
snp-dists out2/core_genome_alignment.aln
This designates 3612 "persistent" genes, which is more than both roary and panaroo.
The SNPs identified, though, were much smaller. Suspiciously so.
snp-dists 0.8.2 2017-30-06 2017-30-24 2017-30-A 519946 572111
2017-30-06 0 0 0 1 0
2017-30-24 0 0 0 1 1
2017-30-A 0 0 0 1 1
519946 1 1 1 0 1
572111 0 1 1 1 0
Now for some bad news.
I ran your samples by adjusting several of panaroo's parameters listed at https://gtonkinhill.github.io/panaroo/#/gettingstarted/params
I didn't see a big difference in the core genome size. When I looked at the results from snp-dists, I don't see a real difference there either.
The SNP matrices of my exploration today.
# setting -f to 0.8 (core = 3,248)
==> 0.8_grandeur/snp-dists/snp_matrix.txt <==
snp-dists 0.8.2 2017-30-A 519946 Proteus_mirabilis_GCF_000069965.1 2017-30-24 2017-30-06 572111
2017-30-A 0 73 16890 207 201 48
519946 73 0 16413 278 274 90
Proteus_mirabilis_GCF_000069965.1 16890 16413 0 17111 17095 16373
2017-30-24 207 278 17111 0 8 253
2017-30-06 201 274 17095 8 0 249
572111 48 90 16373 253 249 0
# setting -f to 0.9 (core = 3,240)
==> 0.9_grandeur/snp-dists/snp_matrix.txt <==
snp-dists 0.8.2 2017-30-A 572111 2017-30-24 Proteus_mirabilis_GCF_000069965.1 519946 2017-30-06
2017-30-A 0 48 207 16055 77 201
572111 48 0 253 15546 94 249
2017-30-24 207 253 0 16276 282 8
Proteus_mirabilis_GCF_000069965.1 16055 15546 16276 0 15580 16260
519946 77 94 282 15580 0 278
2017-30-06 201 249 8 16260 278 0
# default grandeur (core = 3,257)
==> grandeur/snp-dists/snp_matrix.txt <==
snp-dists 0.8.2 Proteus_mirabilis_GCF_000069965.1 2017-30-24 2017-30-A 572111 519946 2017-30-06
Proteus_mirabilis_GCF_000069965.1 0 18694 18468 17988 17992 18673
2017-30-24 18694 0 207 289 278 8
2017-30-A 18468 207 0 84 73 201
572111 17988 289 84 0 126 285
519946 17992 278 73 126 0 274
2017-30-06 18673 8 201 285 274 0
# using --clean-mode moderate (core = 3,264)
==> moderate_grandeur/snp-dists/snp_matrix.txt <==
snp-dists 0.8.2 2017-30-A Proteus_mirabilis_GCF_000069965.1 572111 2017-30-06 519946 2017-30-24
2017-30-A 0 18469 84 201 73 208
Proteus_mirabilis_GCF_000069965.1 18469 0 17989 18674 17993 18696
572111 84 17989 0 285 126 290
2017-30-06 201 18674 285 0 274 9
519946 73 17993 126 274 0 279
2017-30-24 208 18696 290 9 279 0
# not using -a (core = 3,257)
==> noa_grandeur/snp-dists/snp_matrix.txt <==
snp-dists 0.8.2 2017-30-24 2017-30-06 2017-30-A Proteus_mirabilis_GCF_000069965.1 572111 519946
2017-30-24 0 8 207 18694 289 278
2017-30-06 8 0 201 18673 285 274
2017-30-A 207 201 0 18468 84 73
Proteus_mirabilis_GCF_000069965.1 18694 18673 18468 0 17988 17992
572111 289 285 84 17988 0 126
519946 278 274 73 17992 126 0
# using --clean-mode sensitive (core = 3,264)
==> sensitive_grandeur/snp-dists/snp_matrix.txt <==
snp-dists 0.8.2 2017-30-06 2017-30-A Proteus_mirabilis_GCF_000069965.1 519946 2017-30-24 572111
2017-30-06 0 201 18674 274 9 285
2017-30-A 201 0 18469 73 208 84
Proteus_mirabilis_GCF_000069965.1 18674 18469 0 17993 18696 17989
519946 274 73 17993 0 279 126
2017-30-24 9 208 18696 279 0 290
572111 285 84 17989 126 290 0
Looks like I can't use bugseq, so I can't compare refmlst. It looks like it uses minimap2 to identify genes of interest and it maps reads onto those genes.
Thanks for running all those tests, but still not a clear picture of what is going on here. I am surprised by the ppanggolin result showing everything within 0-1 SNP despite supposedly including the most genes in the analysis.
Thanks for running all those tests, but still not a clear picture of what is going on here. I am surprised by the ppanggolin result showing everything within 0-1 SNP despite supposedly including the most genes in the analysis.
My apologies for my late response.
I've copied the gene_presence_absence.Rtab
for panaroo and roary.
Roary uses blast to see if genes are the same between organisms. Panaroo uses CD-hit. As such, they determine different genes are shared between the samples.
For example, Roary determines arnC_1
is shared by all these samples, while panaroo does not. Similarly, arnC_4
is predicted to be shared by all samples with panaroo, but not roary. I think roary might be more effective at grouping similar genes together, which panaroo is grouping genes that might be more distant, hence why there are more SNPs.
The gene presence/absence files: panaroo_gene_presence_absence.Rtab.txt roary_gene_presence_absence.Rtab.txt
I have not found a combination of parameters for panaroo to adjust it such that it matches that of roary.
Pathogen Surveillance uses PIRATE (https://github.com/nf-core/pathogensurveillance/tree/dev) which is also blast based, so this workflow might be helpful to you. I would like to add PIRATE to my comparison, but I have not gotten to work with my local environment (perl issues with singularity).
The next version of grandeur (https://github.com/UPHL-BioNGS/Grandeur/pull/191) will support using roary instead of panaroo by setting the param.aligner to roary (--aligner roary on the command line) so you can continue to use roary.
I think bactopia also has a pirate option for pangenome analysis too.
Also, I JUST FOUND THIS PAPER/TOOL (https://github.com/maxgmarin/panqc). This should add some additional QC to panaroo. I'm going to try this tomorrow (or sometime soon).
Hello,
I went back through an old cluster analysis to see what differences in interpretation would potentially arise in using with an older grandeur version that uses roary versus newer versions that are now using panaroo. Please find a summary excel file with this comparison: Grandeur_Version_Matrix_Comparisons_Issue.xlsx
This comparison includes sequences that were part of a 2017 P. mirabilis outbreak. Grandeur with roary identified 0-3 SNPs among these outbreak sequences. With other versions of grandeur that use panaroo, two sequences seem to form their own subcluster and differ by >200 SNPs from the other members of the outbreak cluster. Surprisingly, 2017-A and 2017-D are from the same sample but again with panaroo differ by hundreds of SNPs. Also, included is a third-party refMLST analysis, which generally agrees with the roary SNP matrix as far as interpretation.