UPHL-BioNGS / Grandeur

UPHL's Reference Free Pipeline
GNU General Public License v3.0
23 stars 7 forks source link

Panaroo vs. Roary #188

Open whottel opened 5 months ago

whottel commented 5 months ago

Hello,

I went back through an old cluster analysis to see what differences in interpretation would potentially arise in using with an older grandeur version that uses roary versus newer versions that are now using panaroo. Please find a summary excel file with this comparison: Grandeur_Version_Matrix_Comparisons_Issue.xlsx

This comparison includes sequences that were part of a 2017 P. mirabilis outbreak. Grandeur with roary identified 0-3 SNPs among these outbreak sequences. With other versions of grandeur that use panaroo, two sequences seem to form their own subcluster and differ by >200 SNPs from the other members of the outbreak cluster. Surprisingly, 2017-A and 2017-D are from the same sample but again with panaroo differ by hundreds of SNPs. Also, included is a third-party refMLST analysis, which generally agrees with the roary SNP matrix as far as interpretation.

erinyoung commented 5 months ago

Roary is no longer maintained. (https://github.com/sanger-pathogens/Roary). I ran into many issues with Roary that several other people had also run into, such as having samples randomly being dropped from the analysis.

When it came time to update Grandeur, I knew I needed to replace Roary with something currently maintained. The options I chose to experiment with were:

Admittedly, I didn't have a lot of time for this comparison. So in addition to my sample set (a cluster of seven Serratia samples), I relied on the literature. The two most influential papers were file:///Users/eriny/Downloads/s13059-020-02090-4.pdf and https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000690.

All of these tools tout themselves as being a roary replacement, but they are all slightly different from roary and from each other.

Here's a figure that may or may not be helpful: https://link.springer.com/article/10.1186/s13059-020-02090-4/figures/2

In summary, for core genome analysis, panaroo does identify more pairwise SNPs that Roary.

The default arguments for panaroo in Grandeur are --clean-mode strict --remove-invalid-genes, which is supposedly the best way to remove possible contamination. As observed in the file that you sent to me, the total number of genes accepted for each isolate in the analysis is smaller (i.e. 2017-30-24 goes from 3,775 genes identified by roary to 3,702 genes identified by panaroo). Panaroo is more selective as to what is actually a gene. Genes are then matched to other isolates through other methods. Roary used blast, and panaroo uses cd-hit. It looks like panaroo is using genes in the core genome that roary does not match together. Your Version v3.6.20231219 (roary) has a core genome of 3,187 genes, while version 4+ has 3,255 genes. I imagine these genes hold most of your additional SNPs.

This unfortunately, is going to be a similar story with both ppanggolin and pirate as well.

Panaroo was used for Grandeur due to its results being basically the same in my sample set and ease of use. I had issues running pirate with singularity (perl issues ftw!), and ppanggolin is much more complicated to use.

I think there is a way to adjust panaroo to run more like roary (https://gthlab.au/panaroo/#/gettingstarted/params), but the sample set that I was using doesn't look like a good fit for this. Are the samples in your comparison on the SRA? It looks like it would be perfect for fine tuning some arguments.

whottel commented 4 months ago

Hi Erin,

Thanks for providing those details. Most of these in my example were sequenced by CDC and have not been uploaded to NCBI. However, I did reach out and there are fine with me sharing the files with you. I will send a shared drive link.

erinyoung commented 4 months ago

I have good news!

  1. I was able to replicate your findings

Yay?

  1. I also ran these files through ppanggolin, just to see if those results to be more like roary.

These results were from using the contigs.

ppanggolin all --fastas list_of_fasta.txt --cpu 20 --outdir out
ppanggolin msa -p out/pangenome.h5 --cpu 20 --outdir out2 --phylo
snp-dists out2/core_genome_alignment.aln

This designates 3612 "persistent" genes, which is more than both roary and panaroo.

The SNPs identified, though, were much smaller. Suspiciously so.

snp-dists 0.8.2 2017-30-06      2017-30-24      2017-30-A       519946  572111
2017-30-06      0       0       0       1       0
2017-30-24      0       0       0       1       1
2017-30-A       0       0       0       1       1
519946  1       1       1       0       1
572111  0       1       1       1       0
erinyoung commented 4 months ago

Now for some bad news.

I ran your samples by adjusting several of panaroo's parameters listed at https://gtonkinhill.github.io/panaroo/#/gettingstarted/params

I didn't see a big difference in the core genome size. When I looked at the results from snp-dists, I don't see a real difference there either.

The SNP matrices of my exploration today.

# setting -f to 0.8 (core = 3,248)
==> 0.8_grandeur/snp-dists/snp_matrix.txt <==                       
snp-dists 0.8.2 2017-30-A   519946  Proteus_mirabilis_GCF_000069965.1   2017-30-24  2017-30-06  572111
2017-30-A   0   73  16890   207 201 48
519946  73  0   16413   278 274 90
Proteus_mirabilis_GCF_000069965.1   16890   16413   0   17111   17095   16373
2017-30-24  207 278 17111   0   8   253
2017-30-06  201 274 17095   8   0   249
572111  48  90  16373   253 249 0

# setting -f to 0.9 (core = 3,240)
==> 0.9_grandeur/snp-dists/snp_matrix.txt <==                       
snp-dists 0.8.2 2017-30-A   572111  2017-30-24  Proteus_mirabilis_GCF_000069965.1   519946  2017-30-06
2017-30-A   0   48  207 16055   77  201
572111  48  0   253 15546   94  249
2017-30-24  207 253 0   16276   282 8
Proteus_mirabilis_GCF_000069965.1   16055   15546   16276   0   15580   16260
519946  77  94  282 15580   0   278
2017-30-06  201 249 8   16260   278 0

# default grandeur (core = 3,257)
==> grandeur/snp-dists/snp_matrix.txt <==                       
snp-dists 0.8.2 Proteus_mirabilis_GCF_000069965.1   2017-30-24  2017-30-A   572111  519946  2017-30-06
Proteus_mirabilis_GCF_000069965.1   0   18694   18468   17988   17992   18673
2017-30-24  18694   0   207 289 278 8
2017-30-A   18468   207 0   84  73  201
572111  17988   289 84  0   126 285
519946  17992   278 73  126 0   274
2017-30-06  18673   8   201 285 274 0

# using --clean-mode moderate (core = 3,264)
==> moderate_grandeur/snp-dists/snp_matrix.txt <==                      
snp-dists 0.8.2 2017-30-A   Proteus_mirabilis_GCF_000069965.1   572111  2017-30-06  519946  2017-30-24
2017-30-A   0   18469   84  201 73  208
Proteus_mirabilis_GCF_000069965.1   18469   0   17989   18674   17993   18696
572111  84  17989   0   285 126 290
2017-30-06  201 18674   285 0   274 9
519946  73  17993   126 274 0   279
2017-30-24  208 18696   290 9   279 0

# not using -a (core =  3,257)  
==> noa_grandeur/snp-dists/snp_matrix.txt <==                       
snp-dists 0.8.2 2017-30-24  2017-30-06  2017-30-A   Proteus_mirabilis_GCF_000069965.1   572111  519946
2017-30-24  0   8   207 18694   289 278
2017-30-06  8   0   201 18673   285 274
2017-30-A   207 201 0   18468   84  73
Proteus_mirabilis_GCF_000069965.1   18694   18673   18468   0   17988   17992
572111  289 285 84  17988   0   126
519946  278 274 73  17992   126 0

# using --clean-mode sensitive (core = 3,264)
==> sensitive_grandeur/snp-dists/snp_matrix.txt <==                     
snp-dists 0.8.2 2017-30-06  2017-30-A   Proteus_mirabilis_GCF_000069965.1   519946  2017-30-24  572111
2017-30-06  0   201 18674   274 9   285
2017-30-A   201 0   18469   73  208 84
Proteus_mirabilis_GCF_000069965.1   18674   18469   0   17993   18696   17989
519946  274 73  17993   0   279 126
2017-30-24  9   208 18696   279 0   290
572111  285 84  17989   126 290 0
erinyoung commented 4 months ago

Looks like I can't use bugseq, so I can't compare refmlst. It looks like it uses minimap2 to identify genes of interest and it maps reads onto those genes.

whottel commented 4 months ago

Thanks for running all those tests, but still not a clear picture of what is going on here. I am surprised by the ppanggolin result showing everything within 0-1 SNP despite supposedly including the most genes in the analysis.

erinyoung commented 4 months ago

Thanks for running all those tests, but still not a clear picture of what is going on here. I am surprised by the ppanggolin result showing everything within 0-1 SNP despite supposedly including the most genes in the analysis.

My apologies for my late response.

I've copied the gene_presence_absence.Rtab for panaroo and roary.

Roary uses blast to see if genes are the same between organisms. Panaroo uses CD-hit. As such, they determine different genes are shared between the samples.

For example, Roary determines arnC_1 is shared by all these samples, while panaroo does not. Similarly, arnC_4 is predicted to be shared by all samples with panaroo, but not roary. I think roary might be more effective at grouping similar genes together, which panaroo is grouping genes that might be more distant, hence why there are more SNPs.

The gene presence/absence files: panaroo_gene_presence_absence.Rtab.txt roary_gene_presence_absence.Rtab.txt

I have not found a combination of parameters for panaroo to adjust it such that it matches that of roary.

Pathogen Surveillance uses PIRATE (https://github.com/nf-core/pathogensurveillance/tree/dev) which is also blast based, so this workflow might be helpful to you. I would like to add PIRATE to my comparison, but I have not gotten to work with my local environment (perl issues with singularity).

The next version of grandeur (https://github.com/UPHL-BioNGS/Grandeur/pull/191) will support using roary instead of panaroo by setting the param.aligner to roary (--aligner roary on the command line) so you can continue to use roary.

erinyoung commented 4 months ago

I think bactopia also has a pirate option for pangenome analysis too.

erinyoung commented 4 months ago

Also, I JUST FOUND THIS PAPER/TOOL (https://github.com/maxgmarin/panqc). This should add some additional QC to panaroo. I'm going to try this tomorrow (or sometime soon).