CDCgov / phoenix

🔥🐦🔥PHoeNIx: A short-read pipeline for healthcare-associated and antimicrobial resistant pathogens
Apache License 2.0

MLST calling wrong scheme #130

Closed vascokarla closed 6 months ago

vascokarla commented 8 months ago

Describe the bug

The MLST (Multi-Locus Sequence Typing) analysis is yielding conflicting results for a sample that was classified as Escherichia coli but is being assigned the aeromonas scheme (ST2363) instead of the ecoli(Achtman) scheme (ST410).

Impact

This bug is causing confusion and uncertainty about the true taxonomic identity and ST of the sample, which is critical for downstream analyses and interpretations of the sequencing data. We don't know if it's caused by a sequencing error or a software bug.

To Reproduce

Steps to reproduce the behavior:

  1. Environment: [HPC]

  2. Pipeline Version: [PHoeNIx v2.0.2, CHECK_MLST: python: 3.7.12, MLST: mlst: 2.23.0, mlst_db: '2023-07-28']

  3. Command:

    # run phoenix
    nextflow run $phoenix_path -profile singularity -entry PHOENIX --input $manifest --kraken2db $kraken2db --outdir $outdir/phoenix --max_cpus $threads --max_memory $memory

  4. Error Message: None [Pipeline completed successfully]. Sample warnings: "Average Q30 of raw R1 reads <90.00%, <50% of reads assigned to top genera hit (11.09%), Check 1st MLST scheme matches taxa IDed."

Expected behavior

The MLST analysis should consistently identify the sample as Escherichia coli, as per the initial taxonomic classification.

Screenshots

When running MLST locally, these are the tail results: [image: mlst_ecoli]

Additional context

The sample had 60X coverage, 108 contigs, and an assembly ratio of 0.9640_(.5215)

nvlachos commented 8 months ago

Hi @vascokarla, thanks for letting us know about this bug. This seems to be a recurring issue with the newest release of the MLST tool and how it calculates the best match. E. coli uses an 8-allele set, and when the tool matches the full set of 7 for another organism, it reports that as the match. We've seen a couple pop up, but Aeromonas is the most common. The good news is that we do have a fix, although it is slightly incomplete. I will patch it up, and it should be included in the next release, which should be out very shortly. I'll check back in with you once it is out to make sure it functions as expected!
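For readers who want to screen their own results for this kind of mismatch, here is a minimal sketch of a taxonomy-aware check. It is not the PHoeNIx patch: the `EXPECTED_SCHEMES` lookup and the function names are hypothetical, and the lookup only covers scheme names that appear in this thread.

```python
import re

# Hypothetical genus -> MLST scheme lookup (assumption for illustration only);
# limited to the scheme names mentioned in this thread.
EXPECTED_SCHEMES = {
    "Escherichia": {"ecoli", "ecoli_2"},
    "Enterobacter": {"ecloacae"},
    "Aeromonas": {"aeromonas"},
    "Cronobacter": {"cronobacter"},
}


def normalize_scheme(label):
    """Strip a parenthetical database suffix, e.g. 'ecoli(Achtman)' -> 'ecoli'."""
    return re.sub(r"\(.*\)$", "", label.strip())


def scheme_matches_taxon(species, scheme_label):
    """Return True or False when the genus is in the lookup, None otherwise."""
    genus = species.split()[0]
    expected = EXPECTED_SCHEMES.get(genus)
    if expected is None:
        return None
    return normalize_scheme(scheme_label) in expected


def pick_scheme(species, candidates):
    """Prefer the first candidate consistent with the taxon; else keep the top call."""
    for label in candidates:
        if scheme_matches_taxon(species, label):
            return label
    return candidates[0] if candidates else None
```

With this lookup, `scheme_matches_taxon("Escherichia coli", "aeromonas")` returns `False`, which corresponds to the situation the "Check 1st MLST scheme matches taxa IDed" warning is flagging. Returning `None` for unknown genera keeps the check from raising false alarms for organisms the lookup does not cover.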

vascokarla commented 8 months ago

Thank you so much for your response. Cannot wait for the new updates :) Karla

jvhagey commented 7 months ago

Hi @vascokarla, a new version of phoenix has been released (v2.1.0). Can you run it and confirm for us that your issue is now resolved? Thanks!

vascokarla commented 7 months ago

Hi @jvhagey, I was able to try the new version. It did identify the correct MLST scheme for E. coli! However, we also tried a new sample that was identified as Enterobacter hormaechei, and it was assigned the MLST scheme cronobacter (novel ST) with both PHoeNIx v2.0.2 and v2.1.0. We ran MLST (with filtered scaffolds) using mlst v2.22.0 (T. Seemann), and it was assigned the scheme ecloacae (novel ST). I see that there are exact matches with cronobacter...

I'm adding the MLST stdout for this sample here, in case that helps:

    [11:37:13] This is mlst 2.22.0 running on linux with Perl 5.032001
    [11:37:13] Checking mlst dependencies:
    [11:37:13] Found 'blastn' => /opt/conda/envs/mlst/bin/blastn
    [11:37:13] Found 'any2fasta' => /opt/conda/envs/mlst/bin/any2fasta
    [11:37:14] Found blastn: 2.12.0+ (002012)
    [11:37:14] Excluding 3 schemes: abaumannii ecoli vcholerae_2
    [11:37:16] Found exact allele match cronobacter.pps-399
    [11:37:16] Found exact allele match ecloacae.pyrG-39
    [11:37:16] Found exact allele match aeromonas.gyrB-795
    [11:37:16] Found exact allele match ecloacae.dnaA-62
    [11:37:16] Found exact allele match ecloacae.gyrB-4
    [11:37:16] Found exact allele match cronobacter.gyrB-100
    [11:37:16] Found exact allele match cronobacter.atpD-211
    [11:37:16] Found exact allele match cronobacter.gltB-187
    [11:37:16] Found exact allele match cronobacter.infB-99
    [11:37:16] Found exact allele match ecloacae.fusA-4
    [11:37:16] Found exact allele match cronobacter.fusA-75
    [11:37:16] Found exact allele match ecloacae.rpoB-44
    [11:37:16] Found exact allele match ecloacae.rplB-4
    XXXXXX.filtered.scaffolds.fa.gz ecloacae - dnaA(62) fusA(4) gyrB(4) leuS(~6) pyrG(39) rplB(4) rpoB(44)
    [11:37:16] Please also cite 'Jolley & Maiden 2010, BMC Bioinf, 11:595' if you use mlst.
    [11:37:16] Done.
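To see how close the competing schemes are, the exact allele matches in a log like the one above can be tallied per scheme. The following is a rough sketch: the function name and regular expression are my own, and the pattern assumes the "Found exact allele match scheme.locus-allele" wording shown in this log, which may differ in other mlst versions.

```python
import re
from collections import Counter

# Assumes log lines of the form "Found exact allele match <scheme>.<locus>-<allele>".
ALLELE_RE = re.compile(r"Found exact allele match (\S+?)\.(\S+)-(\d+)")


def exact_matches_per_scheme(log_text):
    """Count exact allele matches for each scheme named in an mlst log."""
    counts = Counter()
    for scheme, _locus, _allele in ALLELE_RE.findall(log_text):
        counts[scheme] += 1
    return counts


# Allele-match lines copied from the log posted in this thread.
SAMPLE_LOG = """
[11:37:16] Found exact allele match cronobacter.pps-399
[11:37:16] Found exact allele match ecloacae.pyrG-39
[11:37:16] Found exact allele match aeromonas.gyrB-795
[11:37:16] Found exact allele match ecloacae.dnaA-62
[11:37:16] Found exact allele match ecloacae.gyrB-4
[11:37:16] Found exact allele match cronobacter.gyrB-100
[11:37:16] Found exact allele match cronobacter.atpD-211
[11:37:16] Found exact allele match cronobacter.gltB-187
[11:37:16] Found exact allele match cronobacter.infB-99
[11:37:16] Found exact allele match ecloacae.fusA-4
[11:37:16] Found exact allele match cronobacter.fusA-75
[11:37:16] Found exact allele match ecloacae.rpoB-44
[11:37:16] Found exact allele match ecloacae.rplB-4
"""

if __name__ == "__main__":
    for scheme, n in exact_matches_per_scheme(SAMPLE_LOG).most_common():
        print(scheme, n)
```

For this sample the tally is 6 exact matches each for cronobacter and ecloacae and 1 for aeromonas, which shows how narrowly the two schemes are separated when only exact allele hits are counted.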

jvhagey commented 6 months ago

@vascokarla, thanks for the info. Are you able to test the fix in v2.1.1-dev and confirm that it fixes this issue? The command would be:

    nextflow run cdcgov/phoenix -r v2.1.1-dev -profile singularity -entry PHOENIX --input $manifest --kraken2db $kraken2db --outdir $outdir/phoenix --max_cpus $threads --max_memory $memory

vascokarla commented 6 months ago

Hi @jvhagey. The MLST scheme was correct this time using version v2.1.1-dev for both E. coli and Enterobacter. I'm sharing part of the results below. Thank you so much for your quick help with this!

Species Taxa_Confidence Taxa_Coverage Taxa_Source Kraken2_Trimd Kraken2_Weighted MLST_Scheme_1 MLST_1 MLST_Scheme_2 MLST_2
Escherichia coli 99.98 ANI_match 99.49 ANI_REFSEQ Escherichia(10.41%) coli(8.96%) Escherichia(97.19%) coli(97.19%) ecoli(Achtman) ST410 ecoli_2(Pasteur) Novel_allele
Enterobacter hormaechei 99.39 ANI_match 90.61 ANI_REFSEQ Enterobacter(79.94%) hormaechei(13.71%) Enterobacter(94.70%) hormaechei(93.28%) ecloacae Novel_allele - -

jvhagey commented 6 months ago

Thank you, the patch will be released this week.