CFSAN-Biostatistics / shigatyper

CFSAN Shigella Typing Pipeline

Wrong prediction of Shigella boydii serotype 20 #6

Open wolthuisr opened 2 years ago

wolthuisr commented 2 years ago

Hi,

I am using the ShigaTyper tool to analyze multiple Shigella subtypes. One of the subtypes I am interested in is Shigella boydii serotype 20. I noticed that if there is a heparinase hit, the tool is supposed to return Shigella boydii serotype 20 (line 591).

We inspected the samples manually and matched the gene sequence against the sample sequences. As expected, the gene is present in the samples, but the ShigaTyper script does not seem to recognize these hits and instead identifies the samples as Shigella boydii serotype 1.

Other users may get this wrong prediction as well, so I was wondering whether there is an explanation for this and whether it can be fixed.

Looking forward to a response!

Kind regards, Roxanne

florathecat commented 2 years ago

Thank you, Roxanne, for your email. I’ll look at the script; it was able to identify S. boydii 20 in our test.

Note that for a gene to be included in the typing scheme, a minimum of 20% gene coverage is required. I may come back to you to request the raw data of your strain for a definitive explanation.

Thanks again for helping us improve our script.

Yun
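
For readers following along, the 20% coverage rule mentioned above can be illustrated with a minimal sketch. This is not ShigaTyper's actual code: the function names, the 0-based half-open interval convention, and the idea of merging alignment intervals are all assumptions made for illustration. Only the 20% figure comes from the comment.

```python
MIN_COVERAGE = 0.20  # threshold quoted in the comment above


def covered_fraction(gene_length, intervals):
    """Fraction of a reference gene covered by at least one alignment.

    `intervals` holds hypothetical 0-based, half-open (start, end) pairs
    on the reference gene. Overlapping intervals are counted only once.
    """
    if gene_length <= 0:
        raise ValueError("gene_length must be positive")
    covered = 0
    current_end = 0  # right edge of the region counted so far
    for start, end in sorted(intervals):
        start = max(start, current_end, 0)  # skip bases already counted
        end = min(end, gene_length)         # clip to the gene
        if end > start:
            covered += end - start
            current_end = end
    return covered / gene_length


def passes_threshold(gene_length, intervals, minimum=MIN_COVERAGE):
    """True if the covered fraction reaches the minimum coverage."""
    return covered_fraction(gene_length, intervals) >= minimum
```

Under these assumptions, a gene that is missing from the reference file can never pass the threshold, because nothing can align to it.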

wolthuisr commented 2 years ago

Hi Yun,

We used some public samples with accession numbers SRR3020611 and SRR5330512 (ENA). For these samples we don't get any hits on the heparinase gene.

Hope this could help explain the issue!

Roxanne

florathecat commented 2 years ago

Hi Roxanne,

That’s very useful information. SRR3020611 was the founding strain used to develop the script in Jupyter (IPython). If the script in Bioconda doesn’t recognize the heparinase gene, some file must have been corrupted when we converted the files. I’ll look at it.

Yun

rpetit3 commented 2 years ago

Should a reference gene for heparinase be in https://github.com/CFSAN-Biostatistics/shigatyper/blob/master/shigatyper/resources/ShigellaRef5.fasta?

If so, there isn't one, and that is likely the cause of this issue.

rpetit3 commented 2 years ago

Did some testing.

Without heparinase in reference fasta

sample  prediction      ipaB
SRX1486859      Shigella boydii serotype 1      -

After adding this sequence for heparinase (https://www.ncbi.nlm.nih.gov/nuccore/CP016036.1?from=2803&to=4428&report=fasta&strand=2) to the ShigellaRef5.fasta file, it gets serotype 20:

sample  prediction      ipaB
SRX1486859      Shigella boydii serotype 20     -
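
For anyone wanting to reproduce this kind of fix by hand, here is a minimal sketch of cutting a region out of a longer sequence and formatting it as a FASTA record. It assumes that `from`/`to` in the URL above are 1-based inclusive coordinates and that `strand=2` means the minus strand (i.e. the reverse complement); the function names and the example header are hypothetical and not part of ShigaTyper.

```python
# Complement table for unambiguous bases; other characters (e.g. N)
# pass through unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_region(sequence, start, end, minus_strand=False):
    """Return sequence[start..end] using 1-based inclusive coordinates.

    With minus_strand=True the region is reverse-complemented.
    """
    if not (1 <= start <= end <= len(sequence)):
        raise ValueError("coordinates out of range")
    region = sequence[start - 1:end]
    if minus_strand:
        region = region.translate(_COMPLEMENT)[::-1]
    return region


def format_fasta(header, sequence, width=70):
    """Format one FASTA record with wrapped sequence lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"
```

With the coordinates from the URL (2803 to 4428), the extracted region would be 4428 - 2803 + 1 = 1626 bases long. The formatted record could then be appended to a copy of the reference file, for example with `open(path, "a").write(format_fasta("heparinase", region))`, where the header name is only a placeholder.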

rpetit3 commented 2 years ago

Haha final comment.

Comparing the genes in ShigellaRef5.fasta and Table 2 in the paper (https://journals.asm.org/doi/10.1128/AEM.00165-19), the following sequences are in Table 2 but not in ShigellaRef5.fasta:

Heparinase
Sat_N
ShET1
ShET2
Stx1
Stx2

Of these, I think only heparinase is used by ShigaTyper.
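
A comparison like the one above can be sketched in a few lines. This assumes that each expected gene name appears somewhere in a FASTA header line (case-insensitive substring match); how the headers in ShigellaRef5.fasta are actually named is not shown in this thread, so the matching rule and the function names are hypothetical.

```python
def fasta_headers(lines):
    """Collect header text (without the leading '>') from FASTA lines."""
    return [line[1:].strip() for line in lines if line.startswith(">")]


def missing_genes(expected, headers):
    """Return the expected gene names that match no header.

    Matching is a case-insensitive substring test, which is an
    assumption about how the reference records are named.
    """
    lowered = [h.lower() for h in headers]
    return [g for g in expected if not any(g.lower() in h for h in lowered)]


# Gene names taken from the list above.
EXPECTED = ["Heparinase", "Sat_N", "ShET1", "ShET2", "Stx1", "Stx2"]
```

Usage would look something like `with open("ShigellaRef5.fasta") as fh: print(missing_genes(EXPECTED, fasta_headers(fh)))`. A substring match can give false positives for short names, so any result is worth checking by eye.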

rpetit3 commented 2 years ago

Hi @wolthuisr

This should be fixed in v2 of ShigaTyper.

Cheers

florathecat commented 1 year ago

I see that the current version of ShigaTyper does not contain the Shiga toxins and enterotoxins that the later version I included in the paper does. That is primarily because the output I originally envisioned, using an IPython/Jupyter notebook, is different from what most people prefer in a server environment. So the current ShigaTyper only gives you a single output: a serotype. (We also debated whether we should include heparinase for S. boydii 20 in this paper or save it for another paper.) I am not as code-savvy as the CFSAN guys or most people on GitHub. Please let me know how helpful or informative it would be if the script output included another column for the toxins identified.