CFSAN-Biostatistics / shigatyper

CFSAN Shigella Typing Pipeline

Serotype prediction for Shigella flexneri #11

Closed nokcs closed 1 year ago

nokcs commented 2 years ago

Hi Shigatyper team,

The tool and reference allele files are very handy. Recently, we have assembly data and tried to use the "rules" from shigatyper to assign serotype based on the "hits." However, we are unsure if we misunderstood some points.

For example, given Hits = [ipaH_c, gtrI, gtrIC, ipaB, Sf_wzx, Sf_wzy], ipaH_c and ipaB are considered and removed in the beginning part. Now Hits = [gtrI, gtrIC, Sf_wzx, Sf_wzy] and we are at line 422 because Sf_wzx is in Hits. Then we are at line 444 and Sf_wzy is removed, so now Hits = [gtrI, gtrIC, Sf_wzx]. Because Sf_wzx is still in Hits, Hits will not be equal to any of the Shigella flexneri serotypes in the SfDict (line 464), and will also not be a subset of any of them (line 466).

My guess is we have serotype 1c (7a) because it has both gtrI and gtrIC but not Oac1b. However, because Sf_wzx is still in Hits, our isolate will be predicted as a novel serotype. Could you help to clarify this point, please?

Thank you very much!

florathecat commented 1 year ago

Hello,

I am traveling and can’t look at the lines now. But if I remember correctly, ipaH_c and ipaB are removed early on, after the strain is determined to be a Shigella with the pINV plasmid. Sf_wzx is removed first, after the strain is determined to be an S. flexneri, followed by Sf_wzy (Sf_wzy is not used for determination purposes). The rest of the hits are screened in the DIC sub-serotype determination.
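To make the difference concrete, here is a minimal Python sketch of the two orderings discussed in this thread. It is not the shigatyper source: `predict_flexneri`, the `remove_wzx` switch, and the single-entry `SF_DICT` are hypothetical names for illustration, and the 1c (7a) signature comes from nokcs's guess rather than from the real lookup table. The subset branch mentioned in the question is left out.

```python
# Hypothetical sketch of the flow described in this thread; not the shigatyper source.
# SF_DICT holds one illustrative entry, assumed from the guess in the question.
SF_DICT = {
    "1c (7a)": {"gtrI", "gtrIC"},
}


def predict_flexneri(hits, remove_wzx):
    """Return a prediction string, or None if Sf_wzx is absent."""
    # Invasion-plasmid markers are dropped first.
    remaining = set(hits) - {"ipaH_c", "ipaB"}
    if "Sf_wzx" not in remaining:
        return None  # not handled by this sketch
    # Sf_wzy is not used for sub-serotype determination.
    remaining.discard("Sf_wzy")
    # The described intent is that Sf_wzx is removed as well.
    if remove_wzx:
        remaining.discard("Sf_wzx")
    # Exact set match against each known signature.
    for serotype, signature in SF_DICT.items():
        if remaining == signature:
            return f"Shigella flexneri serotype {serotype}"
    return "Shigella flexneri, novel serotype"


hits = ["ipaH_c", "gtrI", "gtrIC", "ipaB", "Sf_wzx", "Sf_wzy"]
print(predict_flexneri(hits, remove_wzx=False))  # Shigella flexneri, novel serotype
print(predict_flexneri(hits, remove_wzx=True))   # Shigella flexneri serotype 1c (7a)
```

With Sf_wzx left in the set, the equality test can never succeed, which matches the "novel serotype" result reported in the question.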

I’ll look at the lines after I go back. Hope this helps,

Yun

lexleong commented 1 year ago

Hi Yun, We have the same issue as @nokcs. All our S. flexneri isolates have been identified as novel serotypes, when we know that they are not novel. Cheers

florathecat commented 1 year ago

Hi Lexleong,

Strange. I thought we fixed it. Maybe we didn’t. I’ll look at it and communicate with Justin.

Yun

dwinter commented 1 year ago

Hi all,

We noticed that Shigella that were given a serotype in the past are now becoming 'novel' with more recent versions of shigatyper.

A bit of sleuthing on a couple of test cases shows this change occurred with v1.0.7.

As far as I can tell, the isolate.csv file produced by both versions is the same.

The following results are given as 'novel' by 1.0.7 and as Shigella flexneri serotype 1b by 1.0.6.

Happy to share .fq files if it's a help (david.winter@esr.cri.nz)

,Hit,Number of reads,Length Covered,reference length,% covered,Number of variants,% accuracy
0,ipaH_c,4111,779.0,780,99.9,0.0,100.0
1,ipaB,20,712.0,1743,40.8,0.0,100.0
2,Sf_wzx,360,1253.0,1257,99.7,0.0,100.0
3,Sf_wzy,139,1106.0,1149,96.3,0.0,100.0
4,gtrI,562,1518.0,1521,99.8,0.0,100.0
5,Oac,5,0.0,1002,0.0,0.0,
6,Oac1b,692,999.0,1002,99.7,0.0,100.0

crashfrog commented 1 year ago

Thanks everyone, this should be fixed in 2.0.3, which is now out in Bioconda. Update in GalaxyTrakr to come soon.

lexleong commented 1 year ago

Hi Yun, Justin and colleagues, We have tried shigatyper v2.0.3 with our S. flexneri isolates, and now it is not producing any Shigella prediction for them at all. It reports other Shigella species but not S. flexneri. I have shared an example shigatyper output file.

Sample | Prediction | ipaB | Notes
-- | -- | -- | --
17701718 |  | + | this strain is ipaB+, suggesting that it retains the virulent invasion plasmid.
17701888 | Shigella sonnei form II | - |
17702745 |  | + | this strain is ipaB+, suggesting that it retains the virulent invasion plasmid.
17703685 |  | + | this strain is ipaB+, suggesting that it retains the virulent invasion plasmid.
17703697 |  | + | this strain is ipaB+, suggesting that it retains the virulent invasion plasmid.
2228100185 |  | + | this strain is ipaB+, suggesting that it retains the virulent invasion plasmid.

kapsakcj commented 1 year ago

Commenting to say the same - I'm using shigatyper v2.0.3 and seeing the same phenomenon where the main output TSV does not list a serotype for most Shigella flexneri samples (but does so for other Shigella species).

I've tested with numerous Shigella flexneri serotypes, and the only exception where ShigaTyper does produce the genus/species/serotype is Shigella flexneri serotype 6 (example at the end).

Example TSV:

sample     prediction  ipaB  notes
my-sample              +     this strain is ipaB+, suggesting that it retains the virulent invasion plasmid.

Example hits TSV:

   Hit     Number of reads  Length Covered  reference length  % covered  Number of variants  % accuracy
0  ipaH_c  5709             780             780               100.0      8.0                 99.0
1  ipaB    645              1721            1743              98.7       10.0                99.4
2  Sf_wzx  534              1253            1257              99.7       0.0                 100.0
3  Sf_wzy  311              1148            1149              99.9       0.0                 100.0
4  gtrII   754              1449            1461              99.2       1.0                 99.9

In the example above, I expected ShigaTyper to report Shigella flexneri 2a in the "prediction" column.


2nd example - Shigella flexneri 6:

sample           prediction                    ipaB  notes
my-other-sample  Shigella flexneri serotype 6  +     this strain is ipaB+, suggesting that it retains the virulent invasion plasmid.

Hits TSV - Shigella flexneri 6:

   Hit      Number of reads  Length Covered  reference length  % covered  Number of variants  % accuracy
0  ipaH_c   4046             779             780               99.9       5.0                 99.4
1  ipaB     942              1655            1743              95.0       21.0                98.7
2  Sf6_wzx  508              1228            1233              99.6       0.0                 100.0
3  Sf6_wzy  513              1183            1188              99.6       0.0                 100.0

kapsakcj commented 1 year ago

To add a bit more context, when I re-tried the same sample from my first example with shigatyper v2.0.2, it predicted "Shigella flexneri, novel serotype" in the "prediction" column of the output TSV.

   Hit     Number of reads  Length Covered  reference length  % covered  Number of variants  % accuracy
0  ipaH_c  5709             780             780               100.0      8.0                 99.0
1  ipaB    645              1721            1743              98.7       10.0                99.4
2  Sf_wzx  534              1253            1257              99.7       0.0                 100.0
3  Sf_wzy  311              1148            1149              99.9       0.0                 100.0
4  gtrII   754              1449            1461              99.2       1.0                 99.9

sample     prediction                         ipaB  notes
my-sample  Shigella flexneri, novel serotype  +     this strain is ipaB+, suggesting that it retains the virulent invasion plasmid.

florathecat commented 1 year ago

I have a feeling it is Python acting up on the dictionary for flexneri serotyping, which we updated in version 2.0.2. For @dwinter there is an additional issue: the hit "Oac" was not removed from the list entering the prediction.
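As a rough illustration of the "Oac" point, the sketch below reads the isolate CSV posted by @dwinter and keeps only hits above a coverage cutoff. The function name `covered_hits` and the 1.0 cutoff are assumptions made for this example; they are not taken from shigatyper, whose actual filtering rule is not stated in this thread.

```python
import csv
import io

# Hit table as posted earlier in this thread.
ISOLATE_CSV = """\
,Hit,Number of reads,Length Covered,reference length,% covered,Number of variants,% accuracy
0,ipaH_c,4111,779.0,780,99.9,0.0,100.0
1,ipaB,20,712.0,1743,40.8,0.0,100.0
2,Sf_wzx,360,1253.0,1257,99.7,0.0,100.0
3,Sf_wzy,139,1106.0,1149,96.3,0.0,100.0
4,gtrI,562,1518.0,1521,99.8,0.0,100.0
5,Oac,5,0.0,1002,0.0,0.0,
6,Oac1b,692,999.0,1002,99.7,0.0,100.0
"""


def covered_hits(csv_text, min_pct_covered):
    """Return hit names whose '% covered' value is at least min_pct_covered."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row["Hit"]
        for row in reader
        if float(row["% covered"]) >= min_pct_covered
    ]


# The cutoff of 1.0 is illustrative only; it drops the 0%-covered "Oac" row.
kept = covered_hits(ISOLATE_CSV, min_pct_covered=1.0)
print(kept)
```

If "Oac" stays in the list, the remaining hits carry an extra element and can no longer match a signature such as gtrI plus Oac1b exactly.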

crashfrog commented 1 year ago

Fixed in 2.0.4.