classification inconsitencies based on low coverage genes

gjordaopiedade commented 2 months ago

Hello, First I would like to thank you for your very well-built and comprehensive tool. :)

We are using shigatyper to characterize some Shigella and EIEC isolates but we are finding some inconsistencies that are difficult to interpret and are hoping you can help us make sense of it.

Our PCR typing and core genome phylogeny both suggest our isolate to be S.flexneri, yet shigatyper classifies it as EIEC+ (see the output below). We noticed that shigatyper probably tries to type it either as S.boydii or S. sonnei based on cadA, yet that gene has only 70% coverage and the presence of Sf_wzx suggests it to be S.flexneri.

We find this result hard to interpret and were wondering if, for these types of cases, we should ignore the shigatyper classification and go ahead with the phylogeny and gene presence/absence-based classification.

Thank you! Gonçalo

L0_V350090632_L04_46.tsv: sample prediction ipaB notes L0_V350090632_L04_46 EIEC + L0_V350090632_L04_46 is lacY+ or cadA+ but not one of the exception Shigella serotypes.;this strain is ipaB+, suggesting that it retains the virulent invasion plasmid.

L0_V350090632_L04_46-hits.tsv: Hit Number of reads Length Covered reference length % covered Number of variants % accuracy 0 ipaH_c 13893 780 780 100.0 0.0 100.0 1 ipaB 3456 1743 1743 100.0 0.0 100.0 2 EclacY 7 500 1254 39.9 0.0 100.0 3 cadA 24 1508 2143 70.4 0.0 100.0 4 Sf_wzx 1275 1253 1257 99.7 0.0 100.0 5 Sf_wzy 920 1147 1149 99.8 0.0 100.0 6 gtrI 1639 1521 1521 100.0 0.0 100.0 7 gtrII 16 1089 1461 74.5 0.0 100.0 8 Oac1b 970 1001 1002 99.9 0.0 100.0 9 Xv 7 569 1521 37.4 0.0 100.0

florathecat commented 2 months ago

Hello Gonçalo,

Thank you for sharing your experience. You are correct that the reason your strain was classified as EIEC by Shigatyper was because it was lacY+ cadA+. In our limited experience (~150 S. flexneri strains during development and validation, and ~100 strains later through communication with other users), we have not encountered a S. flexneri strain that were lacY+ cadA+. While not being able to rule out that S. flexneri can be lacY+ cadA+, I would like to point out that EIEC are also ipaB+ and Shigella and E. coli are of the same ancestry and some E. coli strains do possess the same O-antigen observed in Shigella serotypes. And most of the S. flexneri serotype modifying genes can be acquired through bacteriophage infection (e.g., gtrI, gtrII, Oac1b) or extrachromosomal element (e.g. Xv).

I would like to propose a potential scenario based on the information you shared. It appears that, in addition to lacY and cadA genes, your strain also has multiple S. flexneri serotype modifying genes that are sometimes mutually exclusive. For example, gtrI, gtrII are bacteriophage-acquired genes that modify the core S. flexneri O-antigen (serotype Y) so that the cell could not be super-infected subsequently by another bacteriophage. It is possible that your strain may be a mixture of multiple Shigella serotypes (e.g., S. sonnei or an E. coli isolate whose O-antigen gene is not in the database + several different S. flexneri serotypes). The telltale is the number of reads you got: wzx and wzy are AT-rich and less efficient in PCR amplification than lacY and cadA. However, you have only 7 reads of lacY and 24 reads of cadA, but a much greater number of 1275 reads of Sf_wzx and 920 reads of Sf_wzy. Because WGS is such a sensitive technique, I can see how this messes up the prediction.

It is strongly recommended that at least two streak plates be performed for single colonies before DNA is isolated for WGC.) Also, during DNA extraction, it is recommended that samples are spaced out at least one empty well from each other to avoid droplet contamination during pipetting. This has happened to me before so I am not saying this simply to insult your basic microbiology or MolBio skills.

Thank you again for using ShigaTyper and sharing with us your experience. I hope my two-cents-worth above helps.

Yun

gjordaopiedade commented 1 month ago

Hello Yun,

Thank you for your careful explanation of the meaning of these results. We think your explanation is the most likely cause. We did follow the isolation and extraction procedure you mentioned. I guess something might have gone wrong along the way. Interestingly, many of other studies' published isolates seem to have the same issue which didn't help with putting the finger on the issue. We will probably try to assemble the reads to try to assess the issue further before discarding or re sequencing the samples.

Thank you for your guidance! Best wishes, Gonçalo

florathecat commented 1 month ago

Hi Gonçalo,

I am glad that I can be of some assistance to the greater Shigella community. :)

Contamination of small amount of sequencing reads is indeed a common problem for a highly sensitive assay. Do you have large sequencing files? My experience is that fastq.gz files of ~100 MB each are more than enough to get a good prediction. Maybe you can get around it by resampling your files into smaller files (and do a bootstrap of 100 resampling for a consensus?)?

At the time of building Shigatyper, I did consider the contamination issue. If the ratio of number of hits (say ipaH/cadA) is a good indicator, maybe those contaminating hits can be screened out using a threshold ratio. My problem was that most publicly available WGS files I could find are S. sonnei and S. flexneri 2 or 3. I didn't have enough samples for the rare serotypes for a comprehensive knowledge base. Hopefully somebody someday later can think of a way to solve the problem.

Closing the issue for now. Best of luck with your research.

Yun

CFSAN-Biostatistics / shigatyper

classification inconsitencies based on low coverage genes #18