Clinical-Genomics / microSALT

Microbial Sequence Analysis and Loci-based Typing pipeline for use on NGS WGS data.
GNU General Public License v3.0
2 stars 3 forks source link

Issue with reporting resistance OXA-48 #149

Open talnor opened 2 years ago

talnor commented 2 years ago

Describe the bug Perfect matches to OXA-48 are no longer reporter in microSALT, instead microSALT reports OXA-566 resistance of lower %identities.

Message from KS in Ticket 722237:

...angående att OXA-48 har börjat rapporteras som OXA-566 (identity 99,88%). Kör jag fastq-filerna från er i Resfinder så rapporterar den OXA-48 (100% match). Jag skulle då meddela ungefär när denna förändring inträffat och vad jag kan se så rapporteras prov 20ET200231 (Ticket 112544, v44 2020) som OXA-48 medan prov 20ET500236 (Ticket 935485, v47 2020) rapporteras som OXA-566. Är det något som ändrats i assembly under denna period?

To Reproduce Steps to reproduce the behavior:

  1. Run sample 20ET200231 or 20ET500236 with microSALT > 3.1.3 and microSALT 3.1.0

Expected behavior OXA-48 should be reported if this resistances has higher %identity.

Additional context Is OXA-48 in the raw blast results?

Adding some of the commits made during the period below. Has any affected OXA-48 reporting? Scrape_blast changes likely.

talnor commented 2 years ago

Currently blocked by installation issues

moahaegglund commented 1 year ago

No longer blocked, installation issues solved in #142.

pbiology commented 1 year ago

Will be resolved in Clinical-Genomics/CG-JASEN#5

samuell commented 2 months ago

Re-opening this, as it continues to cause problems, and we might need to prioritize it if Jasen is not completed soon.

samuell commented 2 months ago

I have been running Spades (4.0.0) assembly + Blasting against the resfinder DB semi-manually, using the same commands as used in microSALT, both with and without the --isolate flag to Spades, and the OXA-48 gene is reported correctly with both methods.

This thus strengthens the suspicion that the error might be in the scraping logic.

(Unless there are some differences between Spades 3.13.1 used in microSALT right now, and 4.0.0.)

samuell commented 2 months ago

(Unless there are some differences between Spades 3.13.1 used in microSALT right now, and 4.0.0.)

Tested this also with Spades 3.13.1 with the same result, so that is apparently not the issue.

samuell commented 2 months ago

I have now confirmed that an existing OXA-48 is removed in this part of the code (for cleaning up overlapping hits): https://github.com/Clinical-Genomics/microSALT/blob/2f8b19edd545c327506d629cdb2833146c10fa1a/microSALT/utils/scraper.py#L344-L410

Perhaps there is an overlap with OXA-566 ... we'll see.

samuell commented 2 months ago

And in one of these commits:

deed81b57a9301fa9cd87491e71d8f5445f59da9 Removed newline char in log ab280db6b4c1dd29fb44e04157fa9fb7d809e59d Update pull_request_template.md 7e38873f902c219a3db4dd65fa4dfcb817c2dcad Clarified logic in referencer 9757f6b5690da7dbfec5a1c1d3d3bcb89ab4c661 Updated and fixed bugs in pubmlst and ncbi downloads 27e4685e2210c4893b5b99e7c66ab1fc5f4d82ea Merge pull request #121 from Clinical-Genomics/resistance-scrape 170b013b12cb95ea14bd4bb8f5b115061e16a0d5 Version bump b4e1231a3901cc9181982f243d7ebbc0ebcbfb45 Added padder safety 5749721faf03743a971341c78fa1650d5e08e330 Quast backwards fix 2e626ef3be9440229001911710b0a81e68b8e7ba Rolledback expec locilength generous lookup 395a218cef8e43681e1d358ea4d68b1ff9a6f5bf Locilengths now ignores reference name a05ef8949444a89de8d81d25058123e744ae14ce Update init.py 3f1488cd655fc5e0fd75ef44ebc845475d7f5884 Attempt to make resistance searches more compatible

samuell commented 2 months ago

It is introduced with this one:

395a218cef8e43681e1d358ea4d68b1ff9a6f5bf Locilengths now ignores reference name

(Tested by cherry-picking the commits one by one into a test branch, using a real-world resistance blast output containing both OXA-566 and OXA-48 instead of the test one, and looking for occurrences of OXA-48 in caplog.text).

samuell commented 1 month ago

I think I'm starting to have a reasonable test now, that now fails (before implementing the fix), saying:

>     assert "blaOXA-48" in genes
E     assert 'blaOXA-48' in ["aph(3')-III", 'ant(6)-Ia', 'blaOXA-566']

(See https://github.com/Clinical-Genomics/microSALT/actions/runs/9898158769/job/27344305202 )

This is triggered by using some real-world example data where both OXA-48 and OXA-566 are included. But re-assuring to see that we reproduce the inclusion of blaOXA-566 there, as the issue described.

And, with the fix implemented, this test passes.

samuell commented 2 weeks ago

Just for the rec: As can be seen in the links above or here, the cause turned out to be a regex that was too unspecific, so that it would match more than the specified gene name... e.g. searching for blaVIM-4 would also match blaVIM-488 or similar. The fix makes the regex more specific by requiring a match against the underscore character coming after the gene name in blast output.