Mykrobe-tools / mykrobe

Antibiotic resistance prediction in minutes
MIT License
106 stars 28 forks source link

Mykrobe overcounts homopol deletions when making probes #148

Open iqbal-lab opened 2 years ago

iqbal-lab commented 2 years ago

The probe generator is making one probe for each 1bp and 2bp deletions in katG and pncA. But if you make a all 1bp deletions in a homopolymer, they are essentially the same. eg deleting one A from AAA could happen in 3 places and give the same sequence. This leads to annoying things where we report >1 mutation being detected, when it is basicvally the same thing detected twice/whatever.

Workaround suggestions

  1. compare all the indel/frame-shift probes against each other and dedup
  2. pay attention to whether you are in a homopolymer when creating the 1bp and 2bp deletions, and don't overcreate probes
mbhall88 commented 2 years ago

For context. We have this example

            "Isoniazid": {
                "predict": "R",
                "called_by": {
                    "katG_GC1037G-GC2155074C": {
                        "variant": null,
                        "genotype": [
                            1,
                            1
                        ],
                        "genotype_likelihoods": [
                            -711.0227088947968,
                            -417.9483717274669
                        ],
                        "info": {
                            "coverage": {
                                "reference": {
                                    "percent_coverage": 100.0,
                                    "median_depth": 7,
                                    "min_non_zero_depth": 6,
                                    "kmer_count": 175,
                                    "klen": 21
                                },
                                "alternate": {
                                    "percent_coverage": 100.0,
                                    "median_depth": 15,
                                    "min_non_zero_depth": 13,
                                    "kmer_count": 253,
                                    "klen": 18
                                }
                            },
                            "expected_depths": [
                                24
                            ],
                            "contamination_depths": [],
                            "filter": [],
                            "conf": 293
                        },
                        "_cls": "Call.VariantCall"
                    },
                    "katG_CC1038C-GG2155073G": {
                        "variant": null,
                        "genotype": [
                            1,
                            1
                        ],
                        "genotype_likelihoods": [
                            -727.5071540961122,
                            -377.76816030583126
                        ],
                        "info": {
                            "coverage": {
                                "reference": {
                                    "percent_coverage": 100.0,
                                    "median_depth": 7,
                                    "min_non_zero_depth": 6,
                                    "kmer_count": 160,
                                    "klen": 21
                                },
                                "alternate": {
                                    "percent_coverage": 100.0,
                                    "median_depth": 15,
                                    "min_non_zero_depth": 13,
                                    "kmer_count": 253,
                                    "klen": 18
                                }
                            },
                            "expected_depths": [
                                24
                            ],
                            "contamination_depths": [],
                            "filter": [],
                            "conf": 350
                        },
                        "_cls": "Call.VariantCall"
                    },
                    "katG_CC1039C-GG2155072G": {
                        "variant": null,
                        "genotype": [
                            1,
                            1
                        ],
                        "genotype_likelihoods": [
                            -729.808268296947,
                            -372.51398695693916
                        ],
                        "info": {
                            "coverage": {
                                "reference": {
                                    "percent_coverage": 100.0,
                                    "median_depth": 7,
                                    "min_non_zero_depth": 6,
                                    "kmer_count": 158,
                                    "klen": 21
                                },
                                "alternate": {
                                    "percent_coverage": 100.0,
                                    "median_depth": 15,
                                    "min_non_zero_depth": 13,
                                    "kmer_count": 253,
                                    "klen": 18
                                }
                            },
                            "expected_depths": [
                                24
                            ],
                            "contamination_depths": [],
                            "filter": [],
                            "conf": 357
                        },
                        "_cls": "Call.VariantCall"
                    }
                }
            },

The three mutations in the JSON above could concieveably be the same deletion...or a 2bp deletion I guess.

GC1037G
CC1038C
CC1039C

I guess it probably is a single 1bp deletion - at 1038.

Although I'm a bit confused about how the mutations work. Which of the two bases is the one at the position?

If I pull out these three bases, with one base flanking, from the katG reference, I get G GCC T. So if the position describes the first base, then 1039 should read CT1039C right? But if the second base describes the position, it's the same problem right?

martinghunt commented 2 years ago

The sequence at 1038-1040 in katG is CCC. image

mbhall88 commented 2 years ago

Hmmm, wonder why I got a different sequence.