Mykrobe-tools / mykrobe

Antibiotic resistance prediction in minutes
MIT License
102 stars 26 forks source link

species and lineage predictions #181

Open alantsangmb opened 6 months ago

alantsangmb commented 6 months ago

Hello, I have recently obtained data on several Mtb lineage 5 and M. canetti strains, and I have been using Mykrobe v0.13.0 to predict their species and lineage.

In the case of the Mtb lineage 5 samples (ERR702413, ERR751348 and ERR1023217), Mykrobe consistently predicted the species as Mycobacterium tuberculosis instead of Mycobacterium_tuberculosis_variant_africanum, with a lineage assignment of lineage 5. I noticed that the reported species_per_covg for these samples was relatively low, around 22 to 24. I am wondering if this discrepancy could be attributed to the targeted probe used for detecting M. africanum, which might have been designed based on lineage 6. Is there a parameter available that would allow me to set a cutoff for the species_per_covg?

Additionally, I examined the M. canetti data (ERR266120) using Mykrobe, and I found that it correctly identified it as Mycobacterium canettii. However, the predicted lineage was reported as lineage 4 instead of an "unknown" lineage, as I have seen with other M. canetti strains. I would greatly appreciate it if you could shed some light on the possible reasons for this discrepancy. I must admit that I am not very familiar with the lineage prediction algorithm employed by Mykrobe.

Furthermore, I observed that the "lineage_per_covg" and "lineage depth" were displayed as NA for all samples I have tested. I am interested in obtaining a value for this parameter. Could you kindly advise me on how I can acquire and interpret this information? Thank you very much for your assistance, and I look forward to your response.

iqbal-lab commented 6 months ago

Hi there, @martinghunt, who recently updated our lineage scheme , is off sick at the moment, apologies for the delay, but we will reply ASAP.

alantsangmb commented 6 months ago

Thank you for your prompt response. There is no need for any apologies. I genuinely appreciate your time and assistance thus far. Please take all the time you need, and I hope @martinghunt recover quickly. Thank you once again for your support.

martinghunt commented 6 months ago

Thanks @alantsangmb. I'll have a look into it this week ...

martinghunt commented 5 months ago

Thanks for reporting these bugs. I'm still working on it, but here's an update...

Yes you're right, the africanum probes are lineage 6. It's turned out to be non-trivial to fix but will get there. We've got an automated method to build from GTDB genomes, but simply forcing a lineage 5 sample in there resulted in breaking a bunch of other species calls. It's going to need a fair bit more work and testing.

Re the canetti run ERR266120: are you using the CSV file? I'd recommend using the JSON file instead because you get more details. This does indeed need fixing, but is clearer what is happening from the JSON. The SNP call that defines lineage4 is low quality and fails mykrobe's filters, but is incorrectly being used to call the lineage, which means it ends up in the CSV file. The JSON has this:

"filter": [
    "LOW_PERCENT_COVERAGE",
    "LOW_TOTAL_DEPTH"

Similar comments for your question about "lineage_per_covg" and "lineage depth": I would use the JSON file instead because it has the proper details. We're considering removing those two columns from the CSV file.

martinghunt commented 5 months ago

For the record (and future me reading this), attaching JSON files of mykrobe version 0.13.0 run using the at time of writing latest panel 20230928/202309.

ERR751348.myk_0.13.0.json ERR266120.myk_0.13.0.json ERR702413.myk_0.13.0.json ERR1023217.myk_0.13.0.json

alantsangmb commented 5 months ago

Thank you so much for the detailed explanation. Yes, I was using CSV file. In the future, I will try using JSON file.

I am currently trying to comprehend the contents of the JSON files. I was wondering if the percent_coverage for the reference and alternative under lineage call should always add up to 100%. It seems that is not the case for ERR1023217 and ERR266120.

Regarding ERR266120, I noticed it has 40% for the alternative and 0% for the reference. Do you happen to know where the remaining 60% went? Moreover, I am interested in understanding the conditions that would trigger the filters for "LOW_PERCENT_COVERAGE" and "LOW_TOTAL_DEPTH." I am also curious about the reason behind the median depth being 0, while the minimum non-zero depth is 141 for the "alternate". Sorry for hitting you with so many questions at once.

iqbal-lab commented 5 months ago

Quick comment here.

I was wondering if the percent_coverage for the reference and alternative under lineage call should always add up to 100%

Definitely not. These numbers are telling what what % of the reference probe has been observed, and what % of the alt probe has been observed. The probe is a sequence, converted to a list of kmers, and we ask what % of the kmers in that sequence have been seen in the data.

Regarding ERR266120, I noticed it has 40% for the alternative and 0% for the reference. Do you happen to know where the remaining 60% went?

It means probably that the alt is present, but there is a SNP nearby

I am also curious about the reason behind the median depth being 0, while the minimum non-zero depth is 141 for the "alternate".

If the probe is 60 kmers say, and 59 of them are never seen (so coverage on those kmers is zero) but the last kmer is seen 141 times, then the median depth is zero , but the minimum non-zero depth is 141