GTB-tbsequencing / mutation-catalogue-2023

MIT License

Some minor bugs in the catalog #6

Closed jeremyButtler closed 3 months ago

jeremyButtler commented 3 months ago

Thank you for your quick response on my last issue.

This is not really an issue, but rather a couple of errors in the 2023 database that I noticed while converting it. I am only mentioning them to make sure that you are aware of them.

The first is that there are entries in the master catalog that have a "See genome indices" note but cannot be found in the genome indices tab. These are all of the form "gene_deletion". Here are the four I know of. All four entries confer resistance to at least one antibiotic (I ignored anything that did not confer antibiotic resistance, so there might be more).

The other minor issue I found is that the "gene_LoF" variants have no unique identifier in the genome indices tab. I can find them in the genome indices tab, but I have no way of telling them apart from other LoFs for the same gene. I suspect that these LoF variants get duplicated when there are multiple "gene_LoF" tags for the same gene.

Sorry about being the person who is raising issues.

Thanks for all the hard work you all do.

sachalau commented 3 months ago

Dear Jeremy,

Thank you for your report.

Did you have a look at the instructions for use for these files, available as a PDF in the same folder?

https://github.com/GTB-tbsequencing/mutation-catalogue-2023/blob/main/Final%20Result%20Files/Instruction%20of%20use%20for%20incorporation%20of%20the%20mutations%20catalogue%20version%202%20results%20into%20bioinformatic%20pipeline.pdf

Quoting:

We do not provide genomic-variants for deletion graded-variants (for instance, for the graded-variant “pncA_deletion”)

So you are right that the deletion graded variant entries do not appear in the genomic_coordinates sheet.

Regarding your second issue, it is correct that gene_LoF entries have more than one genomic_coordinate entry in the second sheet. As a matter of fact, gene_LoF entries are not the only ones duplicated: plenty of missense variants appear more than once as well.

Quoting again:

Each graded-variant can be linked to more than one genomic-variant (for instance, different genomic-variants can lead to identical missense)

For LoF graded variants, this is expected because any start_gained, frameshift or start_lost variants are included in those entries. So, for instance, pncA_LoF will appear many times in the genomic_coordinate file because every start_gained, frameshift and start_lost variant that we have observed in our database is listed there. If you identify one of them in your data, you will know that it is a pncA_LoF, and you can then match it to the appropriate grading.

I hope that resolves your concerns.
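A minimal sketch of that matching step, in Python. It is only an illustration under stated assumptions: the row layout, the field names (`variant`, `position`, `ref`, `alt`, `drug`, `grading`), the helper names, and every coordinate and grade value below are hypothetical placeholders, not the actual column headers or contents of the catalogue files.

```python
from collections import defaultdict

# Placeholder rows standing in for the genomic coordinates sheet.
# Positions, alleles, drug and grade values are invented for illustration.
GENOMIC_ROWS = [
    {"variant": "pncA_LoF", "position": 100, "ref": "A", "alt": "AG"},
    {"variant": "pncA_LoF", "position": 250, "ref": "CT", "alt": "C"},
    {"variant": "pncA_LoF", "position": 400, "ref": "T", "alt": "C"},
]

# Placeholder rows standing in for the master sheet (one per graded variant).
MASTER_ROWS = [
    {"variant": "pncA_LoF", "drug": "example_drug", "grading": "example_grade"},
    {"variant": "pncA_deletion", "drug": "example_drug", "grading": "example_grade"},
]


def build_coordinate_index(genomic_rows):
    """Map each genomic variant (position, ref, alt) to its graded-variant names."""
    index = defaultdict(list)
    for row in genomic_rows:
        key = (row["position"], row["ref"], row["alt"])
        if row["variant"] not in index[key]:
            index[key].append(row["variant"])
    return dict(index)


def grade_observed(observed, index, master_rows):
    """Return (graded variant, drug, grading) tuples for one observed variant."""
    names = index.get(observed, [])
    return [
        (row["variant"], row["drug"], row["grading"])
        for row in master_rows
        if row["variant"] in names
    ]


def names_without_coordinates(master_rows, genomic_rows):
    """Graded variants with no genomic entry (deletion entries are expected here)."""
    with_coordinates = {row["variant"] for row in genomic_rows}
    return {row["variant"] for row in master_rows} - with_coordinates


index = build_coordinate_index(GENOMIC_ROWS)
print(grade_observed((250, "CT", "C"), index, MASTER_ROWS))
print(names_without_coordinates(MASTER_ROWS, GENOMIC_ROWS))
```

The lookup key is the genomic variant rather than the graded-variant name, so the repeated pncA_LoF rows do not need a unique identifier: each one is simply a different key pointing at the same grading. The last helper lists graded variants that have no coordinate rows at all, which is where the deletion entries would show up.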

jeremyButtler commented 3 months ago

Thanks for your reply.

I did read the PDF, but I was more focused on figuring out how to parse the database. Also, at that time I did not have a good enough grasp of the naming system or the catalog to understand that section.

Thanks again for taking the time.