GTB-tbsequencing / mutation-catalogue-2023


Thanks for this resource, but could you provide a future-proofed, machine-readable file? #5

Closed jeremyButtler closed 3 months ago

jeremyButtler commented 4 months ago

I first want to thank you for providing such a valuable resource.

Right now I am writing a program in C to use this catalogue, and I find that the Excel sheet is not easy to read or process, while the VCF does not carry enough information. The VCF is also not a great format for the information you are trying to convey: it often merges items together in ways that require extra processing steps, which makes processing harder.

I was able to use the 2021 catalog when I saved the genome indices tab as CSV, but I cannot do this with the current catalog. I would appreciate it if you also provided a flat file, such as a TSV or CSV, that I could use, and if that flat file appeared in all future catalogs. That way, if anyone ever uses my program, it would be future-proofed. I am sure other people would appreciate this as well, since it would require less maintenance for them down the road.

Here is an example format I thought of. It would allow some flexibility for future updates, would not require an Excel reader to process, and would be rigid enough that an upgrade does not break other people's code, which is what is currently happening to me.

For this TSV, I would have every column hold one piece of data:

- column 1: the reference accession number
- column 2: the gene name
- column 3: the position on the reference of the first base in the pattern
- column 4: the position of the first base in the codon (NA if the pattern is not in a codon)
- column 5: the codon number
- column 6: whether the gene is on the forward or reverse strand (so I know which bases to take for the codon)
- column 7: the reference pattern
- column 8: the AMR pattern
- column 9: the reference amino acid (single-letter code; this is for machines, not humans)
- column 10: the AMR amino acid (single-letter code)
- column 11 up to the "end-antibiotic-classes" column: one antibiotic per column, holding just the grade (no text; this is for machines, not humans) or NA
- the "end-antibiotic-classes" column itself, which closes the antibiotic section and is marked with an "*" in each data row
- the remaining columns: anything extra that might be added in the future or that other users might want

Here is my logic. The first through tenth columns never change and hold the information that will be used most often. The antibiotic columns are structured so that it is easy to add a future antibiotic class while staying easy to parse: I just keep reading the header until I hit the "end-antibiotic-classes" column, or the "*" in each mutation row (a rough parsing sketch follows the example below). This lets you add or remove antibiotic drugs/classes in the future without breaking users' processing functions. The final columns then hold extra information that might not always be used and can easily be ignored by reading to the end of the line or fetching the next line from the file. As a rule, once a column is added it should never be removed unless you know that no one uses it; otherwise you risk breaking people's pipelines down the road.

For the reference and AMR patterns I need enough bases that I can easily identify the AMR variant without having to know whether it is an insertion or a deletion. You did this in the 2021 catalog, and you may have done it in the 2023 catalog as well; I am not sure.

Here is an example of the format. It may not be the best example, but it should give you an idea of what I am thinking.

reference gene-name reference-position codon-position-on-reference codon-number-gene gene-forward-or-reverse reference-pattern AMR-pattern Reference-amino-acid AMR-amino-acid antibiotic-1 antibiotic-2 antibiotic-3 ... antibiotic-n end-antibiotic-classes other useful data ...
NC000962.3 embB 4246533 4246433 10 F GA CC R T NA 5 1 ... NA * other useful data ...
NC000962.3 newGene 4243533 4243435 30 R A C R T NA 5 1 ... 5 * other useful data ...
NC000962.3 gene2 3243533 NA NA R A C NA NA NA 5 NA ... NA * other useful data ...
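
To make the parsing step concrete, here is a minimal sketch of how a consumer could read such a file, assuming tab separation and a hypothetical file name (this only illustrates the "read until the '*' marker" logic; it is not code against any real catalogue file):

# hypothetical-catalog.tsv is a made-up name for a file in the proposed format
awk -F'\t' '
    NR == 1 { next }                        # skip the header row
    {
        printf "%s\t%s", $2, $8;            # gene name and AMR pattern
        for (i = 11; i <= NF && $i != "*"; i++)
            printf "\t%s", $i;              # one grade (or NA) per antibiotic column
        printf "\n";
    }
' hypothetical-catalog.tsv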

The nice thing about this format is that it is very easy to process and needs no extra libraries, which means fewer dependencies for others to worry about. It would also allow some flexibility for future updates while being rigid enough not to break older code when the file is updated or expanded. That way, maintenance effort goes into leveraging newly added columns rather than into just getting things to work again.

Thanks for taking the time to read this and for producing this great resource.

sachalau commented 4 months ago

Dear Jeremy,

Thank you very much for your long comment.

Please now find in the results folder two txt files, one for each of the sheets present in the Excel output. Columns are tab-separated.

Using those two, plus the VCF if you feel the need, you should have everything you need to import the result files programmatically.

As a side note, I would encourage you to use dedicated libraries for dealing with xlsx files in the future. I'm not familiar with C, but I have seen that a couple exist. In Python, I would recommend xlrd or openpyxl.

I also thank you for your additional proposition. At the moment, I believe that all the files provided in the outputs, including the genomic coordinates and the VCF, are sufficient to ensure a proper import by third parties using programmatic tools. If the logic you implement relies on the genomic coordinate values to identify markers of resistance, we will make sure it stays compatible with future updates of the catalogue.

jeremyButtler commented 4 months ago

Thanks for taking the time to reply; I can understand your point. However, I would never have raised this idea if those files had solved the problem. I could easily have made them myself using LibreOffice, so it would not have been worth my time to mention anything.

You are right that Excel readers exist for C. However, each one is another dependency that can break, complicate the user's life, or cause licensing issues for people downstream of me, so I try to keep dependencies to a minimum.

I will take some time to figure out how to make this database work. And again, thanks for providing this resource. It really is very valuable.

Thanks again for taking the time to respond, and for responding so quickly. It means a lot to know where I stand and what I need to do to move forward. For now I will leave this open, just in case others think it is worth the time.

jeremyButtler commented 3 months ago

For anyone who would like a file like the one I described here, here is the link to my converted TSV file.

https://github.com/jeremyButtler/who2023-TB-catalog-as-tsv

sachalau commented 3 months ago

Thanks Jeremy for the link to your resource.

I reviewed your data and documentation quickly, and I'm concerned that you are still misinterpreting the data presented in the catalogue. I won't have time to delve too much into the details or repeat information available elsewhere, but I'd just like to re-emphasize this part:

> From my issue with the WHO I was pointed to there using this database pdf. I did not understand this enough when I first read it, but it appears that this was done on purpose to allow multiple grades to be matched up (I am guessing grade 1 and 2 to be mixed). I have no idea why they did this and it does not make much sense to me.

As mentioned in the PDF and in an answer to another issue, duplicated entries are completely expected in the Genomic_coordinate data, and their point is certainly not to allow multiple grades to be matched up. Their point is to allow multiple different genomic variants to be matched to the same graded_variant. The primary reason variants are duplicated is the redundancy of the genetic code. I hope this example will clarify that:

variant chromosome position reference_nucleotide alternative_nucleotide
katG_p.Ser315Thr NC_000962.3 2155167 GC TG
katG_p.Ser315Thr NC_000962.3 2155167 GC AG
katG_p.Ser315Thr NC_000962.3 2155167 GC CG
katG_p.Ser315Thr NC_000962.3 2155168 C G
katG_p.Ser315Thr NC_000962.3 2155168 CTG GTT

These 5 genomic variants all lead to the same amino acid change. So any of these 5 genomic variants should be graded identically, according to whatever (drug, grading) pairs are associated with katG_p.Ser315Thr (there could be more than one drug, although in this case there is only one grading associated with katG_p.Ser315Thr).

That's one reason variants are duplicated there. The second is that all unseen frameshift, stop-gained, and start-lost variants for any gene subject to an LoF expert/epistasis rule are merged into gene_LOF. The reasoning is identical: the aim is to help the user match any LoF variant found in their data to the final grade.
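
In both cases the practical upshot is the same: look up the grade by the variant name and attach it to every genomic row carrying that name. A rough sketch of that join (the file and column names here are placeholders, not the actual output files):

# placeholders: gradings.tsv = "variant<TAB>grade"; coordinates.tsv = genomic coordinates keyed by the same variant name
awk -F'\t' '
    NR == FNR { grade[$1] = $2; next }       # first file: remember the grade for each graded variant name
    $1 in grade { print $0 "\t" grade[$1] }  # second file: append that grade to every matching coordinate row
' gradings.tsv coordinates.tsv > coordinates_with_grades.tsv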

Finally, regarding your Table 4 and the variants missing from the Genomic_coordinates: we agree that deletion variants and some LoF variants are missing from the Genomic_coordinates.

However, if you search for any of the other variants listed there (i.e. the stop-lost variants, such as Rv0010c_p.Ter142Cysext*?), you will find them. I think your matching logic incorrectly handles the "?" character, maybe because you are using regex incorrectly, or for some other reason.

I hope it will help you make sense of it.

jeremyButtler commented 3 months ago

Thanks for your feedback. It is interesting to know that the variant IDs use regular expressions. I thought the '*?' referred to a stop and then any AA change, and that this same pattern would appear in the genome indices tab. I guess I got this idea because Ter215Arg*? referred to a stop lost, p.Trp36* referred to a stop gained, and '*' is often a shorthand symbol for a stop codon. I guess I was too eager to read the database and needed to look more closely at the table on page 91 of the PDF catalog and to read the "using this catalog" document more carefully. I will remove that section of the README.

It also sounds like I need to treat the LoF entries as general frameshifts. I will also need to handle the *? symbols if they add antibiotic resistance. It sounds like I need to sit down and take some time to think after I get some other things done.

Thanks for taking the time.

It means a lot to me that you took the time to take even a brief look and point out my mistakes. Thanks again.

sachalau commented 3 months ago

Sorry, I didn't mean to imply that the variant names used regex per se.

The reason you incorrectly reported that entries such as pncA_p.Ter187Argext*? were missing from the genomic_coordinate files is the following.

In regex, putting those two characters *? one after the other does not search for the literal characters "*?" (you can check on https://regex101.com/: input *? on its own and the pattern is flagged as invalid).

So in this command, when grep reads the patterns from the tmp-filt.tsv file:

grep -v -f tmp-filt.tsv tmp.tsv | wc -l

all the patterns that contained *? failed to match the entries they were meant to match, so by definition those entries could not be found in your searched file.

(Please also note that in this case a stop variant such as gid_p.Cys52* is also interpreted as a regex pattern, which means grep is not looking for the character * at the end of the pattern but for the character "2" repeated any number of times (0 or more). Similarly, the . is not interpreted as a literal character but as the regex "any character" operator.)

To correct this behaviour, you need to escape all special characters that are regex operators, so that you are crafting the patterns you actually intend to search for. The fix is as follows:

# backslash-escape *, ? and . in the patterns before they are fed to grep
awk \
    '{print $1;}' \
    < WHO-2023-TB-tab2-indices.tsv \
  | sed "s/\*/\\\*/g" \
  | sed "s/?/\\\?/g" \
  | sed "s/\./\\\./g" \
  > tmp-filt.tsv;

That way, none of . ? * are interpreted as regex operators.
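
As a side note, an equivalent option is to skip the escaping altogether and tell grep to treat every pattern as a fixed string:

# -F makes grep take each pattern in tmp-filt.tsv literally, so *, ? and . need no escaping
grep -v -F -f tmp-filt.tsv tmp.tsv | wc -l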

Finally, regarding our variant names: as I mentioned, they are crafted via SnpEff (https://pcingola.github.io/SnpEff/), which is based on HGVS (https://hgvs-nomenclature.org/stable/).

You can refer to their documentation, but as I mentioned, HGVS does not include regex concepts in its nomenclature; it just happens to use characters that are also regex operators and that must be handled correctly when using grep.

jeremyButtler commented 3 months ago

Thanks for this correction; it gave me a bit of a surprise concerning grep. I had thought that grep only did exact pattern matching unless you specified -e or -E for regular expressions. Once I removed the "*?" from the search patterns, grep did a better job. To check, I searched for a specific ID that ended in "*?" and found that grep could only pull it out when the "*?" was removed.
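
Concretely, the check looked something like this (tmp.tsv here stands in for whichever file holds that ID; the point is only how grep reads the pattern):

grep 'pncA_p.Ter187Argext*?' tmp.tsv    # no hit: the trailing *? is read as regex, not as literal characters
grep 'pncA_p.Ter187Argext' tmp.tsv      # the entry is found once the *? is dropped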

Thanks for updating me on this. It makes my life a little easier when I go back to fix things.