GTB-tbsequencing / mutation-catalogue-2023

MIT License

Some more minor bugs #7

Closed jeremyButtler closed 3 months ago

jeremyButtler commented 3 months ago

Thanks for posting your catalog on GitHub so we can let you know about bugs. That way, if you ever decide to do a next edition, you can take the bugs from the previous edition into account.

One thing I found is that a few of the variants have IDs that are amino acid IDs (gene_p.), but have sequences and positions outside of the gene's reading frame. So, there might be a big deletion for a frame shift, or an extra base added on to the end of a stop codon, that is not part of the gene named in the variant ID. There is a change there, but there is also some noise. I found these to be a bit hard to process.

Here is a table of the variants I know of that have this issue.

| Variant ID | Number of times | Effect |
|---|---|---|
| ethA_p.His281fs | 1 | frame shift |
| ethA_p.Ile325fs | 1 | frame shift |
| ethA_p.Leu295fs | 1 | frame shift |
| ethA_p.Thr353fs | 1 | frame shift |
| ethA_p.Val489fs | 1 | frame shift |
| pncA_p.Ter187Argext*? | 1 | removes stop |
| pncA_p.Thr61fs | 1 | frame shift |
| tlyA_p.Met1? | 14 | removes start |
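
For readers who hit the same problem, here is a minimal sketch of how the IDs in the table could be split into gene, protein change, and effect before any coordinate checks. The function name `classify_protein_variant` and the three patterns are illustrative assumptions based only on the IDs listed above; they are not part of the catalogue or of any tool mentioned in this thread.

```python
import re


def classify_protein_variant(variant_id):
    """Split a 'gene_p.Change' ID and label its effect.

    Returns (gene, change, effect), or None when the ID is not a
    protein-level ID (no '_p.' separator). The patterns cover only
    the ID shapes seen in the table above.
    """
    gene, sep, change = variant_id.partition("_p.")
    if not sep:
        return None
    if change.endswith("fs"):
        effect = "frame shift"
    elif re.fullmatch(r"Met1\?", change):
        effect = "removes start"
    elif re.fullmatch(r"Ter\d+[A-Za-z]{3}ext\*\?", change):
        effect = "removes stop"
    else:
        effect = "other"
    return gene, change, effect


if __name__ == "__main__":
    for vid in ("ethA_p.His281fs", "pncA_p.Ter187Argext*?", "tlyA_p.Met1?"):
        print(vid, classify_protein_variant(vid))
```

Variants labelled this way can then be routed to separate handling, since their genomic sequence may extend past the gene.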

Again, sorry about being the noisy person. I am using the catalog in my projects and started to notice things that break my programs. I have dealt with this issue, but figured I would make sure you knew that this was happening. That way you can be aware of it when you build the next catalog, if you are planning one.

Thanks again for this resource.

jeremyButtler commented 3 months ago

Thanks for your reply to my previous comment. It looks like the PDF answered my question. Still it seems odd to me that these genomic coordinates were not included for grade 3, 4, and 5 variants, but also included a reference. Maybe in the next edition, just add a tag saying no genome indices?

sachalau commented 3 months ago

Dear Jeremy,

Thank you for your feedback. I'm sorry if I'm misunderstanding your various (independent) points, but I'll still try to complement them.

> Still it seems odd to me that these genomic coordinates were not included for grade 3, 4, and 5 variants,

You should find that most grade 3-5 variants are actually included in Genomic_coordinates. All variants that are in Catalogue_master_file are present in Genomic_coordinates, except for all deletions and some unseen LoF variants. I think you are misinterpreting this sentence:

> We do not provide genomic-variants for LoF graded-variant that are never classified as group 1 or 2 for any drug or that are not subject to an epistatis rule

We made that choice because there is no actionable reporting associated with those variants (unseen LoFs that are not associated with a group 1-2 grading or an epistasis rule), so we did not want to add unnecessary entries to an already lengthy catalogue.
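
The coverage statement above (every variant in Catalogue_master_file also appears in Genomic_coordinates, apart from deletions and some unseen LoF variants) can be checked with a simple set difference. This is only a sketch: it assumes the variant IDs from each sheet have already been loaded into plain Python collections, and the IDs in the usage example are made-up placeholders, not catalogue entries.

```python
def variants_without_coordinates(master_variants, coordinate_variants):
    """Return the sorted variant IDs that are in the master list
    but have no entry in the genomic coordinates list."""
    return sorted(set(master_variants) - set(coordinate_variants))


if __name__ == "__main__":
    # Hypothetical IDs, used only to show the call.
    master = ["geneA_p.Ala1Val", "geneA_deletion", "geneB_LoF"]
    coords = ["geneA_p.Ala1Val"]
    print(variants_without_coordinates(master, coords))
```

If the statement holds, everything returned for the real sheets should be a deletion or an unseen LoF variant.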

As for variants that fall outside of gene boundaries, those are correct; this happens, for instance, with deletions that overlap gene sequences. Our annotation tool (SnpEff) predicted that these still have an effect on the protein if it's still expressed (a frameshift, the loss of a stop codon, etc.).
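
To make the boundary case concrete, the sketch below classifies how a deletion relates to a gene interval. It assumes 1-based inclusive coordinates and a VCF-style, left-anchored deletion (the alternate allele is the leading base or bases of the reference allele). The catalogue's actual coordinate conventions are not stated in this thread, so treat both as assumptions; the function name and the toy gene interval 100-200 are hypothetical.

```python
def deletion_overlap(pos, ref, alt, gene_start, gene_end):
    """Describe where the deleted bases fall relative to a gene.

    pos is the position of the first base of ref. The first len(alt)
    bases are treated as anchor bases that are kept, so the deleted
    bases run from pos + len(alt) to pos + len(ref) - 1.
    """
    if len(ref) <= len(alt):
        raise ValueError("not a deletion")
    del_start = pos + len(alt)
    del_end = pos + len(ref) - 1
    if del_end < gene_start or del_start > gene_end:
        return "outside"
    if del_start >= gene_start and del_end <= gene_end:
        return "inside"
    if del_start < gene_start and del_end > gene_end:
        return "spans gene"
    return "overlaps boundary"


if __name__ == "__main__":
    # Toy gene at 100-200; a 10-base deletion starting near its end.
    print(deletion_overlap(195, "A" + "C" * 10, "A", 100, 200))
```

A result of "overlaps boundary" or "spans gene" corresponds to the situation described above, where a protein-level ID carries sequence that lies partly outside the gene.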

jeremyButtler commented 3 months ago

Thanks for letting me know. It helps my understanding a lot.