biocommons / hgvs

Python library to parse, format, validate, normalize, and map sequence variants. `pip install hgvs`
https://hgvs.readthedocs.io/
Apache License 2.0
240 stars 94 forks source link

Non-human genomes in HGVS/UTA #598

Closed sachalau closed 4 months ago

sachalau commented 4 years ago

Hello @reece,

Following issue #569, I was wondering whether you had any news regarding the work of incorporating other genome references into UTA then using it for HGVS Validation with this package.

At the moment I'm using snpEff to annotate my variants and HGVS for Parse() and IntrinsicValidation(). However I'm wondering whether I could use UTA/HGVS directly because for some variant snpEff gives incorrect results.

Regarding my particular bacterial genome of interest, one thing that is different from the annotation is that it does not have transcript data. In the genbank/gff you go directly from "gene" with a "locustag" as accession to a CDS for protein with "NP" accessions. For non coding genes, the accession for the gene and for the transcript ("rRNA", "tRNA" or "ncRNA" in the genbank/gff) is identical.

I guess one trick would be to use the same accession for transcript and gene as intermediate between gene and CDS. It would be a good enough approximation of the reality for prokaryotic genomes.

The alignment with spalign would then be trivial but I suspect that UTA needs the spalign output files for loading so it might still need to be performed. However is that the Accession for the transcript will not be formatted as "NM_" but as LocusTag, which theoritically won't be allowed by HGVS, although LocusTags are stable.

The last point is that although human and bacterial genetic code is almost equivalent, there are more start codons allowed in bacterial genomes. Compare for humans :

    AAs  = FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
  Starts = ---M------**--*----M---------------M----------------------------
  Base1  = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
  Base2  = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
  Base3  = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG

and bacterias :

    AAs  = FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
  Starts = ---M------**--*----M------------MMMM---------------M------------
  Base1  = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
  Base2  = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
  Base3  = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG

Bacterias have 3 additional start codons.

To sum up, for using bacterial genome with UTA/HGVS I would need to :

  1. Incorporate genome/transcript/protein data into UTA
  2. Modify my transcript accession to force compatibility with HGVS (which could be something as awful/ugly as prefixing with "NM_" tags and suffixing with fake version numbers...)
  3. Handle the additional start codons.

My estimation is that 1) and 2) would not be too hard (could I use documentation provided here to load data https://pythonhosted.org/uta/db_loading.html ?) but 3) might be because I suspect that the way start codons are handled is coded somewhere in hgvs or in bioutils.sequences.

Does that make sense for you?

Thanks for any help you can provide and all the good work already with python HGVS that is very valuable.

(I hope my message still make sense after the multiple edits)

reece commented 4 years ago

Hi @sachalau : Your request was very clear. Thank you for raising this issue so thoughtfully!

You've analyzed the problem well. These are the barriers:

Finally, let me say that loading UTA is not easy. I desperately want to overhaul and generalize it in order to support more species and custom builds, but there's no financial support for that (yet).

sachalau commented 4 years ago

Thanks for your answer. Here is an example I have been working on, just a gene of my organism of interest, defined in gbk. The reference is NC_000962.3

     gene            1673440..1674183
                     /gene="fabG1"
                     /locus_tag="Rv1483"
                     /gene_synonym="mabA"
                     /db_xref="GeneID:886551"
     CDS             1673440..1674183
                     /gene="fabG1"
                     /locus_tag="Rv1483"
                     /gene_synonym="mabA"
                     /experiment="COORDINATES:Mass spectrometry[PMID:21969609]"
                     /experiment="DESCRIPTION:Gene expression, mutation
                     analysis[PMID:19685079]"
                     /experiment="DESCRIPTION:Protein purification and
                     characterization[PMID:11932442]"
                     /experiment="EXISTENCE:Mass spectrometry[PMID:20429878]"
                     /experiment="EXISTENCE:Mass spectrometry[PMID:21920479]"
                     /inference="protein motif:PROSITE:PS00061"
                     /note="3-ketoacyl-acyl carrier protein reductase (mycolic
                     acid biosynthesis a protein)"
                     /codon_start=1
                     /transl_table=11
                     /product="3-oxoacyl-ACP reductase FabG"
                     /protein_id="NP_215999.1"
                     /db_xref="GeneID:886551"
                     /translation="MTATATEGAKPPFVSRSVLVTGGNRGIGLAIAQRLAADGHKVAV
                     THRGSGAPKGLFGVECDVTDSDAVDRAFTAVEEHQGPVEVLVSNAGLSADAFLMRMTE
                     EKFEKVINANLTGAFRVAQRASRSMQRNKFGRMIFIGSVSGSWGIGNQANYAASKAGV
                     IGMARSIARELSKANVTANVVAPGYIDTDMTRALDERIQQGALQFIPAKRVGTPAEVA
                     GVVSFLASEDASYISGAVIPVDGGMGMGH"

So I created a test/ folder in loading/data/ with two very simple files :

exonset.gz:

tx_ac   alt_ac  method  strand  exons_se_i
Rv1483  NC_000962.3 splign-manual   1   1673439,1674183

and txinfo.gz:

origin  ac  hgnc    cds_se_i    exons_se_i
NCBI    Rv1483  fabG1   0,744       0,744;

However I'm a bit stuck for loading into the DB now because I can't install the utils scripts "uta" for loading into the database. So the make load-XXX fail. I think there is some python2/3 compatibility problems that I can't resolve.

Then before going to the codon translation problem, I think I need a couple of other information to understand what needs to be done for bacteria, given that the only difference is for start codons.

  1. Are there any checks in bioutils/hgvs or uta that verify the identify of the first codon in the reference sequences?
  2. Where is the annotation of start codon handled in hgvs?

For comparison, in snpEff, as you give the full set of start codons, all mutations in the first codon are given by p.(Met1?) although some could be defined as start_lost in the sequence ontology whereas other as start_retrained_variants http://sequenceontology.org/browser/current_svn/term/SO:0002019

I realize that HGVS probably never had to care for these cases because humans only have one start codon so my problem regarding genetic code at the moment is irrelevant...

reece commented 4 years ago

Hi @sachalau: Please try pulling UTA again. @andreasprlic and I have been loading new data and came across similar issues. We're using Python 3.7 and I think HEAD should now work.

github-actions[bot] commented 9 months ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 9 months ago

This issue was closed because it has been stalled for 7 days with no activity.

reece commented 7 months ago

This issue was closed by stalebot. It has been reopened to give more time for community review. See biocommons coding guidelines for stale issue and pull request policies. This resurrection is expected to be a one-time event.

github-actions[bot] commented 4 months ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 4 months ago

This issue was closed because it has been stalled for 7 days with no activity.