biocommons / uta

Universal Transcript Archive: comprehensive genome-transcript alignments; multiple transcript sources, versions, and alignment methods; available as a docker image
Apache License 2.0

Public UTA instance is missing NM_020975.5, which differs from NM_020975.4:c.2307_2309GCG>TCA. #215

Closed by jfreidin 5 years ago

jfreidin commented 5 years ago

My local Docker instance of UTA has ['NM_020975.4', 'NM_020975.5'], but the public UTA instance, which I rely on when running without Docker, has only ['NM_020975.4']. This causes a difference in behavior for c.2307_2309GCG>TCA. In NM_020975.4, c.2307 is apparently a T, but in both NM_020975.5 and NC_000010.10 it's a G. So on my local instance, I get

NC_000010.10:  CGAGTGAGCTGCGAGACCTGCTG
delins:                  TCA          

whereas using the public instance:

hgvs.exceptions.HGVSInvalidVariantError: NM_020975.4:c.2307_2309delinsTCA: Variant reference (GCG) does not agree with reference sequence (TCG)

I suspect all that needs to happen is updating the public UTA server with the latest data?
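
For readers unfamiliar with the error above, here is a minimal sketch of the kind of reference-agreement check that produces it. It is not the hgvs implementation: `check_reference` is a hypothetical helper, positions are relative to the short snippet rather than to the transcript, and `OLD_TRANSCRIPT_CONTEXT` is an assumed stand-in that only changes the one base reported as T in NM_020975.4.

```python
def check_reference(seq: str, start: int, end: int, ref: str) -> None:
    """Raise ValueError if seq[start..end] (1-based, inclusive) differs from ref."""
    observed = seq[start - 1:end]
    if observed != ref:
        raise ValueError(
            f"Variant reference ({ref}) does not agree with "
            f"reference sequence ({observed})"
        )


# Genomic context quoted in the issue (NC_000010.10).
# GCG occupies 1-based positions 11-13 of this snippet.
GENOMIC_CONTEXT = "CGAGTGAGCTGCGAGACCTGCTG"

# Assumption for illustration only: the same snippet with T instead of G
# at the first affected base, mimicking what NM_020975.4 reports (TCG).
OLD_TRANSCRIPT_CONTEXT = "CGAGTGAGCTTCGAGACCTGCTG"

# Agrees with the stated reference, so no exception is raised.
check_reference(GENOMIC_CONTEXT, 11, 13, "GCG")

# Disagrees, so the check fails in the same way the public instance did.
try:
    check_reference(OLD_TRANSCRIPT_CONTEXT, 11, 13, "GCG")
    message = ""
except ValueError as err:
    message = str(err)

print(message)
```

The point of the sketch is only that the same variant string can pass or fail depending on which transcript version the data provider can see.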

reece commented 5 years ago

@jfreidin What instance name?

jfreidin commented 5 years ago

INFO:biocommons.seqrepo:biocommons.seqrepo 0.4.4
INFO:hgvs.dataproviders.seqfetcher:Using SeqRepo(/compbio_res/monitoring/seqrepo/2018-11-26) sequence fetching
INFO:hgvs.dataproviders.uta:connected to postgresql://anonymous:anonymous@uta.biocommons.org/uta/uta_20161216...

jfreidin commented 5 years ago

In [6]: hv.validate(hp.parse_hgvs_variant('NM_020975.4:c.2307_2309delGCGinsTCA'))
...
HGVSInvalidVariantError: NM_020975.4:c.2307_2309delinsTCA: Variant reference (GCG) does not agree with reference sequence (TCG)

In [7]: hv.validate(hp.parse_hgvs_variant('NM_020975.5:c.2307_2309delGCGinsTCA'))
...
HGVSDataNotAvailableError: No transcript definition for (tx_ac=NM_020975.5)

reece commented 5 years ago

@jfreidin I should have updated the defaults long ago. The next release will use a newer UTA (uta_20171026).

I don't understand how the content could differ between the public instance and the Docker instance. The workflow has always been to build the public instances from exactly the same snapshots that were used for the Docker images. I have also never made ad hoc changes to a database after deployment. (As it turns out, I learned recently that this was done on uta_20180821. That instance is broken. See biocommons/hgvs#537.)

jfreidin commented 5 years ago

@reece Thank you for clarifying the environmental difference. My local Docker instance is:

UTA_DB_URL=postgresql://anonymous@localhost:15032/uta/uta_20171026

When I switched the public connection from the default URL to

UTA_DB_URL=postgresql://anonymous:anonymous@uta.biocommons.org/uta/uta_20171026

they behave the same:

In [2]: hv.validate(hp.parse_hgvs_variant('NM_020975.5:c.2307_2309delGCGinsTCA'))
INFO:biocommons.seqrepo.fastadir.fastadir:Opening for reading: /compbio_res/monitoring/seqrepo/2018-11-26/sequences/2017/1026/2234/1509057245.87.fa.bgz
Out[2]: True
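
The mismatch came down to the last path segment of the URL, which names the UTA schema (snapshot). As a rough sketch, assuming URLs of the form shown in this thread, a small stdlib helper can make that difference explicit before any queries are compared. `uta_schema` and `same_snapshot` are hypothetical names, not part of hgvs or UTA.

```python
from urllib.parse import urlsplit


def uta_schema(db_url: str) -> str:
    """Return the last path segment of a UTA URL, which names the schema."""
    path = urlsplit(db_url).path  # e.g. "/uta/uta_20171026"
    return path.rstrip("/").rsplit("/", 1)[-1]


def same_snapshot(url_a: str, url_b: str) -> bool:
    """True if both URLs point at the same UTA schema, regardless of host."""
    return uta_schema(url_a) == uta_schema(url_b)


# URLs as they appear in this thread.
DEFAULT_URL = "postgresql://anonymous:anonymous@uta.biocommons.org/uta/uta_20161216"
LOCAL_URL = "postgresql://anonymous@localhost:15032/uta/uta_20171026"
PUBLIC_URL = "postgresql://anonymous:anonymous@uta.biocommons.org/uta/uta_20171026"

print(uta_schema(DEFAULT_URL), same_snapshot(LOCAL_URL, DEFAULT_URL))
print(uta_schema(PUBLIC_URL), same_snapshot(LOCAL_URL, PUBLIC_URL))
```

Comparing schema names this way only shows that two connections target the same snapshot; it does not verify that the contents of the two databases are identical.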
reece commented 5 years ago

Whew! I like reproducibility and thought that I'd blown it somewhere!