SACGF / cdot

Transcript versions for HGVS libraries

Ensembl Tark Data Provider #86

Open davmlaw opened 2 weeks ago

davmlaw commented 2 weeks ago

Andy Yates suggested https://tark.ensembl.org/

This has Ensembl transcripts in a format we can use

However, it doesn't have alignments (CIGAR etc.) for RefSeq, so it can't handle gaps. I have raised an issue on the project: Ensembl/tark#81

So I think we should just do Ensembl to start with


Example:

http://tark.ensembl.org/api/transcript/?stable_id=ENST00000256078&stable_id_version=4&expand_all=true

We can get sequence out via:

data["results"][0]["sequence"]["sequence"]
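A minimal sketch of how that lookup might be wrapped, assuming the response shape shown above. The helper names (`tark_transcript_url`, `fetch_json`, `get_seq`) are hypothetical and not part of cdot; only the URL parameters and the `results[0]["sequence"]["sequence"]` path come from the example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Host and path copied from the example URL in this issue
TARK_BASE = "http://tark.ensembl.org/api"


def tark_transcript_url(tx_ac: str) -> str:
    """Build the transcript URL for a versioned accession, e.g. 'ENST00000256078.4'."""
    stable_id, version = tx_ac.rsplit(".", 1)
    params = {
        "stable_id": stable_id,
        "stable_id_version": version,
        "expand_all": "true",
    }
    return f"{TARK_BASE}/transcript/?{urlencode(params)}"


def fetch_json(url: str) -> dict:
    """Fetch a URL and decode the JSON body (no retries or caching in this sketch)."""
    with urlopen(url) as response:
        return json.load(response)


def get_seq(data: dict) -> str:
    """Pull the transcript sequence out of a decoded Tark response."""
    results = data.get("results") or []
    if not results:
        raise LookupError("No transcript found in Tark response")
    return results[0]["sequence"]["sequence"]
```

Usage would be along the lines of `get_seq(fetch_json(tark_transcript_url("ENST00000256078.4")))`.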

We can also get the protein accession out, which covers get_pro_ac_for_tx_ac:

In [16]: t = data["results"][0]["translations"][0]
In [17]: f'{t["stable_id"]}.{t["stable_id_version"]}'
Out[17]: 'ENSP00000256078.4'
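As a rough sketch, the same extraction as a function. The name `pro_ac_from_tark` is hypothetical, and returning `None` when there is no translation (e.g. a non-coding transcript) is an assumption about how we would want to handle that case, not something confirmed from the Tark API.

```python
from typing import Optional


def pro_ac_from_tark(data: dict) -> Optional[str]:
    """Return 'ENSP….version' for the first translation, or None if there is none."""
    results = data.get("results") or []
    if not results:
        return None
    # Assumption: transcripts without a protein have a missing or empty 'translations' list
    translations = results[0].get("translations") or []
    if not translations:
        return None
    t = translations[0]
    return f'{t["stable_id"]}.{t["stable_id_version"]}'
```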

Can implement `get_tx_for_gene`:

http://tark.ensembl.org/api/transcript/search/?identifier_field=KRAS&expand=transcript_release_set%2Cgenes
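A small sketch of building that search URL for an arbitrary gene symbol. `tark_gene_search_url` is a hypothetical helper name; the parameters are taken from the example URL, and the shape of the search response is not assumed here.

```python
from urllib.parse import urlencode

# Host and path copied from the example URL in this issue
TARK_BASE = "http://tark.ensembl.org/api"


def tark_gene_search_url(gene_symbol: str) -> str:
    """Build the transcript search URL for a gene symbol such as 'KRAS'."""
    params = {
        "identifier_field": gene_symbol,
        # urlencode escapes the comma as %2C, matching the example URL
        "expand": "transcript_release_set,genes",
    }
    return f"{TARK_BASE}/transcript/search/?{urlencode(params)}"
```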

Can even implement get_tx_for_region, e.g. via:

http://tark.ensembl.org/api/transcript/?loc_start=25362365&loc_end=25403737&loc_region=12&expand_all=false
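Similarly, a sketch of the region query. `tark_region_url` is a hypothetical name; the `loc_start`, `loc_end`, `loc_region` and `expand_all` parameters come from the example URL. Whether Tark expects contig names with or without a "chr" prefix, and which genome build the coordinates refer to, is not checked here.

```python
from urllib.parse import urlencode

# Host and path copied from the example URL in this issue
TARK_BASE = "http://tark.ensembl.org/api"


def tark_region_url(region: str, start: int, end: int) -> str:
    """Build the transcript-by-location URL for a contig and coordinate range."""
    if start > end:
        raise ValueError(f"start ({start}) must not be greater than end ({end})")
    params = {
        "loc_start": start,
        "loc_end": end,
        "loc_region": region,
        "expand_all": "false",
    }
    return f"{TARK_BASE}/transcript/?{urlencode(params)}"
```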

davmlaw commented 1 week ago

working in branch ensembl_tark

davmlaw commented 3 days ago

OK, merged into main. Need to try it against a test set for a while

Also need to disable RefSeq due to the gap problem

Also need a test for _get_most_recent_release_date
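A possible starting point for that test. This is only a sketch: `most_recent_release_date` below is a hypothetical stand-in, since the real `_get_most_recent_release_date` signature is not shown in this issue, and it assumes each release is a dict with an ISO-format `release_date` string that may be missing or `None`.

```python
from datetime import date
from typing import Optional


def most_recent_release_date(releases: list) -> Optional[date]:
    """Hypothetical stand-in: newest 'release_date' among release dicts, or None."""
    # Assumption: dates are ISO strings like '2021-05-01'; empty or missing values are skipped
    dates = [
        date.fromisoformat(r["release_date"])
        for r in releases
        if r.get("release_date")
    ]
    return max(dates) if dates else None


def test_most_recent_release_date():
    releases = [
        {"release_date": "2019-04-01"},
        {"release_date": "2021-05-01"},
        {"release_date": None},  # release with no date recorded
    ]
    assert most_recent_release_date(releases) == date(2021, 5, 1)
    assert most_recent_release_date([]) is None
```

The real test would call the method on the Tark data provider with a canned response instead of the stand-in.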