Ensembl / tark

Transcript/Genome alignments (with gaps) - Necessary for RefSeq HGVS c./g. conversion #81

Open davmlaw opened 2 months ago

davmlaw commented 2 months ago

Hi, this project looks good! Thanks!

I would like to use Tark as a source of transcripts for the Biocommons HGVS Python library.

RefSeq transcripts can differ from the genome sequence, so their alignment to a genome build can contain indels.

For instance, NM_001205122.2 (ATG13) aligned to GRCh38 has a 2 bp deletion in exon 15 (the alignment is a 509 bp match, a 2 bp deletion, then a 1753 bp match). This is critical to know when converting between genomic (g.) and coding (c.) HGVS, so you can adjust coordinates for these gaps.

I have already done this in my own project, cdot, which reads RefSeq/Ensembl GFF/GTF files. Ideally I would like to stop maintaining this myself and move over to Tark.

E.g. https://cdot.cc/transcript/NM_001205122.2 has this alignment info for the exon (in Biocommons HGVS style):

[46672254, 46674518, 14, 1635, 3896, "M509 D2 M1753"]
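To make the adjustment concrete, here is a minimal sketch (not cdot or Biocommons HGVS code) of how such a gap string could be parsed and used. It assumes `M` consumes both transcript and genome bases, `D` consumes genome bases only (which matches the lengths in the exon above), and `I` consumes transcript bases only. It also assumes the genomic start is 0-based and the transcript start is 1-based, as the numbers above suggest. The function names are hypothetical.

```python
import re

def parse_gap(gap):
    """Split a gap string such as "M509 D2 M1753" into (op, length) pairs."""
    ops = []
    for token in gap.split():
        m = re.fullmatch(r"([MID])(\d+)", token)
        if m is None:
            raise ValueError(f"unrecognised gap token: {token!r}")
        ops.append((m.group(1), int(m.group(2))))
    return ops

def aligned_lengths(ops):
    """Return (genome_length, transcript_length) covered by the alignment."""
    genome_len = sum(n for op, n in ops if op in "MD")      # assumption: D is genome-only
    transcript_len = sum(n for op, n in ops if op in "MI")  # assumption: I is transcript-only
    return genome_len, transcript_len

def transcript_to_genome_offset(ops, t_offset):
    """Map a 0-based offset within the exon's transcript span to a 0-based
    offset within its genomic span (plus strand only)."""
    g_pos = t_pos = 0
    for op, n in ops:
        if op == "M":
            if t_offset < t_pos + n:
                return g_pos + (t_offset - t_pos)
            g_pos += n
            t_pos += n
        elif op == "D":
            g_pos += n  # genome bases absent from the transcript
        else:  # "I"
            if t_offset < t_pos + n:
                raise ValueError("offset falls in a transcript-only insertion")
            t_pos += n
    raise ValueError("offset is beyond the end of the alignment")

exon = [46672254, 46674518, 14, 1635, 3896, "M509 D2 M1753"]
g_start, g_end, _, c_start, c_end, gap = exon
ops = parse_gap(gap)
genome_len, transcript_len = aligned_lengths(ops)
```

For this exon the genomic span is 2264 bp and the transcript span is 2262 bp, so any transcript position past the first 509 bases of the exon shifts by 2 on the genome. Ignoring the gap would put those coordinates off by 2.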

As far as I can see, Tark doesn't have this yet:

https://tark.ensembl.org/api/transcript/?stable_id=NM_001205122&stable_id_version=2&expand_all=true

                {
                    "exon_id": 73193759,
                    "stable_id": "exon-NR_144423.2-19",
                    "stable_id_version": 1,
                    "assembly": "GRCh38",
                    "loc_start": 46672255,
                    "loc_end": 46674518,
                    "loc_strand": 1,
                    "loc_region": "11",
                    "loc_checksum": "F44BD3F6F8F8764182282A78AE315772F78ECCF8",
                    "exon_checksum": "55D9C6A38CC3510856809E31ED688BB19C01786A",
                    "exon_order": 15
                }

Could you please add these alignment strings to RefSeq transcript exons? Knowing about mismatches would also be useful.

I hope to write a JSON client for HGVS that will only be enabled for Ensembl to start with. Thanks!

davmlaw commented 1 month ago

Hi, I've made an initial implementation of the Biocommons HGVS TARK loader. Review/comments would be very helpful!

I compare the TARK transcript sequence with the sequence obtained by pasting together the exon sequences from the genome. If they differ, I report that transcript/genome alignment as unsupported, so at least we don't get it wrong.
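Roughly, the check looks like the sketch below. This is an illustration with toy data, not the actual loader code. The function names are hypothetical, and exon coordinates are assumed to be 0-based, half-open intervals on a single genome sequence string.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Reverse-complement a DNA string (plain A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]

def spliced_exon_sequence(genome_seq, exons, strand):
    """Concatenate exon slices in genomic order; flip for the minus strand."""
    joined = "".join(genome_seq[start:end] for start, end in sorted(exons))
    return reverse_complement(joined) if strand == -1 else joined

def alignment_is_gapless(transcript_seq, genome_seq, exons, strand):
    """True if the transcript equals the spliced genome exons exactly."""
    spliced = spliced_exon_sequence(genome_seq, exons, strand)
    return transcript_seq.upper() == spliced.upper()

# Toy data: two 4 bp exons separated by a 4 bp intron.
genome = "AACCGGTTAACC"
exons = [(0, 4), (8, 12)]

same = alignment_is_gapless("AACCAACC", genome, exons, strand=1)
minus = alignment_is_gapless("GGTTGGTT", genome, exons, strand=-1)
gapped = alignment_is_gapless("AACCACC", genome, exons, strand=1)  # 1 bp missing
```

A transcript that fails this comparison is rejected rather than converted. That is conservative: it also rejects transcripts that differ from the genome only by mismatches.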