SACGF / cdot

Transcript versions for HGVS libraries
MIT License

Differing implementation of get_tx_for_region to hgvs one #38

Closed holtgrewe closed 1 year ago

holtgrewe commented 1 year ago

The SQL query is here:

https://github.com/biocommons/hgvs/blob/main/src/hgvs/dataproviders/uta.py#L151

            select tx_ac,alt_ac,alt_strand,alt_aln_method,min(start_i) as start_i,max(end_i) as end_i
            from exon_set ES
            join exon E on ES.exon_set_id=E.exon_set_id 
            where alt_ac=?
            group by tx_ac,alt_ac,alt_strand,alt_aln_method
            having min(start_i) < ? and ? <= max(end_i)

As far as I can see, this selects a transcript whenever the query lies between the leftmost start coordinate and the rightmost end coordinate of all of its exons, i.e. anywhere within the transcript's span on the reference, introns included.
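To check that reading, here is a minimal sketch that runs the query above against an in-memory SQLite database. The two-table schema, the accessions `NM_TOY.1` / `NC_TOY.1` and the exon coordinates are made up for illustration and are not the real UTA schema or data. It also assumes the placeholders are bound in the order `alt_ac, start_i, end_i`, which matches the comparison in the cdot code.

```python
import sqlite3

# The query as quoted above (placeholders assumed to be alt_ac, start_i, end_i).
QUERY = """
    select tx_ac,alt_ac,alt_strand,alt_aln_method,min(start_i) as start_i,max(end_i) as end_i
    from exon_set ES
    join exon E on ES.exon_set_id=E.exon_set_id
    where alt_ac=?
    group by tx_ac,alt_ac,alt_strand,alt_aln_method
    having min(start_i) < ? and ? <= max(end_i)
"""


def build_toy_db():
    """Create a toy database: one transcript with exons [100, 200) and [300, 400)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        create table exon_set (
            exon_set_id integer primary key,
            tx_ac text, alt_ac text, alt_strand integer, alt_aln_method text
        );
        create table exon (exon_set_id integer, start_i integer, end_i integer);
        """
    )
    conn.execute(
        "insert into exon_set values (1, 'NM_TOY.1', 'NC_TOY.1', 1, 'splign')"
    )
    conn.executemany(
        "insert into exon values (1, ?, ?)", [(100, 200), (300, 400)]
    )
    return conn


def tx_for_region(conn, alt_ac, start_i, end_i):
    """Run the quoted query for one region and return all matching rows."""
    return conn.execute(QUERY, (alt_ac, start_i, end_i)).fetchall()


conn = build_toy_db()
intronic_hits = tx_for_region(conn, "NC_TOY.1", 250, 251)  # inside the intron
outside_hits = tx_for_region(conn, "NC_TOY.1", 450, 451)   # past the last exon
print(intronic_hits)
print(outside_hits)
```

With this toy data the intronic query returns the transcript with its overall span (100, 400), while the query past the last exon returns nothing.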

The code in cdot, in contrast, tests the query against each exon individually, so a transcript is only returned if the query falls within a single exon.

                    for exon in build_data["exons"]:
                        if exon[0] < start_i and end_i <= exon[1]:
                            tx_list.append({
                                "alt_ac": alt_ac,
                                "alt_aln_method": self.NCBI_ALN_METHOD,
                                "alt_strand": strand,
                                "start_i": tx_start,
                                "end_i": tx_end,
                                "tx_ac": transcript_id,
                            })
                            break

davmlaw commented 1 year ago

Hi, thanks for reporting the issue, though I don't think I fully understand what is wrong.

Are you able to provide an example test case, i.e. values for alt_ac, start_i and end_i, and then describe the results you expect vs. the actual results returned? Thanks

holtgrewe commented 1 year ago

Take any intronic position. The hgvs / UTA code will return the transcript; the cdot code will not.
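To make this concrete, here is a minimal sketch with made-up exon coordinates. The function names `overlaps_per_exon` and `overlaps_tx_span` are hypothetical and not part of cdot or hgvs; the first mirrors the per-exon loop quoted above, the second mirrors the `having` clause of the SQL query.

```python
def overlaps_per_exon(exons, start_i, end_i):
    """Per-exon test: the query must fall inside one single exon."""
    return any(
        ex_start < start_i and end_i <= ex_end for ex_start, ex_end in exons
    )


def overlaps_tx_span(exons, start_i, end_i):
    """Span test: the query must fall between the first exon start and last exon end."""
    tx_start = min(ex_start for ex_start, _ in exons)
    tx_end = max(ex_end for _, ex_end in exons)
    return tx_start < start_i and end_i <= tx_end


# Hypothetical transcript with two exons and an intron in between.
exons = [(100, 200), (300, 400)]

exonic = (150, 151)    # inside the first exon
intronic = (250, 251)  # inside the intron

exonic_result = (overlaps_per_exon(exons, *exonic), overlaps_tx_span(exons, *exonic))
intronic_result = (overlaps_per_exon(exons, *intronic), overlaps_tx_span(exons, *intronic))

print("exonic:  ", exonic_result)
print("intronic:", intronic_result)
```

Both tests agree for the exonic position, but for the intronic one only the span test reports the transcript.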

davmlaw commented 1 year ago

Hi, I've made the fix and released an update: cdot v0.2.14

Can you please give it a test, and if there's still a problem please re-open the issue. Thanks!