SACGF / cdot

Transcript versions for HGVS libraries
MIT License
29 stars 5 forks source link

"NM_138693.4" json generation error #57

Closed ltnetcase closed 11 months ago

ltnetcase commented 11 months ago

When test for the json generation, I met a warning information which I think is not right:

WARNING:root:Couldn't set coding start/end transcript positions: Coordinate (108) inside insertion (I67) - no mapping possible!

which is from the stderr of json generation process of "GCF_000001405.25_GRCh37.p13_genomic.gff.gz"

Only "NM_138693.4" has this " I67 " Gap description, so I checked the final result from the downloaded new released "cdot-0.2.21.refseq.grch37.json.gz", there's no "start_codon" key in the term "NM_138693.4" data entry, this transcript has 2 non-frameshift insertion compared to the reference genome sequence, the "I67" part is located in 5'utr region, not in coding region, the coordinate "108" is calculated from the distance from reference genome positions which is not correct I think.

Please check the code~

Best Wishes!

davmlaw commented 11 months ago

Hi, thanks for reporting. I am going to write some debug information down as I investigate.

There are 3 files downloaded as GCF_000001405.25_GRCh37.p13_genomic.gff.gz - they are renamed to avoid confusion

refseq/GRCh37/cdot-0.2.21.GCF_000001405.25_GRCh37.p13_genomic.105.20190906.gff.json.gz
refseq/GRCh37/cdot-0.2.21.GCF_000001405.25_GRCh37.p13_genomic.105.20201022.gff.json.gz
refseq/GRCh37/cdot-0.2.21.GCF_000001405.25_GRCh37.p13_genomic.105.20220307.gff.json.gz

20190906 - Does not contain the transcript "NM_138693.4" 20201022 and 20220307 both have the same GFF contents:

I skipped the "NW_003871065.1" contig alignments

NC_000007.13    BestRefSeq  mRNA    130415525   130418967   .   -   .   ID=rna-NM_138693.4;Parent=gene-KLF14;Dbxref=GeneID:136259,Genbank:NM_138693.4,HGNC:HGNC:23025,MIM:609393;Name=NM_138693.4;Note=The RefSeq transcript has 8 substitutions%2C 2 non-frameshifting indels compared to this genomic sequence;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=KLF14;inference=similar to RNA sequence%2C mRNA (same species):RefSeq:NM_138693.4;product=Kruppel like factor 14;tag=RefSeq Select;transcript_id=NM_138693.4
NC_000007.13    BestRefSeq  exon    130415525   130418967   .   -   .   ID=exon-NM_138693.4-1;Parent=rna-NM_138693.4;Dbxref=GeneID:136259,Genbank:NM_138693.4,HGNC:HGNC:23025,MIM:609393;Note=The RefSeq transcript has 8 substitutions%2C 2 non-frameshifting indels compared to this genomic sequence;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=KLF14;inference=similar to RNA sequence%2C mRNA (same species):RefSeq:NM_138693.4;product=Kruppel like factor 14;tag=RefSeq Select;transcript_id=NM_138693.4
NC_000007.13    BestRefSeq  CDS 130417889   130418860   .   -   0   ID=cds-NP_619638.2;Parent=rna-NM_138693.4;Dbxref=CCDS:CCDS5825.1,GeneID:136259,Genbank:NP_619638.2,HGNC:HGNC:23025,MIM:609393;Name=NP_619638.2;Note=The RefSeq protein has 2 substitutions compared to this genomic sequence;exception=annotated by transcript or proteomic data;gbkey=CDS;gene=KLF14;inference=similar to AA sequence (same species):RefSeq:NP_619638.2;product=Krueppel-like factor 14;protein_id=NP_619638.2;tag=RefSeq Select
NC_000007.13    RefSeq  cDNA_match  130415525   130418967   .   -   .   ID=dab4d9fc80dfeff13bef66881b99a66e;Target=NM_138693.4 1 3511 +;consensus_splices=0;exon_identity=0.978354;for_remapping=1;gap_count=2;identity=0.978354;idty=0.978354;matches=3435;num_ident=3435;num_mismatch=8;pct_coverage=98.0632;pct_coverage_hiqual=98.0632;pct_identity_gap=97.8354;pct_identity_ungap=99.7676;product_coverage=1;rank=1;splices=0;Gap=M84 I67 M1386 I1 M1973

The JSON generated is the same for both (except URL)

{'biotype': ['mRNA'],
 'gene_name': 'KLF14',
 'gene_version': '136259',
 'genome_builds': {'GRCh37': {'cds_end': 130418860,
   'cds_start': 130417888,
   'contig': 'NC_000007.13',
   'exons': [[130415524, 130418967, 0, 1, 3511, 'M84 I67 M1386 I1 M1973']],
   'strand': '-',
   'tag': 'RefSeq Select',
   'url': 'https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/105.20220307/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.gff.gz'}},
 'hgnc': '23025',
 'id': 'NM_138693.4',
 'protein': 'NP_619638.2',
 'stop_codon': 1146}

Making the error better to say what is actually doing gives:

WARNING:root:Couldn't set start_codon transcript positions from 130418860: Coordinate (108) inside insertion (I67) - no mapping possible!
ltnetcase commented 11 months ago

Yes, exactly, sorry for the confusion. So have you noticed this as a bug?

davmlaw commented 11 months ago

Sorry had to stop while working before. Yes - it's a bug - we shouldn't die in an insertion looping through cDNA_match for genomic positions, we should just skip it.

I think I copy/pasted the code for deletions, we can just remove this.

Testing, using ClinVar examples

NM_138693.4(KLF14):c.965G>C (p.Cys322Ser) - GRCh37: Chr7:130417896 NM_138693.4(KLF14):c.947C>T - GRCh37: Chr7:130417914 NM_138693.4(KLF14):c.121T>C (p.Ser41Pro) - GRCh37: Chr7:130418740 NM_138693.4(KLF14):c.245A>G (p.His82Arg) - GRCh37: Chr7:130418616

Test code:

import hgvs
from hgvs.assemblymapper import AssemblyMapper
from cdot.hgvs.dataproviders import JSONDataProvider, RESTDataProvider

hdp = JSONDataProvider(["/data/annotation/cdot_generation/refseq/GRCh37/cdot-0.2.21.GCF_000001405.25_GRCh37.p13_genomic.105.20220307.gff.json.gz"])

am = AssemblyMapper(hdp,
                    assembly_name='GRCh37',
                    alt_aln_method='splign', replace_reference=True)

hp = hgvs.parser.Parser()

CLINVAR_TEST_VARIANTS = [
        "NM_138693.4(KLF14):c.965G>C",
        "NM_138693.4(KLF14):c.947C>T",
        "NM_138693.4(KLF14):c.121T>C",
        "NM_138693.4(KLF14):c.245A>G",
]

for hgvs_name in CLINVAR_TEST_VARIANTS:
    var_c = hp.parse_hgvs_variant(hgvs_name)
    var_g = am.c_to_g(var_c)
    print(f"{var_c} => {var_g}")

Output:

NM_138693.4(KLF14):c.965G>C => NC_000007.13:g.130417896C>G
NM_138693.4(KLF14):c.947C>T => NC_000007.13:g.130417914G>A
NM_138693.4(KLF14):c.121T>C => NC_000007.13:g.130418740A>G
NM_138693.4(KLF14):c.245A>G => NC_000007.13:g.130418616T>C
davmlaw commented 11 months ago

OK, fixed in latest code, as this is only a few transcripts I will just wait for other changes before going through a whole release. Thanks for reporting the bug @ltnetcase