Closed ltnetcase closed 11 months ago
Hi, thanks for reporting. I am going to write some debug information down as I investigate.
There are 3 files downloaded as GCF_000001405.25_GRCh37.p13_genomic.gff.gz - they are renamed to avoid confusion
refseq/GRCh37/cdot-0.2.21.GCF_000001405.25_GRCh37.p13_genomic.105.20190906.gff.json.gz
refseq/GRCh37/cdot-0.2.21.GCF_000001405.25_GRCh37.p13_genomic.105.20201022.gff.json.gz
refseq/GRCh37/cdot-0.2.21.GCF_000001405.25_GRCh37.p13_genomic.105.20220307.gff.json.gz
20190906 - Does not contain the transcript "NM_138693.4" 20201022 and 20220307 both have the same GFF contents:
I skipped the "NW_003871065.1" contig alignments
NC_000007.13 BestRefSeq mRNA 130415525 130418967 . - . ID=rna-NM_138693.4;Parent=gene-KLF14;Dbxref=GeneID:136259,Genbank:NM_138693.4,HGNC:HGNC:23025,MIM:609393;Name=NM_138693.4;Note=The RefSeq transcript has 8 substitutions%2C 2 non-frameshifting indels compared to this genomic sequence;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=KLF14;inference=similar to RNA sequence%2C mRNA (same species):RefSeq:NM_138693.4;product=Kruppel like factor 14;tag=RefSeq Select;transcript_id=NM_138693.4
NC_000007.13 BestRefSeq exon 130415525 130418967 . - . ID=exon-NM_138693.4-1;Parent=rna-NM_138693.4;Dbxref=GeneID:136259,Genbank:NM_138693.4,HGNC:HGNC:23025,MIM:609393;Note=The RefSeq transcript has 8 substitutions%2C 2 non-frameshifting indels compared to this genomic sequence;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=KLF14;inference=similar to RNA sequence%2C mRNA (same species):RefSeq:NM_138693.4;product=Kruppel like factor 14;tag=RefSeq Select;transcript_id=NM_138693.4
NC_000007.13 BestRefSeq CDS 130417889 130418860 . - 0 ID=cds-NP_619638.2;Parent=rna-NM_138693.4;Dbxref=CCDS:CCDS5825.1,GeneID:136259,Genbank:NP_619638.2,HGNC:HGNC:23025,MIM:609393;Name=NP_619638.2;Note=The RefSeq protein has 2 substitutions compared to this genomic sequence;exception=annotated by transcript or proteomic data;gbkey=CDS;gene=KLF14;inference=similar to AA sequence (same species):RefSeq:NP_619638.2;product=Krueppel-like factor 14;protein_id=NP_619638.2;tag=RefSeq Select
NC_000007.13 RefSeq cDNA_match 130415525 130418967 . - . ID=dab4d9fc80dfeff13bef66881b99a66e;Target=NM_138693.4 1 3511 +;consensus_splices=0;exon_identity=0.978354;for_remapping=1;gap_count=2;identity=0.978354;idty=0.978354;matches=3435;num_ident=3435;num_mismatch=8;pct_coverage=98.0632;pct_coverage_hiqual=98.0632;pct_identity_gap=97.8354;pct_identity_ungap=99.7676;product_coverage=1;rank=1;splices=0;Gap=M84 I67 M1386 I1 M1973
The JSON generated is the same for both (except URL)
{'biotype': ['mRNA'],
'gene_name': 'KLF14',
'gene_version': '136259',
'genome_builds': {'GRCh37': {'cds_end': 130418860,
'cds_start': 130417888,
'contig': 'NC_000007.13',
'exons': [[130415524, 130418967, 0, 1, 3511, 'M84 I67 M1386 I1 M1973']],
'strand': '-',
'tag': 'RefSeq Select',
'url': 'https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/105.20220307/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.gff.gz'}},
'hgnc': '23025',
'id': 'NM_138693.4',
'protein': 'NP_619638.2',
'stop_codon': 1146}
Making the error better to say what is actually doing gives:
WARNING:root:Couldn't set start_codon transcript positions from 130418860: Coordinate (108) inside insertion (I67) - no mapping possible!
Yes, exactly, sorry for the confusion. So have you noticed this as a bug?
Sorry had to stop while working before. Yes - it's a bug - we shouldn't die in an insertion looping through cDNA_match for genomic positions, we should just skip it.
I think I copy/pasted the code for deletions, we can just remove this.
Testing, using ClinVar examples
NM_138693.4(KLF14):c.965G>C (p.Cys322Ser) - GRCh37: Chr7:130417896 NM_138693.4(KLF14):c.947C>T - GRCh37: Chr7:130417914 NM_138693.4(KLF14):c.121T>C (p.Ser41Pro) - GRCh37: Chr7:130418740 NM_138693.4(KLF14):c.245A>G (p.His82Arg) - GRCh37: Chr7:130418616
Test code:
import hgvs
from hgvs.assemblymapper import AssemblyMapper
from cdot.hgvs.dataproviders import JSONDataProvider, RESTDataProvider
hdp = JSONDataProvider(["/data/annotation/cdot_generation/refseq/GRCh37/cdot-0.2.21.GCF_000001405.25_GRCh37.p13_genomic.105.20220307.gff.json.gz"])
am = AssemblyMapper(hdp,
assembly_name='GRCh37',
alt_aln_method='splign', replace_reference=True)
hp = hgvs.parser.Parser()
CLINVAR_TEST_VARIANTS = [
"NM_138693.4(KLF14):c.965G>C",
"NM_138693.4(KLF14):c.947C>T",
"NM_138693.4(KLF14):c.121T>C",
"NM_138693.4(KLF14):c.245A>G",
]
for hgvs_name in CLINVAR_TEST_VARIANTS:
var_c = hp.parse_hgvs_variant(hgvs_name)
var_g = am.c_to_g(var_c)
print(f"{var_c} => {var_g}")
Output:
NM_138693.4(KLF14):c.965G>C => NC_000007.13:g.130417896C>G
NM_138693.4(KLF14):c.947C>T => NC_000007.13:g.130417914G>A
NM_138693.4(KLF14):c.121T>C => NC_000007.13:g.130418740A>G
NM_138693.4(KLF14):c.245A>G => NC_000007.13:g.130418616T>C
OK, fixed in latest code, as this is only a few transcripts I will just wait for other changes before going through a whole release. Thanks for reporting the bug @ltnetcase
When test for the json generation, I met a warning information which I think is not right:
which is from the stderr of json generation process of "GCF_000001405.25_GRCh37.p13_genomic.gff.gz"
Only "NM_138693.4" has this " I67 " Gap description, so I checked the final result from the downloaded new released "cdot-0.2.21.refseq.grch37.json.gz", there's no "start_codon" key in the term "NM_138693.4" data entry, this transcript has 2 non-frameshift insertion compared to the reference genome sequence, the "I67" part is located in 5'utr region, not in coding region, the coordinate "108" is calculated from the distance from reference genome positions which is not correct I think.
Please check the code~
Best Wishes!