counsyl / hgvs

HGVS variant name parsing and generation
MIT License
171 stars 79 forks source link

how to create or find "genes.refGene" file for hg19 and hg38 #39

Open anopperl opened 7 years ago

anopperl commented 7 years ago

how to create or find "genes.refGene" file for hg19, hg38. i have got "genes.refGene" file from USSC but these are not working for my case

error shows :

Traceback (most recent call last): File "first_py.py", line 38, in hgvs_name, genome, get_transcript=get_transcript) File "build/bdist.linux-x86_64/egg/pyhgvs/init.py", line 1356, in parse_hgvs_name ValueError: transcript is required

naegelyd commented 7 years ago

@anopperl Could you post the code you are trying to run that caused the error shown above? Might help in coming up with a solution. Thanks.

lacek commented 7 years ago

There is already a genes.refGene in the directory pyhgvs/data of this repository. It is simply old but working.

"genes.refGene" is not directly available at UCSC. However, you should be able to created one with latest tables from UCSC database. Refer #26 for an example for hg19. For hg38, you simply need to replace hg19 by hg38 in the example command. (Or download the refGene.txt.gz from ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz)

wxc0218 commented 5 years ago

I used the following command to get the "genes.refGene" file, run the Example usage, still still have the above error 。 mysql -ugenomep --password=password -hgenome-mysql.cse.ucsc.edu -ACD hg19 -BNe "SELECT r.bin,CONCAT(r.name, '.', g.version) AS name,r.chrom,r.strand,r.txStart,r.txEnd,r.cdsStart,r.cdsEnd,r.exonCount,r.exonStarts,r.exonEnds,r.score,r.name2,r.cdsStartStat,r.cdsEndStat,r.exonFrames FROM hg19.refGene r, hgFixed.gbCdnaInfo g WHERE r.name = g.acc" > genes.refGene

davmlaw commented 5 years ago

This is a dupe, see other issue for scripts: https://github.com/counsyl/hgvs/issues/26#issuecomment-523477167

davmlaw commented 2 years ago

I've made a Python package that provides ~800k transcripts (both RefSeq and Ensembl) for PyHGVS

https://github.com/SACGF/cdot

You can either download a JSON.gz file, or use a REST service. To use it:

from cdot.pyhgvs.pyhgvs_transcript import JSONPyHGVSTranscriptFactory, RESTPyHGVSTranscriptFactory

factory = RESTPyHGVSTranscriptFactory()
# factory = JSONPyHGVSTranscriptFactory(["./cdot-0.2.1.refseq.grch38.json.gz"])  # Uses local JSON file
pyhgvs.parse_hgvs_name(hgvs_c, genome, get_transcript=factory.get_transcript_grch37)
GuoFengWang commented 1 year ago

I've made a Python package that provides ~800k transcripts (both RefSeq and Ensembl) for PyHGVS

https://github.com/SACGF/cdot

You can either download a JSON.gz file, or use a REST service. To use it:

from cdot.pyhgvs.pyhgvs_transcript import JSONPyHGVSTranscriptFactory, RESTPyHGVSTranscriptFactory

factory = RESTPyHGVSTranscriptFactory()
# factory = JSONPyHGVSTranscriptFactory(["./cdot-0.2.1.refseq.grch38.json.gz"])  # Uses local JSON file
pyhgvs.parse_hgvs_name(hgvs_c, genome, get_transcript=factory.get_transcript_grch37)

Hi,@davmlaw, I want to get HGVS cdot from REF/ALT using pyhgvs(not parse hgvs string), how could I use the cdot package to help me? Could you show me some example of scripts? Thanks

davmlaw commented 1 year ago

First, install cdot:

python3 -m pip install cdot

Then get the cdot data:

wget https://github.com/SACGF/cdot/releases/download/v0.2.12/cdot-0.2.12.refseq.grch37_grch38.json.gz

This is based off the pyhgvs README - basically we need to load cdot data files to populate the pyhgvs "transcript" record (they wrote their own get_transcript() method in the README)

import pyhgvs as hgvs
from pysam.libcfaidx import FastaFile
from cdot.pyhgvs.pyhgvs_transcript import JSONPyHGVSTranscriptFactory

factory = JSONPyHGVSTranscriptFactory(["./cdot-0.2.12.refseq.grch37_grch38.json.gz"])
transcript = factory.get_transcript_grch37("NM_000352.3")  # change to 38 if you want that build
chrom, offset, ref, alt = ('chr11', 17496508, 'T', 'C')
genome = FastaFile("/data/annotation/fasta/GCF_000001405.25_GRCh37.p13_genomic.fna.gz")
hgvs_name = hgvs.format_hgvs_name(
    chrom, offset, ref, alt, genome, transcript)
GuoFengWang commented 1 year ago

@davmlaw Thank you very much! And I wonder is there any way to get all the relevant transcripts based on chrom and offset?

davmlaw commented 1 year ago

The easiest way to do that is probably using bedtools on a GTF then pulling out transcripts from the cut down GTF

Another way would be to use another Python library for doing genomic type stuff that I wrote called pyreference untested code would be:

my_region_iv = GenomicInterval("chr1", 100, 10000)
for iv, transcript_set in reference.genomic_transcripts[my_region_iv]
    pass
GuoFengWang commented 1 year ago

@davmlaw Hi, your work is great! I used your package and pyhgvs to generate HGVS name. Then, I used another package: hgvs to parse the variant, but it raised HGVSParserError:raise HGVSParseError("{s}: char {exc.position}: {reason}".format( hgvs.exceptions.HGVSParseError: NM_001286350.2(DACT2):c.681_692dup12: char 35: expected EOF . Why hgvs can not parse the result which generated by pyhgvs? How could I solve it?

davmlaw commented 1 year ago

Hi, HGVS is very broad spec and each library only parses a subset of it.

I think the issue is the 12 at the end (this is redundant, and you could work it out from the difference between positions) you need to regenerate them without the numbers or strip them off