fetch_ucsc hg38 kg - Githubissues

BarryDigby commented 3 years ago

You might already be aware of this but using fetch_ucsc.py hg38 kg hg38.txt and subsequently cut -2-11 hg38.txt|genePredToGtf file stdin hg38.gtf appends an extra numeric to the exon ID. Here is a head on the file generated by the code above:

chr1    stdin   transcript  17369   17436   .   -   .   gene_id "ENST00000619216.1"; transcript_id "ENST00000619216.1"; 
chr1    stdin   exon    17369   17436   .   -   .   gene_id "ENST00000619216.1"; transcript_id "ENST00000619216.1"; exon_number "1"; exon_id "ENST00000619216.1.1";

This returns a python error downstream when using CIRCexplorer2. (Apologies I do not have the exact error message , if memory serves it did mention a numeric array and pointed to the exon_id.)

I have also tried to use gencode reference files for GRCh38 using your suggestion in this post here but found that 0 junction sites were being annotated.

Is there a recommended method for using a GRCh38 reference, preferably not with RefSeq annotations?

kepbod commented 3 years ago

You could download the GTF file directly, and convert the file from GTF to GenePred which is required by CIRCexplorer2. I will use GENCODE annotations as an example. You could download the GTF file (gencode.v34.annotation.gtf) from https://www.gencodegenes.org/human/. Then using below command to convert it from GTF to GenePred:

gtfToGenePred -genePredExt -geneNameAsName2 gencode.v34.annotation.gtf gencode.v34.annotation.genepred
perl -alne '$"="\t";print "@F[11,0..9]"' gencode.v34.annotation.genepred gencode.v34.annotation.txt

BarryDigby commented 3 years ago

Thank you for the added piece of perl code. Looking forward to trying this out

BarryDigby commented 3 years ago

For posterity, here are the commands used to generate the FASTA and GTF / TXT file using gencodes latest GRCh38 files:

        wget --no-check-certificate ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/gencode.v34.primary_assembly.annotation.gtf.gz
        wget --no-check-certificate ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/GRCh38.primary_assembly.genome.fa.gz
        gunzip gencode.v34.primary_assembly.annotation.gtf.gz
        gunzip GRCh38.primary_assembly.genome.fa.gz
        mv gencode.v34.primary_assembly.annotation.gtf hg38.gtf
        mv GRCh38.primary_assembly.genome.fa hg38.fa.tmp
        sed 's/\s.*$//' hg38.fa.tmp > hg38.fa
        gtfToGenePred -genePredExt -geneNameAsName2 hg38.gtf hg38.genepred
        perl -alne '$"="\t";print "@F[11,0..9]"' hg38.genepred > hg38.txt

The sed command removes the whitespace and trailing chromosome name present in the FASTA headers (>chr1 1 --> >chr1) so that headers match the chr column of the annotation files. These work with CIRCexplorer2, find_circ.py, CIRIquant, circminer & mapsplice.

YangLab / CIRCexplorer2

fetch_ucsc hg38 kg #43