Closed by BarryDigby 3 years ago
You can download the GTF file directly and convert it from GTF to GenePred, the format required by CIRCexplorer2. I will use the GENCODE annotation as an example. Download the GTF file (gencode.v34.annotation.gtf) from https://www.gencodegenes.org/human/, then use the commands below to convert it from GTF to GenePred:
gtfToGenePred -genePredExt -geneNameAsName2 gencode.v34.annotation.gtf gencode.v34.annotation.genepred
perl -alne '$"="\t";print "@F[11,0..9]"' gencode.v34.annotation.genepred > gencode.v34.annotation.txt
Thank you for the added piece of Perl code. Looking forward to trying this out.
For posterity, here are the commands used to generate the FASTA and GTF/TXT files from GENCODE's latest GRCh38 files (release 34):
wget --no-check-certificate ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/gencode.v34.primary_assembly.annotation.gtf.gz
wget --no-check-certificate ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/GRCh38.primary_assembly.genome.fa.gz
gunzip gencode.v34.primary_assembly.annotation.gtf.gz
gunzip GRCh38.primary_assembly.genome.fa.gz
mv gencode.v34.primary_assembly.annotation.gtf hg38.gtf
mv GRCh38.primary_assembly.genome.fa hg38.fa.tmp
sed 's/\s.*$//' hg38.fa.tmp > hg38.fa
gtfToGenePred -genePredExt -geneNameAsName2 hg38.gtf hg38.genepred
perl -alne '$"="\t";print "@F[11,0..9]"' hg38.genepred > hg38.txt
The sed command removes the whitespace and trailing chromosome name present in the FASTA headers (>chr1 1 --> >chr1) so that the headers match the chr column of the annotation files. These work with CIRCexplorer2, find_circ.py, CIRIquant, circminer and mapsplice.
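As a small illustration of the header clean-up, the sketch below runs the same sed substitution on a made-up two-sequence FASTA and then checks that every chromosome name in a toy annotation also appears in the cleaned FASTA. The sequences, gene names and the four-column annotation are hypothetical stand-ins; the chromosome is assumed to be in column 3, as in the gene-name-first TXT file produced above.

```shell
# Toy FASTA with GENCODE-style headers; names and sequences are made up.
toy_fasta=$(printf '>chr1 1\nACGT\n>chrM MT\nGGCC')

# Same substitution as in the thread: drop everything from the first
# whitespace to the end of the line. Sequence lines contain no whitespace,
# so they pass through unchanged.
# Note: \s is a GNU sed extension; [[:space:]] is the POSIX spelling.
cleaned=$(printf '%s\n' "$toy_fasta" | sed 's/\s.*$//')
printf '%s\n' "$cleaned"

# Sequence names as they now appear in the FASTA, without the leading ">".
fasta_names=$(printf '%s\n' "$cleaned" | grep '^>' | sed 's/^>//' | sort -u)

# Chromosome column (column 3) of a toy gene-name-first annotation.
toy_annotation=$(printf 'GENE1\tTX1\tchr1\t+\nGENE2\tTX2\tchrM\t+')
annot_names=$(printf '%s\n' "$toy_annotation" | cut -f3 | sort -u)

# Annotation chromosomes that are absent from the FASTA; empty means the
# two files agree. "|| true" keeps grep's "no lines selected" status from
# aborting scripts that run under "set -e".
missing=$(printf '%s\n' "$annot_names" | grep -vxF -e "$fasta_names" || true)
printf 'missing from FASTA: %s\n' "${missing:-none}"
```

On real files, the same comparison could be made by taking the names from hg38.fa and column 3 of hg38.txt.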
You might already be aware of this, but using
fetch_ucsc.py hg38 kg hg38.txt
and subsequently
cut -f2-11 hg38.txt|genePredToGtf file stdin hg38.gtf
appends an extra numeric suffix to the exon ID. Here is a head of the file generated by the code above:
This returns a Python error downstream when using CIRCexplorer2. (Apologies, I do not have the exact error message; if memory serves, it mentioned a numeric array and pointed to the exon_id.)
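For context on the cut step, here is a minimal sketch, assuming the cut command is meant as cut -f2-11 and that the TXT file uses the 11-column gene-name-first layout. The record is made up. It shows that the step simply drops the leading gene-name column, leaving the ten genePred columns that genePredToGtf reads from stdin.

```shell
# Hypothetical 11-column record: gene name first, then the ten genePred columns.
toy_ref=$(printf 'GENE1\tTX1\tchr1\t+\t100\t500\t150\t450\t2\t100,300,\t200,500,')

# Keep fields 2 through 11 (cut splits on tabs by default).
genepred_cols=$(printf '%s\n' "$toy_ref" | cut -f2-11)
printf '%s\n' "$genepred_cols"

# Number of columns left after the cut.
n_fields=$(printf '%s\n' "$genepred_cols" | awk -F'\t' '{print NF}')
printf 'columns: %s\n' "$n_fields"
```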
I have also tried to use GENCODE reference files for GRCh38, following your suggestion in this post, but found that 0 junction sites were being annotated.
Is there a recommended method for using a GRCh38 reference, preferably not with RefSeq annotations?