Shicheng-Guo / rbiotools


ensg, enst, symbol, gtf, rtracklayer and length is different #14

Open Shicheng-Guo opened 3 years ago

Shicheng-Guo commented 3 years ago

The problem is that chr (seqnames) and regions (ranges) have a different length from the other columns. How can this be solved?

# Using GTF files to extract information about genes, transcripts and related features
http://ceesu.github.io/gtf/
BiocManager::install("rtracklayer")

curl ftp://ftp.ensembl.org/pub/release-94/gtf/mus_musculus/Mus_musculus.GRCm38.94.gtf.gz -o Mus_musculus.GRCm38.94.gtf.gz
curl ftp://ftp.ensembl.org/pub/release-94/gtf/homo_sapiens/Homo_sapiens.GRCh38.94.gtf.gz -o Homo_sapiens.GRCh38.94.gtf.gz

curl -O ftp://ftp.ensembl.org/pub/release-99/variation/indexed_vep_cache/homo_sapiens_vep_99_GRCh38.tar.gz
tar xzf homo_sapiens_vep_99_GRCh38.tar.gz
curl -O ftp://ftp.ensembl.org/pub/release-99/variation/indexed_vep_cache/homo_sapiens_vep_99_GRCh37.tar.gz
tar xzf homo_sapiens_vep_99_GRCh37.tar.gz

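# Note: the curl commands above fetch release 94 (gzipped); this import expects the release 103 GTF in the working directory.
library(rtracklayer)  # also attaches GenomicRanges and its seqnames(), ranges(), strand() accessors
gtf <- rtracklayer::import('Homo_sapiens.GRCh38.103.gtf')
# gtf is a GRanges object. `$` reads only the metadata columns (gene_name, gene_id, ...),
# so gtf$seqnames, gtf$ranges and gtf$strand return NULL (length 0).
# The accessor functions return one value per feature:
chr<-as.character(seqnames(gtf))
region<-ranges(gtf)
strands<-as.character(strand(gtf))
gene_name<-gtf$gene_name
gene_id<-gtf$gene_id
transcript_id<-gtf$transcript_id
length(chr)
length(region)
length(gene_name)
length(gene_id)
length(transcript_id)
length(strands)
# Gene-level rows have no transcript_id, so na.omit() drops them;
# unique() collapses the rows repeated for every exon/CDS of a transcript.
input<-data.frame(gene_name,gene_id,transcript_id)
dim(input)
input<-unique(na.omit(input))
dim(input)
write.table(input,file="human.symbol.ensg.enst.txt",sep="\t",quote=FALSE,row.names=FALSE,col.names=TRUE)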
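
The length mismatch can be reproduced without the full GTF. The following is a minimal sketch, assuming the GenomicRanges package is installed. The `toy` object, its coordinates and its IDs (`G1`, `T1`, `geneA`, ...) are made up for illustration and stand in for the object returned by `rtracklayer::import()`.

```r
library(GenomicRanges)

# Hand-built stand-in for an imported GTF: three features with invented IDs.
toy <- GRanges(
  seqnames      = c("1", "1", "X"),
  ranges        = IRanges(start = c(100, 100, 500), end = c(900, 400, 800)),
  strand        = c("+", "+", "-"),
  type          = c("gene", "transcript", "transcript"),
  gene_id       = c("G1", "G1", "G2"),
  transcript_id = c(NA, "T1", "T2"),
  gene_name     = c("geneA", "geneA", "geneB")
)

# `$` looks only in the metadata columns, so the core fields come back NULL.
length(toy$seqnames)   # 0
length(toy$gene_id)    # 3

# The accessor returns one value per feature.
length(seqnames(toy))  # 3

# as.data.frame() flattens everything into ordinary columns:
# seqnames, start, end, width, strand, followed by the metadata columns.
df <- as.data.frame(toy)

# Keep transcript rows only, so every row has a gene and a transcript ID.
cols <- c("seqnames", "start", "end", "strand",
          "gene_name", "gene_id", "transcript_id")
tx <- unique(df[df$type == "transcript", cols])
print(tx)
```

Converting with `as.data.frame()` first is one way to guarantee that chromosome, coordinates and IDs all have the same number of rows, because they are taken from the same table rather than collected as separate vectors.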