Gene length or exon length?

dpryan79 / Answers

Random code produced to answer questions on biostars/seqanswers/etc.

8 stars 8 forks source link

I was going through your code for calculating feature length but I see you use exon rather than the gene field of the gtf. I was trying to use this code for NOISeq so was wondering which is the right way. I see gene would be more appropriate, wouldn't it?

https://github.com/dpryan79/Answers/blob/d2d7366fec907fd92ccf03e77c2d53cfe45b02c9/SEQanswers_42420/GTF2LengthGC.R#L10

Besides elementNROWS is now used instead of elementLengths. I understand what you wrote was almost a decade ago but people refer to your post regularly as they are good and reliable.

library(GenomicRanges)
library(rtracklayer)
library(Rsamtools)
GTFfile = "../../resource/gencode.v45.primary_assembly.annotation.gtf"
FASTAfile = "../../resource/GRCh38.primary_assembly.genome.fa"

GTF <- import.gff(GTFfile, format="gtf", genome="GRCh38", feature.type="gene")
grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id))
reducedGTF <- unlist(grl, use.names=T)
elementMetadata(reducedGTF)$gene_id <- rep(names(grl), elementNROWS(grl))

#Open the fasta file
FASTA <- FaFile(FASTAfile)
open(FASTA)

#Add the GC numbers
elementMetadata(reducedGTF)$nGCs <- letterFrequency(getSeq(FASTA, reducedGTF), "GC")[,1]
elementMetadata(reducedGTF)$widths <- width(reducedGTF)

#Create a list of the ensembl_id/GC/length
calc_GC_length <- function(x) {
  nGCs = sum(elementMetadata(x)$nGCs)
  width = sum(elementMetadata(x)$widths)
  c(width, nGCs/width)
}
output <- t(sapply(split(reducedGTF, elementMetadata(reducedGTF)$gene_id), calc_GC_length))
colnames(output) <- c("Length", "GC")

library(GenomicRanges) library(rtracklayer) library(Rsamtools) GTFfile = "../../resource/gencode.v45.primary_assembly.annotation.gtf" FASTAfile = "../../resource/GRCh38.primary_assembly.genome.fa" GTF <- import.gff(GTFfile, format="gtf", genome="GRCh38", feature.type="gene") grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id)) reducedGTF <- unlist(grl, use.names=T) elementMetadata(reducedGTF)$gene_id <- rep(names(grl), elementNROWS(grl)) #Open the fasta file FASTA <- FaFile(FASTAfile) open(FASTA) #Add the GC numbers elementMetadata(reducedGTF)$nGCs <- letterFrequency(getSeq(FASTA, reducedGTF), "GC")[,1] elementMetadata(reducedGTF)$widths <- width(reducedGTF) #Create a list of the ensembl_id/GC/length library(GenomicRanges) # Assuming this is the type of object being used calc_GC_length_optimized <- function(reducedGTF) { # Extract metadata once metadata <- elementMetadata(reducedGTF) # Calculate total nGCs and widths for each gene directly without splitting # This assumes 'gene_id' is present in the metadata total_nGCs <- tapply(metadata$nGCs, metadata$gene_id, sum) total_widths <- tapply(metadata$widths, metadata$gene_id, sum) # Calculate GC content for each gene and combine with widths gc_content <- total_nGCs / total_widths # Combine results into a matrix or data.frame result <- data.frame(width = total_widths, GC_content = gc_content) row.names(result) <- names(total_widths) # Assuming gene_id should be row names return(result) } # Assuming reducedGTF is already defined and is a GRanges object output_optimized <- calc_GC_length_optimized(reducedGTF)

dpryan79 / Answers

Gene length or exon length? #1