Open Rohit-Satyam opened 6 months ago
A bit faster code. Used chatgpt to fasten the GC content calculation.
library(GenomicRanges)
library(rtracklayer)
library(Rsamtools)
GTFfile = "../../resource/gencode.v45.primary_assembly.annotation.gtf"
FASTAfile = "../../resource/GRCh38.primary_assembly.genome.fa"
GTF <- import.gff(GTFfile, format="gtf", genome="GRCh38", feature.type="gene")
grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id))
reducedGTF <- unlist(grl, use.names=T)
elementMetadata(reducedGTF)$gene_id <- rep(names(grl), elementNROWS(grl))
#Open the fasta file
FASTA <- FaFile(FASTAfile)
open(FASTA)
#Add the GC numbers
elementMetadata(reducedGTF)$nGCs <- letterFrequency(getSeq(FASTA, reducedGTF), "GC")[,1]
elementMetadata(reducedGTF)$widths <- width(reducedGTF)
#Create a list of the ensembl_id/GC/length
library(GenomicRanges) # Assuming this is the type of object being used
calc_GC_length_optimized <- function(reducedGTF) {
# Extract metadata once
metadata <- elementMetadata(reducedGTF)
# Calculate total nGCs and widths for each gene directly without splitting
# This assumes 'gene_id' is present in the metadata
total_nGCs <- tapply(metadata$nGCs, metadata$gene_id, sum)
total_widths <- tapply(metadata$widths, metadata$gene_id, sum)
# Calculate GC content for each gene and combine with widths
gc_content <- total_nGCs / total_widths
# Combine results into a matrix or data.frame
result <- data.frame(width = total_widths, GC_content = gc_content)
row.names(result) <- names(total_widths) # Assuming gene_id should be row names
return(result)
}
# Assuming reducedGTF is already defined and is a GRanges object
output_optimized <- calc_GC_length_optimized(reducedGTF)
I was going through your code for calculating feature length but I see you use
exon
rather than thegene
field of the gtf. I was trying to use this code for NOISeq so was wondering which is the right way. I seegene
would be more appropriate, wouldn't it?https://github.com/dpryan79/Answers/blob/d2d7366fec907fd92ccf03e77c2d53cfe45b02c9/SEQanswers_42420/GTF2LengthGC.R#L10
Besides
elementNROWS
is now used instead ofelementLengths
. I understand what you wrote was almost a decade ago but people refer to your post regularly as they are good and reliable.