dpryan79 / Answers

Random code produced to answer questions on biostars/seqanswers/etc.
8 stars 8 forks source link

Gene length or exon length? #1

Open Rohit-Satyam opened 6 months ago

Rohit-Satyam commented 6 months ago

I was going through your code for calculating feature length but I see you use exon rather than the gene field of the gtf. I was trying to use this code for NOISeq so was wondering which is the right way. I see gene would be more appropriate, wouldn't it?

https://github.com/dpryan79/Answers/blob/d2d7366fec907fd92ccf03e77c2d53cfe45b02c9/SEQanswers_42420/GTF2LengthGC.R#L10

Besides elementNROWS is now used instead of elementLengths. I understand what you wrote was almost a decade ago but people refer to your post regularly as they are good and reliable.

library(GenomicRanges)
library(rtracklayer)
library(Rsamtools)
GTFfile = "../../resource/gencode.v45.primary_assembly.annotation.gtf"
FASTAfile = "../../resource/GRCh38.primary_assembly.genome.fa"

GTF <- import.gff(GTFfile, format="gtf", genome="GRCh38", feature.type="gene")
grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id))
reducedGTF <- unlist(grl, use.names=T)
elementMetadata(reducedGTF)$gene_id <- rep(names(grl), elementNROWS(grl))

#Open the fasta file
FASTA <- FaFile(FASTAfile)
open(FASTA)

#Add the GC numbers
elementMetadata(reducedGTF)$nGCs <- letterFrequency(getSeq(FASTA, reducedGTF), "GC")[,1]
elementMetadata(reducedGTF)$widths <- width(reducedGTF)

#Create a list of the ensembl_id/GC/length
calc_GC_length <- function(x) {
  nGCs = sum(elementMetadata(x)$nGCs)
  width = sum(elementMetadata(x)$widths)
  c(width, nGCs/width)
}
output <- t(sapply(split(reducedGTF, elementMetadata(reducedGTF)$gene_id), calc_GC_length))
colnames(output) <- c("Length", "GC")
Rohit-Satyam commented 6 months ago

A bit faster code. Used chatgpt to fasten the GC content calculation.

library(GenomicRanges)
library(rtracklayer)
library(Rsamtools)
GTFfile = "../../resource/gencode.v45.primary_assembly.annotation.gtf"
FASTAfile = "../../resource/GRCh38.primary_assembly.genome.fa"

GTF <- import.gff(GTFfile, format="gtf", genome="GRCh38", feature.type="gene")
grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id))
reducedGTF <- unlist(grl, use.names=T)
elementMetadata(reducedGTF)$gene_id <- rep(names(grl), elementNROWS(grl))

#Open the fasta file
FASTA <- FaFile(FASTAfile)
open(FASTA)

#Add the GC numbers
elementMetadata(reducedGTF)$nGCs <- letterFrequency(getSeq(FASTA, reducedGTF), "GC")[,1]
elementMetadata(reducedGTF)$widths <- width(reducedGTF)

#Create a list of the ensembl_id/GC/length
library(GenomicRanges) # Assuming this is the type of object being used

calc_GC_length_optimized <- function(reducedGTF) {
  # Extract metadata once
  metadata <- elementMetadata(reducedGTF)

  # Calculate total nGCs and widths for each gene directly without splitting
  # This assumes 'gene_id' is present in the metadata
  total_nGCs <- tapply(metadata$nGCs, metadata$gene_id, sum)
  total_widths <- tapply(metadata$widths, metadata$gene_id, sum)

  # Calculate GC content for each gene and combine with widths
  gc_content <- total_nGCs / total_widths

  # Combine results into a matrix or data.frame
  result <- data.frame(width = total_widths, GC_content = gc_content)
  row.names(result) <- names(total_widths) # Assuming gene_id should be row names

  return(result)
}

# Assuming reducedGTF is already defined and is a GRanges object
output_optimized <- calc_GC_length_optimized(reducedGTF)