Fetch feature lengths info using biomaRt

gtollefson commented 5 years ago

I'd like to use your package to calculate FPKM values from my ht-seq counts but am having trouble getting started.

I would like to prepare the featureLength vector using BiomaRt but don't see the documentation to do this anywhere on the BiomaRt manual. Is this metric called something else? Can you link me to the description of how to prepare this?

AAlhendi1707 commented 5 years ago

Hi there,

Thanks for your question, In the example below, I use the same data set that I added as extdata on countToFPKM to fetch feature lengths info using biomaRt.

library("countToFPKM")
library("biomaRt")
library("dplyr")

## Import feature counts matrix
file.readcounts <- system.file("extdata", "RNA-seq.read.counts.csv", package="countToFPKM")
counts <- as.matrix(read.csv(file.readcounts))

## Build a biomart query 
# In the example below, I use the human gene annotation from Ensembl release 82 located on "sep2015.archive.ensembl.org"
# More about the ensembl_build can be found on "http://www.ensembl.org/info/website/archives/index.html"

ens_build = "sep2015"
dataset="hsapiens_gene_ensembl"
mart = biomaRt::useMart("ENSEMBL_MART_ENSEMBL", dataset = dataset, 
host = paste0(ens_build, ".archive.ensembl.org"), path = "/biomart/martservice", archive = FALSE)

gene.annotations <- biomaRt::getBM(mart = mart, attributes=c("ensembl_gene_id", "external_gene_name", "start_position", "end_position"))
gene.annotations <- dplyr::transmute(gene.annotations, external_gene_name,  ensembl_gene_id, length = end_position - start_position)

# Filter and re-order gene.annotations to match the order in feature counts matrix
gene.annotations <- gene.annotations %>% dplyr::filter(ensembl_gene_id %in% rownames(counts))
gene.annotations <- gene.annotations[order(match(gene.annotations$ensembl_gene_id, rownames(counts))),]

# Assign feature lenghts into a numeric vector.
featureLength <- gene.annotations$length

A good tutoiral about using biomart can be found on DAVE TANG'S BLOG

Hope it helps!

excel9 commented 4 years ago

Hi, thank you for this step-by step process!

My featurecounts output has UCSC/Refseq gene ids? Is there a way to convert the code to incorporate that?

Thank you!

AAlhendi1707 commented 4 years ago

Hi excel9

You can use UCSC table browser to download list of known genes with their annotations (gene_id, gene_symbol, start, end), as tab-delimited file. Then you can try:

yourfeaturecounts <- as.matrix(read.csv("yourfeatures.counts.csv"))

ucsc.knowngenes <- read.table(file="ucsc.known.genes.txt", header=T, sep="\t")
ucsc.knowngenes <- dplyr::transmute(ucsc.knowngenes, gene_id,  gene_symbol, length = end - start)

ucsc.knowngenes <- ucsc.knowngenes %>% dplyr::filter(gene_id %in% rownames(counts))
ucsc.knowngenes <- ucsc.knowngenes[order(match(ucsc.knowngenes$gene_id, rownames(counts))),]

featureLength <- ucsc.knowngenes$length

excel9 commented 4 years ago

Hi Ahmed, Thank you very much for your help! I however downloaded the Ensemble mm10 genome and created a counts file with that. But I could not get this code line to work: gene.annotations <- gene.annotations %>% dplyr::filter(ensembl_gene_id %in% rownames(counts))

Whenever I am executing this it says 0 observations of 5 variables were created.

Thank you! Shayoni

shitiezhu commented 3 years ago

length = end_position - start_position The start and end position are sites on the genome, one will get much bigger length than CDS. So how to get a real gene length matched to mRNA? If I choose "transcript_start", "transcript_end" , I got more than one transcript_start and end sites. Don't know how to do that... Please help.

AAlhendi1707 commented 2 years ago

Dear shitiezhu, the point of useing featureLength is to correct/normalised for effective lengths of features in each library, so we actually here need the full gene length as featureLength.

Nairobi-2020 commented 1 year ago

Dear AAlhendi1707,

All fragments in the library do not include any introns. I think shitezhu is correct. Are we wrong?

AAlhendi1707 commented 1 year ago

Dear Nairobi,

please check this link https://github.com/AAlhendi1707/countToFPKM/issues/1#issuecomment-478263484

raecv commented 1 year ago

I think @shitiezhu is correct for count matrices that are only aligned to exons (i.e. kallisto doesn't pseudoalign to introns): length = end_position - start_position would not be proper since it would overestimate mapped gene region in many cases since it includes introns

https://www.biostars.org/p/83901/ ^ is more appropriate for constructing featureLength values for exon mapped counts matrices

AAlhendi1707 / countToFPKM

Fetch feature lengths info using biomaRt #2