Closed gtollefson closed 5 years ago
Hi there,
Thanks for your question,
In the example below, I use the same data set that I added as extdata on countToFPKM
to fetch feature lengths info using biomaRt
.
library("countToFPKM")
library("biomaRt")
library("dplyr")
## Import feature counts matrix
file.readcounts <- system.file("extdata", "RNA-seq.read.counts.csv", package="countToFPKM")
counts <- as.matrix(read.csv(file.readcounts))
## Build a biomart query
# In the example below, I use the human gene annotation from Ensembl release 82 located on "sep2015.archive.ensembl.org"
# More about the ensembl_build can be found on "http://www.ensembl.org/info/website/archives/index.html"
ens_build = "sep2015"
dataset="hsapiens_gene_ensembl"
mart = biomaRt::useMart("ENSEMBL_MART_ENSEMBL", dataset = dataset,
host = paste0(ens_build, ".archive.ensembl.org"), path = "/biomart/martservice", archive = FALSE)
gene.annotations <- biomaRt::getBM(mart = mart, attributes=c("ensembl_gene_id", "external_gene_name", "start_position", "end_position"))
gene.annotations <- dplyr::transmute(gene.annotations, external_gene_name, ensembl_gene_id, length = end_position - start_position)
# Filter and re-order gene.annotations to match the order in feature counts matrix
gene.annotations <- gene.annotations %>% dplyr::filter(ensembl_gene_id %in% rownames(counts))
gene.annotations <- gene.annotations[order(match(gene.annotations$ensembl_gene_id, rownames(counts))),]
# Assign feature lenghts into a numeric vector.
featureLength <- gene.annotations$length
A good tutoiral about using biomart can be found on DAVE TANG'S BLOG
Hope it helps!
Hi, thank you for this step-by step process!
My featurecounts output has UCSC/Refseq gene ids? Is there a way to convert the code to incorporate that?
Thank you!
Hi excel9
You can use UCSC table browser to download list of known genes with their annotations (gene_id, gene_symbol, start, end), as tab-delimited file. Then you can try:
yourfeaturecounts <- as.matrix(read.csv("yourfeatures.counts.csv"))
ucsc.knowngenes <- read.table(file="ucsc.known.genes.txt", header=T, sep="\t")
ucsc.knowngenes <- dplyr::transmute(ucsc.knowngenes, gene_id, gene_symbol, length = end - start)
ucsc.knowngenes <- ucsc.knowngenes %>% dplyr::filter(gene_id %in% rownames(counts))
ucsc.knowngenes <- ucsc.knowngenes[order(match(ucsc.knowngenes$gene_id, rownames(counts))),]
featureLength <- ucsc.knowngenes$length
Hi Ahmed, Thank you very much for your help! I however downloaded the Ensemble mm10 genome and created a counts file with that. But I could not get this code line to work: gene.annotations <- gene.annotations %>% dplyr::filter(ensembl_gene_id %in% rownames(counts))
Whenever I am executing this it says 0 observations of 5 variables were created.
Thank you! Shayoni
length = end_position - start_position
The start and end position are sites on the genome, one will get much bigger length than CDS. So how to get a real gene length matched to mRNA?
If I choose "transcript_start", "transcript_end"
, I got more than one transcript_start and end sites. Don't know how to do that...
Please help.
Dear shitiezhu, the point of useing featureLength is to correct/normalised for effective lengths of features in each library, so we actually here need the full gene length as featureLength.
Dear AAlhendi1707,
All fragments in the library do not include any introns. I think shitezhu is correct. Are we wrong?
Dear Nairobi,
please check this link https://github.com/AAlhendi1707/countToFPKM/issues/1#issuecomment-478263484
I think @shitiezhu is correct for count matrices that are only aligned to exons (i.e. kallisto doesn't pseudoalign to introns): length = end_position - start_position
would not be proper since it would overestimate mapped gene region in many cases since it includes introns
https://www.biostars.org/p/83901/ ^ is more appropriate for constructing featureLength values for exon mapped counts matrices
I'd like to use your package to calculate FPKM values from my ht-seq counts but am having trouble getting started.
I would like to prepare the
featureLength
vector using BiomaRt but don't see the documentation to do this anywhere on the BiomaRt manual. Is this metric called something else? Can you link me to the description of how to prepare this?