CCBR / RENEE

A comprehensive quality-control and quantification RNA-seq pipeline
https://CCBR.github.io/RENEE/
MIT License

Feature request: RSEM outputs - counts matrix for isoforms #137

Closed: TBrownmiller closed 2 weeks ago

TBrownmiller commented 1 month ago

Hello,

Would it be possible for RENEE to use RSEM's feature for generating a counts matrix from the isoform data? The RSEM documentation says this can be done with the following command:

rsem-generate-data-matrix sampleA.[genes/isoforms].results sampleB.[genes/isoforms].results ... > output_name.counts.matrix

I was able to do this manually by loading RSEM as a module on Biowulf, so there is no rush for this, but I thought it would be useful since many downstream tools use count matrices.

Thanks!

kelly-sovacool commented 1 month ago

Hi @TBrownmiller, thanks for your request. RENEE outputs both gene and isoform counts -- the isoform count matrix is DEG_ALL/RSEM.isoforms.expected_count.all_samples.txt. Is this what you're looking for?

TBrownmiller commented 1 month ago

Sort of. I think it's a difference in the formatting of the outputs. One of the R packages I use (EBSeq), which is usually directly compatible with RSEM outputs, asks for a matrix file (file extension ".MATRIX"), but the RENEE-generated outputs are in txt or tsv format, which isn't directly compatible.
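To make the formatting difference concrete, here is a rough shell sketch of the layout that rsem-generate-data-matrix is understood to produce: a header row of quoted input file names, then one row per transcript with a quoted ID followed by each sample's expected_count. It assumes the standard RSEM results layout with expected_count in the fifth column. The two toy input files, their placeholder values outside the count column, and the output name isoforms.counts.matrix are made up for illustration, and the awk command is only a stand-in for the RSEM tool, so the real output should be checked against it.

```shell
# Work in a scratch directory so the toy files do not clutter anything real
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Toy per-sample RSEM outputs; only the expected_count column (5th) matters here,
# every other numeric value is a placeholder
header='transcript_id\tgene_id\tlength\teffective_length\texpected_count\tTPM\tFPKM\tIsoPct\n'
printf "$header" > sampleA.isoforms.results
printf 'ENST00000595302.1\tENSG00000268895.6\t100\t80\t33.26\t1.0\t1.0\t50\n' >> sampleA.isoforms.results
printf 'ENST00000593960.6\tENSG00000268895.6\t100\t80\t21.17\t1.0\t1.0\t50\n' >> sampleA.isoforms.results
printf "$header" > sampleB.isoforms.results
printf 'ENST00000595302.1\tENSG00000268895.6\t100\t80\t12.7\t1.0\t1.0\t50\n' >> sampleB.isoforms.results
printf 'ENST00000593960.6\tENSG00000268895.6\t100\t80\t10.29\t1.0\t1.0\t50\n' >> sampleB.isoforms.results

awk -F'\t' '
  # First line of each file: skip the column names, record the quoted file name
  FNR == 1 { header = header "\t\"" FILENAME "\""; next }
  # First time an ID is seen: remember its order and start its row with the quoted ID
  !($1 in row) { order[++n] = $1; row[$1] = "\"" $1 "\"" }
  # Append the expected_count (column 5) from the current file
  { row[$1] = row[$1] "\t" $5 }
  END {
    print header
    for (i = 1; i <= n; i++) print row[order[i]]
  }
' sampleA.isoforms.results sampleB.isoforms.results > isoforms.counts.matrix

cat isoforms.counts.matrix
```

Compared with RSEM.isoforms.expected_count.all_samples.txt, this layout has a single ID column and an empty first header cell, which is the shape matrix-oriented readers tend to expect.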

kelly-sovacool commented 1 month ago

Gotcha. We'll make this available in the next release of RENEE -- v2.6.

samarth8392 commented 3 weeks ago

Hello @TBrownmiller, just to follow up on your enquiry: what error do you receive when you try using the RSEM.isoforms.expected_count.all_samples.txt file in EBSeq?

The package vignette says: "The object Data should be a G-by-S matrix containing the expression values for each gene and each sample, where G is the number of genes and S is the number of samples. These values should exhibit raw counts, without normalization across samples."

And the RSEM.isoforms.expected_count.all_samples.txt output file looks like:

gene_id GeneName        transcript_id   sample1 sample2
ENSG00000277411.1       5S_rRNA ENST00000614916.1       0.0     0.0
ENSG00000273730.1       5_8S_rRNA       ENST00000619779.1       11.88   12.62
...
ENSG00000268895.6       A1BG-AS1        ENST00000595302.1       33.26   12.7
ENSG00000268895.6       A1BG-AS1        ENST00000594950.5       0.0     0.0
ENSG00000268895.6       A1BG-AS1        ENST00000593960.6       21.17   10.29

You can create a new data matrix with just rownames and expression values. Try the following code:

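library(dplyr)
library(tibble)
library(EBSeq)

# Read the RENEE isoform count table (first row holds the column names)
df <- read.table("RSEM.isoforms.expected_count.all_samples.txt", header = TRUE)
# Combine the three ID columns into one row name and keep only the count columns
gene.matrix <- df %>%
  mutate(gene = paste(gene_id, GeneName, transcript_id, sep = "_")) %>%
  select(-c(gene_id, GeneName, transcript_id)) %>%
  column_to_rownames("gene") %>%
  as.matrix()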

The gene.matrix should work with EBSeq.

Let us know if that works.
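For anyone who would rather do the same reshaping before opening R, the steps in the R snippet above can be sketched with awk. This is only a sketch under some assumptions: the table is tab-separated, the first three columns are gene_id, GeneName and transcript_id as in the excerpt, and the output name isoforms.expected_count.matrix is a hypothetical choice. The toy input reuses two rows from the excerpt.

```shell
# Work in a scratch directory with a toy copy of the RENEE table
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'gene_id\tGeneName\ttranscript_id\tsample1\tsample2\n' > RSEM.isoforms.expected_count.all_samples.txt
printf 'ENSG00000268895.6\tA1BG-AS1\tENST00000595302.1\t33.26\t12.7\n' >> RSEM.isoforms.expected_count.all_samples.txt
printf 'ENSG00000268895.6\tA1BG-AS1\tENST00000593960.6\t21.17\t10.29\n' >> RSEM.isoforms.expected_count.all_samples.txt

awk 'BEGIN { FS = OFS = "\t" }
  # Header row: leave the ID cell empty and keep only the sample names
  NR == 1 { line = ""; for (i = 4; i <= NF; i++) line = line OFS $i; print line; next }
  # Data rows: join the three ID columns with "_" and append the counts
  { line = $1 "_" $2 "_" $3; for (i = 4; i <= NF; i++) line = line OFS $i; print line }
' RSEM.isoforms.expected_count.all_samples.txt > isoforms.expected_count.matrix

cat isoforms.expected_count.matrix
```

The joined row names match the ones built by paste(..., sep="_") in the R code, so the two approaches should give the same matrix contents.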

kelly-sovacool commented 2 weeks ago

@samarth8392 thanks for posting the R code to transform the count table into a matrix.

Vishal and I discussed this issue and decided to go ahead and add a rule that creates the matrix with RSEM. It runs very quickly and adds very little overhead, and this way our users won't have to transform the other output themselves. See https://github.com/CCBR/RENEE/pull/149