federicomarini / quantiseqr

https://federicomarini.github.io/quantiseqr/
GNU General Public License v3.0

Quantiseq fails with "duplicate 'row.names' are not allowed" #22

Open Christian-Heyer opened 1 year ago

Christian-Heyer commented 1 year ago

https://github.com/federicomarini/quantiseqr/blob/94e91fc190c8d876b15c0ae52199a5e7deb0f9ab/R/quantiseqr_helpers.R#L433

I am attempting to run quantiseq on the expression data from TCGA-STAD which I downloaded using TCGAbiolinks.

Before running, I remove all duplicate gene names from the data matrix. However, mapGenes introduces new duplicates in the newgenes vector, and at line 433 of quantiseqr_helpers.R (referenced above) mapGenes checks the original data matrix for duplicate genes instead of the newgenes vector.

I am not quite sure what mapGenes does or why it introduces duplicate HGNC gene symbols when the original matrix has none, but it may make sense to change it so that it checks for duplicates in the newgenes vector instead of the original data matrix.
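
A minimal sketch of the suspected mechanism, assuming a made-up mapping table rather than the package's real one. The names expr and old_to_new and all values are hypothetical, and this is not the actual mapGenes code. It only illustrates how a matrix with unique row names can end up with duplicates after re-annotation, and why a duplicate check needs to look at the mapped vector.

```r
# Hypothetical expression matrix with unique gene symbols as row names
expr <- matrix(
  c(1, 2, 3, 4, 5, 6),
  nrow = 3,
  dimnames = list(c("GENE_A", "GENE_B", "GENE_C"), c("s1", "s2"))
)

# Hypothetical re-annotation table: two old symbols map to the same new value
old_to_new <- c(
  GENE_A = "GENE_X",
  GENE_B = "entry withdrawn",
  GENE_C = "entry withdrawn"
)

# Mapped names, in the same order as the rows of expr
newgenes <- unname(old_to_new[rownames(expr)])

# Checking the original names finds no duplicates ...
dup_in_original <- anyDuplicated(rownames(expr)) > 0
# ... while the mapped names do contain duplicates
dup_in_mapped <- anyDuplicated(newgenes) > 0

# One possible handling (an assumption, not the package's behavior):
# keep only the first row for each mapped symbol before renaming
keep <- !duplicated(newgenes)
expr_fixed <- expr[keep, , drop = FALSE]
rownames(expr_fixed) <- newgenes[keep]
```

Keeping the first occurrence is only one choice; summing or averaging the rows that share a mapped symbol would be another, and which one is appropriate depends on what the mapping is meant to do.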


> deconvolute(expr_mat, "quantiseq",
+             tumor = TRUE,)

>>> Running quantiseq

Running quanTIseq deconvolution module

Gene expression normalization and re-annotation (arrays: FALSE)

Error in `.rowNamesDF<-`(x, value = value) : 
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique value when setting 'row.names': 'entry withdrawn' 
> traceback()
10: stop("duplicate 'row.names' are not allowed")
9: `.rowNamesDF<-`(x, value = value)
8: `row.names<-.data.frame`(`*tmp*`, value = value)
7: `row.names<-`(`*tmp*`, value = value)
6: `rownames<-`(`*tmp*`, value = newgenes)
5: mapGenes(mix.mat)
4: fixMixture(mix.mat, arrays = is_arraydata)
3: quantiseqr::run_quantiseq(expression_data = gene_expression_matrix, 
       is_arraydata = arrays, is_tumordata = tumor, scale_mRNA = scale_mrna, 
       ...)
2: deconvolute_quantiseq(gene_expression, tumor = tumor, arrays = arrays, 
       scale_mrna = scale_mrna, ...)
1: deconvolute(tpm_mat, "quantiseq", tumor = TRUE, )
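
The error and the warning in the log above can be reproduced in base R without quantiseqr. The following is a small sketch with hypothetical values (mix and newgenes stand in for the objects named in the traceback); it assumes only that several input symbols were mapped to the same placeholder, as the 'entry withdrawn' warning suggests.

```r
# Small data frame standing in for the mixture matrix (hypothetical values)
mix <- data.frame(s1 = c(1, 2, 3), s2 = c(4, 5, 6))

# Mapped names with a repeated placeholder, as named in the warning
newgenes <- c("GENE_X", "entry withdrawn", "entry withdrawn")

# Data frames require unique row names, so this assignment fails.
# The error message is captured instead of stopping the script;
# the accompanying "non-unique value" warning is suppressed here.
err_msg <- tryCatch(
  suppressWarnings({
    rownames(mix) <- newgenes
    NA_character_
  }),
  error = function(e) conditionMessage(e)
)
print(err_msg)
```

A plain matrix accepts duplicated row names, which is why the problem only surfaces at the point where the names are assigned to a data frame.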

Code used to download the TCGA-STAD data:

library(TCGAbiolinks)
library(SummarizedExperiment)  # provides assays() and rowData()

# Query the STAR count files for TCGA-STAD
query <- TCGAbiolinks::GDCquery(
    project = "TCGA-STAD",
    data.category = "Transcriptome Profiling",
    data.type = "Gene Expression Quantification",
    workflow.type = "STAR - Counts"
)
GDCdownload(query)
STAD_SE <- GDCprepare(query)

# Use the unstranded TPM values, with gene symbols as row names
expr_mat <- assays(STAD_SE)$tpm_unstrand
rownames(expr_mat) <- rowData(STAD_SE)$gene_name
# Remove genes not expressed in the dataset
expr_mat <- expr_mat[rowSums(expr_mat) != 0, ]
# Blindly remove all duplicates
expr_mat <- expr_mat[!duplicated(rownames(expr_mat)), ]

federicomarini commented 1 year ago

Hi Christian, @FFinotello has recently committed a change that could fix the behavior: https://github.com/federicomarini/quantiseqr/commit/181ea43a0defae6a2e5dc13fa8abcd06980f45b5 Could you please double-check that the behavior is now correct?

michael-mazzucco commented 1 year ago

I am also still having this exact same issue!