DavisLaboratory / singscore

An R/Bioconductor package that implements a single-sample molecular phenotyping approach
https://davislaboratory.github.io/singscore/

Error row names contain missing values #33

Closed bihuimel closed 2 years ago

bihuimel commented 2 years ago

Hi All,

Thank you for this great tool. I am working with a 19923 by 9723 expression matrix. I can generate rankData successfully, however simpleScore produces an error, see below.

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : row names contain missing values

The code I used is below:

rankData <- rankGenes(dat.mtx)
scoredf <- simpleScore(rankData, upSet = pathways.hallmark[[2]], downSet = pathways.hallmark[[1]])

I have tried to subset the dat.mtx file down to the first 10000 rows or the last 10000 rows and then run simpleScore. I get the same error. When checking rankData, I do see rownames. Can you please help me find where the missing values are?

Thanks, Bihui

bhuvad commented 2 years ago

Hi,

I apologise for the delay in responding to you. Would you be able to show me what the pathways.hallmark object and the dat.mtx object (first five rows and columns) look like? Alternatively, you could send me a subset of the dat.mtx and the two signatures to test out on my end. It is very likely some issue with row/column names not matching and I can help diagnose that with sample data.
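
As a rough sketch of the kind of name checks that can narrow this down, the base R snippet below looks for missing, empty, or duplicated identifiers in both dimensions of the matrix and lists signature genes that are absent from its row names. The objects `expr` and `sigs` are small hypothetical stand-ins for dat.mtx and pathways.hallmark, and nothing here is confirmed to be the cause of the error in this thread.

```r
# Hypothetical toy stand-ins for dat.mtx and pathways.hallmark
expr <- matrix(
  c(1.2, 0.4, 3.1, 2.2, 0.9, 1.7),
  nrow = 3,
  dimnames = list(c("PSMB5", NA, "TAP1"), c("S1", "S2"))
)
sigs <- list(
  Wang.APM = c("PSMB5", "TAP1", "HLA-A"),
  inflam   = c("CCL5", "CD27")
)

# Positions of missing or empty row/column names
na_rows <- which(is.na(rownames(expr)) | rownames(expr) == "")
na_cols <- which(is.na(colnames(expr)) | colnames(expr) == "")

# Row names that occur more than once
dup_rows <- rownames(expr)[duplicated(rownames(expr))]

# Signature genes that do not appear among the row names
missing_genes <- lapply(sigs, setdiff, y = rownames(expr))

print(na_rows)
print(na_cols)
print(dup_rows)
print(missing_genes)
```

Running the same checks on the real matrix and signatures should show whether any sample or gene identifier is NA before the data reach rankGenes() or simpleScore().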

Cheers, Dharmesh

bihuimel commented 2 years ago

Hi Dharmesh,

Thanks for getting back. Here are the two objects.

dat.mtx[c(1:5), c(1:5)] gives:

       TCGA-02-0047-01A TCGA-02-0055-01A TCGA-02-2483-01A TCGA-02-2485-01A TCGA-02-2486-01A
OR4F5         -3.321928        -3.321928        -3.321928       -3.3219281        -3.321928
OR4F29        -2.902929        -3.321928        -3.321928       -2.8489205        -3.321928
OR4F16        -2.902929        -3.321928        -3.321928       -2.8489205        -3.321928
SAMD11        -1.630394         1.758516         1.801697       -0.1694201        -1.418504
NOC2L          4.136577         5.083277         5.048267        5.1148542         4.354085

pathways.hallmark

$Wang.APM
 [1] "PSMB5"  "PSMB6"  "PSMB7"  "PSMB8"  "PSMB9"  "PSMB10" "TAP1"   "TAP2"   "ERAP1"  "ERAP2"  "CANX"   "CALR"
[13] "PDIA3"  "TAPBP"  "B2M"    "HLA-A"  "HLA-B"  "HLA-C"

$inflam
 [1] "CCL5"     "CD27"     "CD274"    "CD276"    "CD8A"     "CMKLR1"   "CXCL9"    "CXCR6"    "HLA-DQA1" "HLA-DRB1"
[11] "HLA-E"    "IDO1"     "LAG3"     "NKG7"     "PDCD1LG2" "PSMB10"   "STAT1"    "TIGIT"

I re-ran the code, and the call above now works. However, the code below, which passes the whole list as upSet, does not work:

rankData <- rankGenes(dat.mtx)
scoredf <- simpleScore(rankData, upSet = pathways.hallmark)

I also tried running multiScore with the code below:

rankData <- rankGenes(dat.mtx)
scoredf <- multiScore(rankData, upSetColc = pathways.hallmark)

but that gives me an error:

Error in multiScore(rankData, upSetColc = pathways.hallmark) : all(lapply(upSetColc, class) %in% "GeneSet") is not TRUE

I'm wondering if I need to convert the character vectors to something else for this multiScore to work?

Thanks, Bihui

bhuvad commented 2 years ago

Thanks for the sample inputs, and I am so sorry for the delayed response! You are seeing that error because multiScore() requires GeneSet objects: its input must be either a GeneSetCollection or a list of GeneSet objects. You can create the required input by running the code below:

library(GSEABase)

# wrap each character vector in a GeneSet named after its list element
gsc <- lapply(names(pathways.hallmark), function(x) {
  GeneSet(pathways.hallmark[[x]], setName = x)
})
gsc <- GeneSetCollection(gsc) # optional
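
If converting to GeneSet objects is not convenient, one possible alternative is to call simpleScore() once per signature, since it accepts a plain character vector as upSet. The sketch below assumes singscore is installed and uses hypothetical toy data (`expr`, `sigs`) in place of dat.mtx and pathways.hallmark. It collects only the TotalScore column and treats every signature as an up-set, so it is not a drop-in replacement for multiScore().

```r
library(singscore)

# Hypothetical toy data: 20 genes x 4 samples
set.seed(1)
expr <- matrix(
  rnorm(80),
  nrow = 20,
  dimnames = list(paste0("G", 1:20), paste0("S", 1:4))
)
sigs <- list(
  sigA = paste0("G", 1:5),
  sigB = paste0("G", 6:12)
)

rankData <- rankGenes(expr)

# Score each signature separately; each call returns a data.frame per sample
score_list <- lapply(sigs, function(gs) simpleScore(rankData, upSet = gs))

# Samples in rows, signatures in columns
total_scores <- sapply(score_list, function(df) df$TotalScore)
rownames(total_scores) <- colnames(rankData)
print(total_scores)
```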

If you are using (or are planning to use) MSigDB gene-sets, we have recently developed the msigdb R/Bioconductor package that provides the full collection of MSigDB gene-sets as a GeneSetCollection object that is ready to use with singscore. We have also implemented functions to help you access specific collections/subcollections.

To make things easier for users in the future, I will modify the implementation of multiScore() so that it accepts lists of character vectors as input. This should make it into the next Bioconductor release (3.15).

Cheers, Dharmesh