SimonSchlumbohm / HarmonizR

11 stars 3 forks source link

Error when applying ComBat to the submatrices #6

Closed cvanderaa closed 9 months ago

cvanderaa commented 9 months ago

Hello,

I am interested to test HarmonizR on single-cell proteomics data. However, I run into an error that doesn't provide much guidance on how to solve it. Here is a minimal example, using the SCoPE2 data set published by the Slavov lab [1].

library(HarmonizR)
library(scpdata)
library(scp)
data <- specht2019v3()
sce <- getWithColData(data, "peptides")
sce <- filterNA(sce, pNA = 0.9)
input <- t(assay(sce))
design <- data.frame(
    ID = colnames(sce), sample = 1:ncol(sce), 
    batch = as.factor(sce$Set)
)
harmonizR(
    as.data.frame(input), design, 
    algorithm = "ComBat", ComBat_mode = 1
)
[1] "Reading the files..."
[1] "Preparing..."
[1] "Splitting the data using ComBat adjustment..."
Error in `[.data.frame`(x, r, vars, drop = drop) : 
  undefined columns selected

I guess it has to do with how I provide the data. I have read the SOP and the above example should comply to it, but still I run into an error. Your help would be much appreciated :pray:

[1] Specht, Harrison, Edward Emmott, Aleksandra A. Petelski, R. Gray Huffman, David H. Perlman, Marco Serra, Peter Kharchenko, Antonius Koller, and Nikolai Slavov. 2021. “Single-Cell Proteomic and Transcriptomic Analysis of Macrophage Heterogeneity Using SCoPE2.” Genome Biology 22 (1): 50.

cvanderaa commented 9 months ago

Nevermind, I added an unnecessary transpose, apologies for the noise...

SimonSchlumbohm commented 9 months ago

Dear @cvanderaa

I am glad you could solve the issue. If any questions remain, please feel free to contact me again :)

Best

Simon Schlumbohm

cvanderaa commented 9 months ago

Hello Simon,

Thank you for your kind answer. While the problem above is solved, I'm facing another issue which I would love to hear your thoughts.

Context

As I said, I want to apply HarmonizR on single-cell data. I have a data sets with about 1,500 cells (= samples) acquired in 130 MS batches for which I want to correct for batch effects. After QC and filtering on missing values, the data contain 6,700 peptides (I want to avoid protein aggregation). Single-cell proteomics are characterized by high proportion of missing values, meaning that the number of unique batch combination is enormous (in this example, there are 6,679 observed combinations).

Result

I had to change the code to use only 2 cores instead of all cores otherwise my R session would crash due to lack of RAM. As expected, it therefore took quite some time (about 2.5 hours) for HarmonizR to apply ComBat to each split, but I can live with that. However, harmonizR() returned a cured data set that only contains 29 peptides instead of the initial 6,700. After exploring your code I found a line that says:

https://github.com/SimonSchlumbohm/HarmonizR/blob/5fb9ba1d119aa8c78b8000f8b53bcc9d27eba368/HarmonizR/R/splitting.r#L153-L154

In my case, the number of rows of each split is almost always 1, meaning that almost all peptides show a unique combination of batches.

nrow:         1    2    3    4   12 
frequency: 6671    5    1    1    1 

Question

Could you think of a quick fix to circumvent the need for having more than one feature per split?

I expect this to be frequent for single-cell data, but I understand that single-cell applications may be outside the scope of your software. Therefore, I'll let you decide whether it is worth reopening the issue or not.

Many thanks in advance.

SimonSchlumbohm commented 9 months ago

Dear @cvanderaa

if you want, you can test things with the newest HarmonizR version, which is currently being worked on by me.

Please be aware that since it is in the process of being published, it is not really citable at the moment. It should, however, fix your issue. As for single-cell data, I am currently working on an extension specifically for this, but forseeing satisfactory results is difficult as of now. If you use Bioconductor, it should also be installable (from their devel branch) by using BiocManager::install("HarmonizR")

Best

Simon Schlumbohm

cvanderaa commented 9 months ago

Great news! Thanks for the link, I'll give it a try!

cvanderaa commented 9 months ago

I confirm the newer version solved my problem. Many thanks again.

(Self-promoting advertisement) if you are looking for quantitative single-cell proteomics data to test your method, you can have a look at scpdata :wink:

SimonSchlumbohm commented 9 months ago

I'm glad the problem was solved. Your data may indeed come in very handy for further sc-pipeline implementation! Thanks a lot.