WT215 / bayNorm

Normalization for single cell RNA-seq data

Workaround for Seurat object #1

Closed kvshams closed 4 years ago

kvshams commented 4 years ago

Thanks for the cool tool!

Is there any workaround for using a Seurat object? Thanks, Shams

WT215 commented 4 years ago

Thank you for trying bayNorm!

The following code may help you insert normalized data obtained outside the Seurat framework into a Seurat object. You can then find clusters based on that normalized data:

library(bayNorm)
library(Seurat)

data('EXAMPLE_DATA_list')

bay_out <- bayNorm(EXAMPLE_DATA_list$inputdata, mean_version = TRUE)
x.seurat <- CreateSeuratObject(counts = bay_out$Bay_out, assay = 'bayNorm')
# NormalizeData is skipped: the counts are already bayNorm-normalized
# x.seurat <- NormalizeData(x.seurat)
x.seurat <- ScaleData(x.seurat)
x.seurat <- FindVariableFeatures(x.seurat)
# Specifying: assay = 'bayNorm'
x.seurat <- RunPCA(x.seurat, features = VariableFeatures(x.seurat),
                   npcs = 50, assay = 'bayNorm')

x.seurat <- RunUMAP(x.seurat, dims = 1:10, assay = 'bayNorm')
# x.seurat <- JackStraw(x.seurat, prop.freq = 0.06)
x.seurat <- FindNeighbors(x.seurat, dims = 1:10)
x.seurat <- FindClusters(x.seurat, resolution = 0.5)
head(Idents(x.seurat), 5)
# This is a toy example; only one cluster is found here
plot(x.seurat@reductions$umap@cell.embeddings, pch = 16,
     col = as.factor(Idents(x.seurat)))
# Double-check that the assay we used comes from bayNorm
x.seurat@reductions$pca@assay.used

Hope this is helpful.

kvshams commented 4 years ago

Thanks @WT215 for the reply. That worked!

I have a question regarding case-control data. If I use the bayNorm assay for clustering, there is a huge batch effect (bayNorm was run separately for case and control to preserve their structure; is there another way?). Please see the Inf data from seurat-data. Harmony was used for integration.

[image]

Any suggestions?

WT215 commented 4 years ago

Hi @kvshams ,

Maybe try bayNorm(Data=cbind(case, control)) (pseudocode), then look at the clustering of this normalized data. The prior parameters would then be estimated across all the cells.

Is the left panel based on Harmony?
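
As a minimal sketch of that joint run (case and control are hypothetical gene-by-cell count matrices sharing the same gene rows):

library(bayNorm)

# Combine the two groups into one gene x cell matrix
combined <- cbind(case, control)
# Keep track of which cells came from which group, for later subsetting
groups <- c(rep('case', ncol(case)), rep('control', ncol(control)))

# A single bayNorm run across all cells: priors are estimated jointly
bay_joint <- bayNorm(Data = combined, mean_version = TRUE)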

kvshams commented 4 years ago

@WT215 I was thinking of doing so, but didn't, anticipating that it would introduce a real bias into the data structure itself (or am I missing something?). For example, if the case is a KO, then we would see imputed data for the KO gene as well, right? That's the reason I was running bayNorm separately for case and control (in this case STIM and CTRL). I haven't tested the KO scenario, but I will and get back to you with appropriate test data.

Both clusterings above were done by Harmony in the same way. The only difference is the assay fed into the process, i.e. the RNA and bayNorm-imputed assays respectively.

kvshams commented 4 years ago

@WT215, hopefully I can feed the bayNorm output (i.e. bay_out$Bay_out) directly into edgeR differential expression without any additional steps, right?

WT215 commented 4 years ago

I have not tried applying edgeR to bayNorm-normalized data. MAST could be a good choice; see the example in the section "Downstream analysis: DE genes detection".
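
A rough sketch of that MAST route, under the assumption that bay_out$Bay_out is the normalized matrix and groups holds per-cell condition labels (both names hypothetical); the exact coefficient name passed to doLRT depends on your factor levels:

library(MAST)

# Hurdle model on log-transformed bayNorm output
log_norm <- log2(bay_out$Bay_out + 1)
cdat <- data.frame(condition = factor(groups))
fdat <- data.frame(gene = rownames(log_norm))
sca <- FromMatrix(as.matrix(log_norm), cData = cdat, fData = fdat)
zfit <- zlm(~ condition, sca)
# Likelihood-ratio test on the condition coefficient
res <- summary(zfit, doLRT = 'conditioncase')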

kvshams commented 4 years ago

@WT215 Thanks for the suggestion. I just tried different ways, including the GG prior, and got a good alignment. Do I need to keep anything in mind while using this method?

[image]

WT215 commented 4 years ago

@kvshams Thanks for the feedback! Basically, if you want to minimize differences between groups of cells, you can try the GG procedure. The LL procedure could amplify inter-group noise while minimizing the intrinsic noise within each group.

However, I don't think there is a one-size-fits-all procedure for scRNA-seq data. It depends on what answers you would like to extract from the data: DE detection, noisy-gene detection, or other goals.

In addition to the different procedures, if you have an idea of the mean capture efficiency of your data (\bar{\beta}, a scalar ranging between 0 and 1), or even have relevant smFISH data, then you can estimate \beta accurately. This may lead to better performance of bayNorm.
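
For example, a sketch of supplying a known mean capture efficiency (0.1 here is purely illustrative; BetaFun is bayNorm's helper for deriving cell-specific efficiencies from a mean):

library(bayNorm)
data('EXAMPLE_DATA_list')

# Cell-specific capture efficiencies from an assumed mean of 0.1
BETA <- BetaFun(Data = EXAMPLE_DATA_list$inputdata, MeanBETA = 0.1)

# Feed the estimated beta vector into bayNorm
bay_out <- bayNorm(EXAMPLE_DATA_list$inputdata,
                   BETA_vec = BETA$BETA,
                   mean_version = TRUE)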

kvshams commented 4 years ago

Thanks @WT215 for the detailed and quick reply. My main goals for the analysis are to find trajectories, DE, and any cells excited by the KO/treatment. According to my understanding:

  1. GG would be best for DE & trajectory
  2. LL would be best for finding excited cells/signatures

Is that correct?

WT215 commented 4 years ago

@kvshams I think you can try GG first (for any purpose), then run cluster detection to find interesting clusters that may be due to the KO/treatment. Then use LL on any subgroup of cells if you want to look at noisy genes within that subgroup.
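
A sketch of that two-step idea (raw, cluster_labels, and sub_cells are hypothetical placeholders for your count matrix, cluster assignments, and the cells in the subgroups of interest):

# Step 1: GG prior across all cells, for clustering and DE
bay_gg <- bayNorm(raw, mean_version = TRUE, Prior_type = "GG")

# ... cluster on bay_gg$Bay_out, identify interesting subgroups ...

# Step 2: LL prior restricted to the subgroups, to study
# within-group variability / noisy genes
bay_ll <- bayNorm(raw[, sub_cells], mean_version = TRUE,
                  Conditions = cluster_labels[sub_cells],
                  Prior_type = "LL")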

kvshams commented 4 years ago

Regarding the Seurat workflow above: do we need to regress out % mitochondrial content and cell cycle, or is that already accounted for in your prior estimates?

WT215 commented 4 years ago

@kvshams

I did not test these two factors rigorously. My suggestion is to filter out most mitochondria-related genes at the very beginning (if they are not of interest in your study).

For cell cycle, the GG procedure may mitigate differences between cells in different cell-cycle phases. Whether this works can be checked by visualizing the 2D UMAP and seeing whether cells cluster with respect to cell-cycle phase.
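
A minimal sketch of that up-front filter, assuming human gene symbols where mitochondrial genes carry the 'MT-' prefix (adjust the pattern for your annotation):

counts <- raw_data@assays$RNA@counts
# Drop mitochondria-related genes before running bayNorm
mito <- grepl('^MT-', rownames(counts))
counts_nomito <- counts[!mito, ]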

kvshams commented 4 years ago

Everything works well with relatively small data. When I try the actual data, it exits with an error:

BNorm_LL_Sample <- bayNorm(raw_data@assays$RNA@counts, mean_version = TRUE,
                  Conditions = raw_data@meta.data$orig.ident, Prior_type = "LL")
Error in asMethod(object) : 
  Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
BNorm_LL_Batch <- bayNorm(raw_data@assays$RNA@counts, mean_version = TRUE,
                  Conditions = raw_data@meta.data$Batch, Prior_type = "LL")
Error in asMethod(object) : 
  Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105

This data has 480K cells across 4 groups and 20 samples. As this error may be memory-related, I also tried a system with 3.8 TB of RAM (sadly, we can't get more memory) and still got the error. If I split the groups it works fine, but that would effectively be LL. Is there any workaround for this issue?

WT215 commented 4 years ago

I updated the package; try installing it via:

#version 1.5.4
library(devtools)
devtools::install_github("WT215/bayNorm")

bayNorm now supports sparse matrices. Let me know if it still does not work.

By the way, how many genes did you keep?

kvshams commented 4 years ago

@WT215, thanks for the fix and update. It does seem to work; I will report back on the performance.

I use all the genes remaining after Seurat's default filtering (that's ~15k). Is there any other criterion for which genes to keep?

WT215 commented 4 years ago

15k is a reasonable number, and Seurat is already good for this purpose. Another way is, for each gene, to look at the proportion of cells with non-zero counts and then set a threshold, as sketched below.
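
A minimal sketch of that filter, where counts is a gene x cell matrix and the 5% threshold is arbitrary:

library(Matrix)

# Proportion of cells with a non-zero count, per gene
nonzero_prop <- Matrix::rowSums(counts > 0) / ncol(counts)

# Keep genes detected in at least 5% of cells
counts_filtered <- counts[nonzero_prop >= 0.05, ]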

I am still updating the package. Currently, accessing a row of a dgCMatrix is slower than accessing a row of a regular matrix in R: https://stackoverflow.com/questions/47997184/extraction-speed-in-matrix-package-is-very-slow-compared-to-regular-matrix-class.

kvshams commented 4 years ago

LL finished without any error, but GG hits an error:

Error in t_sp(Data): SpMat::init(): requested size is too large
Traceback:

1. bayNorm(raw_data@assays$RNA@counts, mean_version = TRUE, Conditions = raw_data@meta.data$orig.ident, 
 .     Prior_type = "GG")
2. Prior_fun(Data = do.call(cbind, DataList_sr), BETA_vec = do.call(c, 
 .     BETAList), parallel = parallel, NCores = NCores, FIX_MU = FIX_MU, 
 .     GR = GR, BB_SIZE = BB_SIZE, verbose = verbose)
3. t_sp(t_sp(Data)/BETA_vec)
4. t_sp(Data)

WT215 commented 4 years ago

Yes, I noticed that on my side too. I am trying to use the sp_mat object from RcppArmadillo to handle large sparse matrices.

kvshams commented 4 years ago

Is there a way to fix this issue?

WT215 commented 4 years ago

I am looking at this discussion: https://stackoverflow.com/questions/40592054/large-matrices-in-rcpparmadillo-via-the-arma-64bit-word-define, but when I tried to build the package I hit the issue posted here: https://stackoverflow.com/questions/55248883/why-does-rcpp-function-segfault-only-when-called-from-r-package-but-not-when-sou.

kvshams commented 4 years ago

I think it's worth looking at this thread.

WT215 commented 4 years ago

Thanks for that. Please try the new version, 1.5.5. It seems that now only the memory size of the machine matters.

My laptop has 32 GB of memory. Currently it fails on a matrix of dimension 32738 * 737280 (gene * cell, ~2.4e10 entries), but seems to be OK with 32738 * 10000.

If as.matrix only depends on the memory size of the machine, then I will probably switch back to using matrix instead of dgCMatrix for processing the data in the future.

kvshams commented 4 years ago

@WT215 it still persists! I tried the highest-memory machine available in GCP (n1-ultramem-160, 3.8 TB) :(

Error in asMethod(object) : 
  Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
Calls: bayNorm ... EstPrior -> <Anonymous> -> - -> - -> as -> asMethod
Execution halted

WT215 commented 4 years ago

What is the error message?

kvshams commented 4 years ago

I updated the previous comment with the error.

WT215 commented 4 years ago

Maybe we need to debug with the following steps:

  1. Can rout=Matrix::rowMeans(raw_data@assays$RNA@counts) be run?
  2. Can qq=as.matrix(raw_data@assays$RNA@counts) be run successfully on your side?
  3. If step 2 is OK, then how about rout=Matrix::rowMeans(qq)?

kvshams commented 4 years ago

Step 1 works fine; step 2 gives the error:

qq=as.matrix(raw_data@assays$RNA@counts)
Error in asMethod(object) : 
  Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105

As step 2 is not working, step 3 can't be tested.

kvshams commented 4 years ago

Falling back to matrix, as you mentioned above, is the option here, I guess?

WT215 commented 4 years ago

@kvshams I just updated to 1.5.7; hopefully it works.

kvshams commented 4 years ago

The error still persists with Rcpp (even on the highest-RAM system).

error: Mat::init(): requested size is too large
Error in EstPrior_rcpp(Data = Data) : 
  Mat::init(): requested size is too large
Calls: bayNorm -> Prior_fun -> EstPrior -> EstPrior_rcpp
Execution halted

WT215 commented 4 years ago

I replaced the Rcpp function with the original one in 1.5.8. You can directly try:

MME_est<-EstPrior(Data=raw_data@assays$RNA@counts,verbose=TRUE)

Can that be run?

kvshams commented 4 years ago

Thank you for the update. Now I get this:

at_GG <- bayNorm(raw_data@assays$RNA@counts, mean_version = TRUE, Conditions=raw_data@meta.data$orig.ident, Prior_type="GG")

 Priors estimation based on MME method has completed.
Start optimization using spg from BB package.
This part may be time-consuming.
  |                                                                      |   0%Error in serialize(data, node$con) : ignoring SIGPIPE signal
MME_est<-EstPrior(Data=raw_data@assays$RNA@counts,verbose=TRUE)
Priors estimation based on MME method has completed.

WT215 commented 4 years ago

@kvshams Learning from the R package liger, I am now using dgCMatrix throughout the process. Please try 1.5.9. I can run bayNorm on a sparse matrix of 32738 genes * 737280 cells on my laptop (32 GB memory, 5 cores specified in bayNorm). That test was not run to completion, but it reached 1% of the time-consuming part, so it seems to work.

The output is still of type matrix. If this works, I may allow the output to be of type dgCMatrix too, in case the output is too large to store.

The error Error in serialize(data, node$con) : ignoring SIGPIPE signal is probably due to our use of matrix: the data distributed to each core was still too large, so a core was likely terminated. Now, with dgCMatrix, it may work.

kvshams commented 4 years ago

UPDATE: now it seems to be working... I will report the performance once it completes (it may take a few hours to get the result, so I am keeping the log here). Thanks a lot for the quick fix!

at_GG <- bayNorm(raw_data@assays$RNA@counts, mean_version = TRUE, Conditions=raw_data@meta.data$orig.ident, Prior_type="GG")
Priors estimation based on MME method has completed.
Start optimization using spg from BB package.
This part may be time-consuming.
  |=============                                                         |  18%

kvshams commented 4 years ago

Yes, it works now! Thanks a lot.


at_GG <- bayNorm(raw_data@assays$RNA@counts, mean_version = TRUE, Conditions=raw_data@meta.data$orig.ident, Prior_type="GG")
Priors estimation based on MME method has completed.
Start optimization using spg from BB package.
This part may be time-consuming.

|======================================================================| 100%
Prior estimation has completed!
[... repeated per-group progress bars omitted ...]
WT215 commented 4 years ago

Glad to hear that! I will also allow the mean version to return a sparse matrix later.

kvshams commented 4 years ago

I am having a very hard time making the Seurat object from the bayNorm output (450K cells across 20 samples). I think the issue is converting all the samples into one dgCMatrix. If I try a workaround like

Mtrix_from_Baynorm <- cbind(baynorm$Bay_out_list)
my.counts <- as(Mtrix_from_Baynorm, "dgCMatrix")
Segmentation fault

Is there any way to put all the samples into a single dgCMatrix in the bayNorm output?

If I make one object per sample from Bay_out_list and use Seurat's merge option, it also breaks with a memory error.

WT215 commented 4 years ago

Hi @kvshams ,

The correct way to cbind the matrices in a list is Mtrix_from_Baynorm <- do.call(cbind, baynorm$Bay_out_list); maybe try this first, since you already have that result.

If that does not solve the problem, then another way could be:

at_GG <- bayNorm(raw_data@assays$RNA@counts, mean_version = TRUE, Prior_type="GG")

By setting Conditions=NULL, the output will be one full matrix instead of a list of matrices, so there is no need to cbind. Since Prior_type="GG", specifying Conditions is not necessary; you can extract the subgroups of cells by their cell names later. This may save memory.

Then, x.seurat <- CreateSeuratObject(counts = at_GG$Bay_out, assay = 'bayNorm'). There is no need to convert it to dgCMatrix; CreateSeuratObject does that conversion internally.
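
A sketch of pulling a subgroup back out afterwards, assuming the column order of Bay_out follows the input so the Seurat metadata still lines up ('STIM' as earlier in this thread):

at_GG <- bayNorm(raw_data@assays$RNA@counts, mean_version = TRUE,
                 Prior_type = "GG")

# Subset the normalized matrix by condition using the cell metadata
cond <- raw_data@meta.data$orig.ident
stim_mat <- at_GG$Bay_out[, cond == 'STIM']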

kvshams commented 4 years ago

Thanks @WT215 for reopening the issue and for the suggestions. I have already tried both cbind and cbind2, i.e. as in

do.call(cbind, baynorm$Bay_out_list)

In both cases Seurat fails (I presume at the dgCMatrix conversion step, since the input from bayNorm is a matrix and the error is exactly the same one I get when converting the bayNorm output matrix to dgCMatrix).

Error in .cbind2Csp(x, y) : 
  Cholmod error 'problem too large' at file ../Core/cholmod_sparse.c, line 92

Let me try the GG as you suggested and see how it goes.

WT215 commented 4 years ago

I just updated to 1.5.13. It now allows dgCMatrix throughout the whole process (input that is not a dgCMatrix will be converted to one).

Try at_GG <- bayNorm(raw_data@assays$RNA@counts, mean_version = TRUE, Prior_type="GG", out.sparse = TRUE). Then you will get:

> class(at_GG$Bay_out)
[1] "dgCMatrix"
attr(,"package")
[1] "Matrix"
kvshams commented 4 years ago

The rerun with the new version gets killed:

at_GG <- bayNorm(raw_data@assays$RNA@counts, mean_version = TRUE, parallel = TRUE,NCores = 50, Prior_type="GG", out.sparse = TRUE)

Priors estimation based on MME method has completed.
Start optimization using spg from BB package.
This part may be time-consuming.
|======================================================================| 100%
Prior estimation has completed!
**********************************************************
***********
Killed

WT215 commented 4 years ago

I just updated 1.5.14; please have a try.

kvshams commented 4 years ago

@WT215 I think the issue is with the matrix itself. The bayNorm output is no longer truly sparse: almost all row * column entries are filled with values imputed by bayNorm (many of them near zero). Isn't it worth setting all values < 1 to 0 to make it memory-efficient? Is that statistically correct according to your model?

WT215 commented 4 years ago

Yes, the output is no longer sparse. I don't think replacing < 1 with 0 can help. It seems there is no easy solution; I need to think about it.

How about filtering out some cells according to their total counts at the very beginning?
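
A sketch of that cell filter, with an arbitrary cutoff of 500 total counts per cell:

library(Matrix)

# Total UMI count per cell
total_counts <- Matrix::colSums(raw_data@assays$RNA@counts)

# Keep cells above the (arbitrary) threshold before running bayNorm
counts_kept <- raw_data@assays$RNA@counts[, total_counts >= 500]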

kvshams commented 4 years ago

Replacing < 1 with 0 in the output does help, at the cost of (conversion) time. And v1.5.14 also works now (I just reinstalled Rcpp and RcppArmadillo). 1.5.14 is much faster and more memory-efficient than any previous version!
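
For reference, a memory-friendly way to do that replacement directly on a dgCMatrix output (as with out.sparse = TRUE); note WT215's caveat below that zeroing values < 1 may not be statistically sound:

library(Matrix)

# Zero out the stored entries below 1, then drop the explicit zeros
bay <- at_GG$Bay_out
bay@x[bay@x < 1] <- 0
bay <- Matrix::drop0(bay)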

WT215 commented 4 years ago

I don't think replacing < 1 with 0 is correct. Does 1.5.14 work without the replacement?

kvshams commented 4 years ago

Yes, it works without replacing. I am saving that object and will now see how it syncs with Seurat.

kvshams commented 4 years ago

Now it gives a new error:

at_GG$Bay_out
22865 x 442010 sparse Matrix of class "dgCMatrix"
Error in validityMethod(as(object, superClass)) :
  long vectors not supported yet: ../../src/include/Rinlinedfuns.h:522

WT215 commented 4 years ago

Try not to access the whole data object directly (i.e., printing at_GG$Bay_out).

Can x.seurat <- CreateSeuratObject(counts = at_GG$Bay_out, assay = 'bayNorm') be run?

kvshams commented 4 years ago

I tried that first and got the same error.
