Closed kvshams closed 4 years ago
Thank you for trying bayNorm
!
The following code may help you to insert any normalized data, which can be obtained outside Seurat
framework, into a Seurat
object. Then you can find clusters based on that normalized data:
library(bayNorm)
library(Seurat)
data('EXAMPLE_DATA_list')
library(Seurat)
bay_out<-bayNorm(EXAMPLE_DATA_list$inputdata,mean_version = TRUE)
x.seurat <- CreateSeuratObject(counts =bay_out$Bay_out,assay = 'bayNorm')
# x.seurat <- NormalizeData(x.seurat)
x.seurat <- ScaleData(x.seurat)
x.seurat <- FindVariableFeatures(x.seurat, do.plot = FALSE)
#Specifying: assay='bayNorm'
x.seurat <- RunPCA(x.seurat, pc.genes = x.seurat@var.genes,
pcs.compute = 50, do.print = FALSE,assay='bayNorm')
x.seurat <- RunUMAP(x.seurat, dims = 1:10,assay='bayNorm')
#x.seurat <- JackStraw(x.seurat, prop.freq = 0.06)
x.seurat <- FindNeighbors(x.seurat, dims = 1:10)
x.seurat <- FindClusters(x.seurat, resolution = 0.5)
head(Idents(x.seurat), 5)
#It is a toy example. Here only one cluster was found
plot(x.seurat@reductions$umap@cell.embeddings,pch=16,col=as.factor(Idents(x.seurat)))
#Double check that the assay we used comes from bayNorm
x.seurat@reductions$pca@assay.used
Hope this would be helpful.
Thanks @WT215 for the reply. That worked!.
I have a doubt regarding the case control data. If use the baynorm assay
for the clustering, there is huge batch effect (baynorm is done separately for case and control to preserve the structure, is there any other way ?). Please see the Inf
data from seurat-data . Harmoney
is used for integration.
Any suggestion?
Hi @kvshams ,
Maybe have a try bayNorm(Data=cbind(case, control))
(pseudocode), and then look at the clustering of this normalized data? The prior parameters would be estimated across all the cells.
Is the left panel based on Harmoney
?
@WT215 Was thinking to do so, but did not do by anticipating that would create a true bias in the data structure itself (or am I missing some thing?). For eg, if the case is a KO, then we would see an imputed data for the KO gene also, right?. Thats the reason I was doing the case control baynorm
separately (in this case STIM
and CTRL
). Did not test the KO scenario, but will test and get back to you with an appropriate test data.
Both the clustering above is done by Harmoney
in same way. The only difference is that the assay fed in for the process, ie RNA
and Baynorm
imputed assays respectively.
@WT215, hopefully I can use the baynorm output ( ie the bay_out$Bay_out
) directly to the edgeR differential expression without any additional steps right?.
@WT215, hopefully I can use the baynorm output ( ie the
bay_out$Bay_out
) directly to the edgeR differential expression without any additional steps right?.
I have not tried to applied edgeR
to the bayNorm normalized data. MAST
could be a good choice, see example here: Section: Downstream analysis: DE genes detection.
Hi @kvshams ,
Maybe have a try
bayNorm(Data=cbind(case, control))
(pseudocode), and then look at the clustering of this normalized data? The prior parameters would be estimated across all the cells.Is the left panel based on
Harmoney
?
@WT215 Thanks for the suggestion. Just tried different ways. Why not the GG
. Got a good alignment . Do I need to keep anything in mind while using this method?
Hi @kvshams , Maybe have a try
bayNorm(Data=cbind(case, control))
(pseudocode), and then look at the clustering of this normalized data? The prior parameters would be estimated across all the cells. Is the left panel based onHarmoney
?@WT215 Thanks for the suggestion. Just tried different ways. Why not the
GG
. Got a good alignment . Do I need to keep anything in mind while using this method?
@kvshams Thanks for the feedback! Basically, if you want to minimize difference between groups of cells, you can try GG
procedure. LL
procedure could amplify the inter-groups noisy while minimizing the intrinsic noisy within each group.
However, I don't think there is a "once for all" procedure for dealing with the scRNA-seq data. It depends on what answers you would like to find from the data, like DE detection, noisy genes detection or other goals.
In addition to the different procedures, if you have an idea about the mean capture efficiency of your data (\bar{\beta}
, a scalar ranges between 0 and 1), or even have a relevant smFISH data, then you can estimate \beta
accurately. This may lead to a better performance of bayNorm.
Hi @kvshams , Maybe have a try
bayNorm(Data=cbind(case, control))
(pseudocode), and then look at the clustering of this normalized data? The prior parameters would be estimated across all the cells. Is the left panel based onHarmoney
?@WT215 Thanks for the suggestion. Just tried different ways. Why not the
GG
. Got a good alignment . Do I need to keep anything in mind while using this method?@kvshams Thanks for the feedback! Basically, if you want to minimize difference between groups of cells, you can try
GG
procedure.LL
procedure could amplify the inter-groups noisy while minimizing the intrinsic noisy within each group.However, I don't think there is a "once for all" procedure for dealing with the scRNA-seq data. It depends on what answers you would like to find from the data, like DE detection, noisy genes detection or other goals.
In addition to the different procedures, if you have an idea about the mean capture efficiency of your data (
\bar{\beta}
, a scalar ranges between 0 and 1), or even have a relevant smFISH data, then you can estimate\beta
accurately. This may lead to a better performance of bayNorm.
Thanks @WT215 for the detailed and quick reply. My main goal for the analysis is to find trajectory, DE and find any excited cells due to the KO/treatment. According to my understanding,
GG
would be best for DE & TrajectoryLL
would best for finding excited cells/signaturesIs that correct?
@kvshams I think you can directly try GG at first (for any purposes), then do cluster detection to find interesting clusters which may due to KO/treatment. Then LL for any subgroups of cells if you want to look at noisy genes in that subgroup.
Thank you for trying
bayNorm
!The following code may help you to insert any normalized data, which can be obtained outside
Seurat
framework, into aSeurat
object. Then you can find clusters based on that normalized data:library(bayNorm) library(Seurat) data('EXAMPLE_DATA_list') library(Seurat) bay_out<-bayNorm(EXAMPLE_DATA_list$inputdata,mean_version = TRUE) x.seurat <- CreateSeuratObject(counts =bay_out$Bay_out,assay = 'bayNorm') # x.seurat <- NormalizeData(x.seurat) x.seurat <- ScaleData(x.seurat) x.seurat <- FindVariableFeatures(x.seurat, do.plot = FALSE) #Specifying: assay='bayNorm' x.seurat <- RunPCA(x.seurat, pc.genes = x.seurat@var.genes, pcs.compute = 50, do.print = FALSE,assay='bayNorm') x.seurat <- RunUMAP(x.seurat, dims = 1:10,assay='bayNorm') #x.seurat <- JackStraw(x.seurat, prop.freq = 0.06) x.seurat <- FindNeighbors(x.seurat, dims = 1:10) x.seurat <- FindClusters(x.seurat, resolution = 0.5) head(Idents(x.seurat), 5) #It is a toy example. Here only one cluster was found plot(x.seurat@reductions$umap@cell.embeddings,pch=16,col=as.factor(Idents(x.seurat))) #Double check that the assay we used comes from bayNorm x.seurat@reductions$pca@assay.used
Hope this would be helpful.
Do we need to regress out % mitochondria and cell cycle or is it already account the same in any of your prior estimate?
@kvshams
I did not test these two factors rigorously. My suggestion is to filter out most mitochondria related genes at the very beginning (if they are not of interest in your study).
For cell cycles, the GG
procedure may mitigate the difference between cells which belong to different cell cycles. Whether this works maybe checked by visualizing the 2D umap, see if cells were clustered w.r.t to the cell cycles.
All works good with relatively small data. When I try the actual data it exit with some error
BNorm_LL_Sample <- bayNorm(raw_data@assays$RNA@counts,mean_version = TRUE,
+ Conditions=raw_data@meta.data$orig.ident, Prior_type="LL")
Error in asMethod(object) :
Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
BNorm_LL_Batch <- bayNorm(raw_data@assays$RNA@counts,mean_version = TRUE, Conditions=raw_data@meta.data$Batch, Prior_type="
LL")
Error in asMethod(object) :
Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
This data has 480K cells across 4 groups and 20 samples. As this error may be related to the memory, I tried with 3.8T RAM system also (sadly, we cant get more memory), still get the error. If I split the groups, that works fine but that would be more like to be LL
. Is there any work around for this issue?
I updated the package, try install it via:
#version 1.5.4
library(devtools)
devtools::install_github("WT215/bayNorm")
Now bayNorm
support sparse matrix. Let me know if it still does not work.
BTW, how many genes do you keep?
@WT215, thanks for the fix and update. It does seems works and will update how is the performance.
I use all the remnant genes based on the Seurat
default filtering (thats ~15k). Is there any other criterion for the genes to keep?
@WT215, thanks for the fix and update. It does seems works and will update how is the performance.
I use all the remnant genes based on the
Seurat
default filtering (thats ~15k). Is there any other criterion for the genes to keep?
15k is a reasonable number. Seurat
is already good to be used for this purpose. Another way is, for each gene, look at the proportion of cells which have non-zero counts. Then you can set up a threshold.
I am still updating the package. Currently accessing the row of dgCMatrix
is slower than accessing the row of matrix in R https://stackoverflow.com/questions/47997184/extraction-speed-in-matrix-package-is-very-slow-compared-to-regular-matrix-class.
LL
done without any error. But GG is having an error
Error in t_sp(Data): SpMat::init(): requested size is too large
Traceback:
1. bayNorm(raw_data@assays$RNA@counts, mean_version = TRUE, Conditions = raw_data@meta.data$orig.ident,
. Prior_type = "GG")
2. Prior_fun(Data = do.call(cbind, DataList_sr), BETA_vec = do.call(c,
. BETAList), parallel = parallel, NCores = NCores, FIX_MU = FIX_MU,
. GR = GR, BB_SIZE = BB_SIZE, verbose = verbose)
3. t_sp(t_sp(Data)/BETA_vec)
4. t_sp(Data)
Yes, I noticed that on my site too. I am trying to make sp_mat
object from RcppArmadillo
to handle large sparse matrix.
how to fix this issue?
I am looking at this discussion: https://stackoverflow.com/questions/40592054/large-matrices-in-rcpparmadillo-via-the-arma-64bit-word-define, but I got the issue posted here when I tried to build the package: https://stackoverflow.com/questions/55248883/why-does-rcpp-function-segfault-only-when-called-from-r-package-but-not-when-sou.
Thanks for that. Please have a try the new version 1.5.5
. It seems that now only the memory size of the machine matters.
My laptop has memory 32G. Currently it failed on matrix of dimension 32738 737280 (gene cell), but seems to be ok with 32738 * 10000.
If as.matrix
only correlates with the memory size of machine, then I probably change back to use matrix
instead of dgCMatrix
object for processing the data in the future.
@WT215 it still persist!. I tried the highest possible memory in GCP (n1-ultramem-160 3.8TB) :(
Error in asMethod(object) :
Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
Calls: bayNorm ... EstPrior -> <Anonymous> -> - -> - -> as -> asMethod
Execution halted
What is the error message?
What is the error message? updated the previous comment with the error
Maybe we need to debug in the following steps:
rout=Matrix::rowMeans(raw_data@assays$RNA@counts)
be run?qq=as.matrix(raw_data@assays$RNA@counts)
be run successfully on your site?rout=Matrix::rowMeans(qq)
?Step1 is working fine. step to have the error
qq=as.matrix(raw_data@assays$RNA@counts)
Error in asMethod(object) :
Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
as step is not working, the step 3 cant be tested
If
as.matrix
only correlates with the memory size of machine, then I probably change back to usematrix
instead ofdgCMatrix
object for processing the data in the future.
Falling to matrix
is the option here I guess?
@kvshams I just updated the 1.5.7
, hopefully it works.
Error still persists with rccp (even with the highest RAM system).
error: Mat::init(): requested size is too large
Error in EstPrior_rcpp(Data = Data) :
Mat::init(): requested size is too large
Calls: bayNorm -> Prior_fun -> EstPrior -> EstPrior_rcpp
Execution halted
I replaced the Rcpp function with original one in 1.5.8
. You can directly try:
MME_est<-EstPrior(Data=raw_data@assays$RNA@counts,verbose=TRUE)
Can that be run?
Thank you for the update. Now I get this.
at_GG <- bayNorm(raw_data@assays$RNA@counts, mean_version = TRUE, Conditions=raw_data@meta.data$orig.ident, Prior_type="GG")
Priors estimation based on MME method has completed.
Start optimization using spg from BB package.
This part may be time-consuming.
| | 0%Error in serialize(data, node$con) : ignoring SIGPIPE signal
MME_est<-EstPrior(Data=raw_data@assays$RNA@counts,verbose=TRUE)
Priors estimation based on MME method has completed.
@kvshams By learning from the R package liger
, I am using dgCMatrix
throughout the process. Please have a try the 1.5.9
. I can run bayNorm
on a sparse matrix of dimension 32738 genes * 737280 cells
on my laptop (32G memory, 5 cores specified in bayNorm
. Though that test was not completed, it arrived at 1% at time-consuming
part. Hence it seems to work).
The output is still of type matrix
. If it works, I may allow it to be of type dgCMatrix
too. In case that the output is too large to be stored.
The error %Error in serialize(data, node$con) : ignoring SIGPIPE signal
is probably due to that we used matrix
. So the data which was distributed to each cores is still too large, hence the core is highly likely to be terminated. Now with dgCMatrix
, it may works.
UPDATE: now it seems working... will update the performance onece it will get complete (as it may take few hours to get the result, I keeping the log here). Thanks a lot for the quick fix!
at_GG <- bayNorm(raw_data@assays$RNA@counts, mean_version = TRUE, Conditions=raw_data@meta.data$orig.ident, Prior_type="GG")
Priors estimation based on MME method has completed.
Start optimization using spg from BB package.
This part may be time-consuming.
|============= | 18%
Yes it works now!. Thanks a lot.
at_GG <- bayNorm(raw_data@assays$RNA@counts, mean_version = TRUE, Conditions=raw_data@meta.data$orig.ident, Prior_type="GG")
Priors estimation based on MME method has completed.
Start optimization using spg from BB package.
This part may be time-consuming.
|======================================================================| 100%
Prior estimation has completed!
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Glad to hear that! I will also allow mean version to return sparse matrix later.
I am having very hard time in making the seurat object from the baynorm
output (450k Cells across 20 samples). I think the issue is to convert all the samples to one dgCMatrix
. If I am trying to have a work around like
Mtrix_from_Baynorm <- cbind(baynorm$Bay_out_lis)
my.counts <- as(Mtrix_from_Baynorm, "dgCMatrix")
Segmentation fault
Is there any way I can put all the sample into a single dgCMatrix
in baynorm
output?
If I am making one object per sample from the Bay_out_lis
and use the Seurat merge
option, then also it breaks due to memory error.
Hi @kvshams ,
The correct way for cbind
matricies in a list
is Mtrix_from_Baynorm <- do.call(cbind,baynorm$Bay_out_list)
, maybe try this at first, since you have already had that result.
If that does not solve the problem, then another way could be:
at_GG <- bayNorm(raw_data@assays$RNA@counts, mean_version = TRUE, Prior_type="GG")
By setting Conditions=NULL
, the output will be a full matrix instead stored in a list. Hence there is no need to cbind
. Since Prior_type="GG"
, specifying Conditions
is not necessary. You can call the subgroup of cells according to the cell names later. This way may save the memory.
Then, x.seurat <- CreateSeuratObject(counts =at_GG$Bay_out,assay = 'bayNorm')
. No need to convert it to dgCMatrix
, that code (CreateSeuratObject
) will convert it inside that function.
Thanks @WT215 for reopening the issue and suggestions. I have already tried both cbind and cbind2 ie as in
do.call(cbind,baynorm$Bay_out_list)
In both the case seurat
fails (I presume it is in dgcmatrix
converting step as the input from baynorm
is a matrix
and the error is exactly similar that I get to convert baynorm
output matrix to the dgcmatrix
).
Error in .cbind2Csp(x, y) :
Cholmod error 'problem too large' at file ../Core/cholmod_sparse.c, line 92
Let me try the GG as you suggested and see how it goes.
I just updated the 1.5.13
. Now it allows dgCMatrix
throughout the whole process (Input which is not of dgCMatrix
will be converted to dgCMatrix
).
Have a try at_GG <- bayNorm(raw_data@assays$RNA@counts, mean_version = TRUE, Prior_type="GG",out.sparse = TRUE)
. Then you will get
> class(at_GG$Bay_out)
[1] "dgCMatrix"
attr(,"package")
[1] "Matrix"
The rerun with new version getting killed
at_GG <- bayNorm(raw_data@assays$RNA@counts, mean_version = TRUE, parallel = TRUE,NCores = 50, Prior_type="GG", out.sparse = TRUE)
Priors estimation based on MME method has completed.
Start optimization using spg from BB package.
This part may be time-consuming.
|======================================================================| 100%
Prior estimation has completed!
**********************************************************
***********
Killed
Just updated 1.5.14
, please have a try.
@WT215 I think the issue is with the matrix. The baynorm
out is no more a true sparse one and almost all the col*rows are filled with values imputed by baynorm
(many of them are near to zero). Is'nt it worth putting all values <0 to 0 to make it memory efficient?. Is that statistically correct according to your model?
Yes, the output is no more sparse. I don't think replacing <1 with 0 can help. It seems that there is no solution, I need to think about it.
How about filtering out some cells according to the total counts at the very beginning?
In the output replacing <1 with 0 helps with the cost of time! (conversion time). And the v 1.5.14
also works now (I just re-installed the Rccp
and RcppArmadillo
). 1.5.14
is much faster and memory efficient than any previous version!.
I don't think replacing <1 with 0 is correct. Does '1.5.14' work without replacing <1?
Yes, it works without replacing it. I am saving that object and now will see how does it sync with Seurat
Now it gives a new error
at_GG$Bay_out
22865 x 442010 sparse Matrix of class "dgCMatrix"
Error in validityMethod(as(object, superClass)) :
long vectors not supported yet: ../../src/include/Rinlinedfuns.h:522
Try not to access the whole data directly: at_GG$Bay_out
.
Can x.seurat <- CreateSeuratObject(counts =at_GG$Bay_out,assay = 'bayNorm')
be run?
Tried it first time itself and got the same error.
On Thu, Jan 9, 2020 at 11:07 PM Wenhao notifications@github.com wrote:
Try not to access the whole data directly: at_GG$Bay_out.
Can x.seurat <- CreateSeuratObject(counts =at_GG$Bay_out,assay = 'bayNorm') be run?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/WT215/bayNorm/issues/1?email_source=notifications&email_token=ABTFBLC3KQBZIZ6IJ4FSCR3Q47YAHA5CNFSM4J2QMCZ2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEISTPMY#issuecomment-572864435, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABTFBLHVWPTEWJETGCGGFELQ47YAHANCNFSM4J2QMCZQ .
-- Email from machine with auto correction = Cell phone
Thanks for the cool tool!.
Is there any work around on the Seurat object? Thanks, Shams