Drizzle-Zhang / scMAGIC

Accurately annotating scRNA-seq data using two rounds of reference-based classification
GNU General Public License v3.0

Cell group 2 is empty - no cells with identity class #8

Closed Tianqi-Ma closed 2 years ago

Tianqi-Ma commented 2 years ago

Hi, Drizzle

I was playing around with scMAGIC on an artificial dataset that contains three human cell types: GM12878, HEPG2 and SKBR3. To determine the limits of scMAGIC, I downsampled this dataset to use as the query data and used the original data as the reference data. But it returns an error:

Attaching SeuratObject
[1] "Sum single cell counts matrix:"
Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')
[1] "Number of overlapped genes:"
[1] 27626
[1] "Start clustering :"
starting worker pid=89759 on localhost:11385 at 10:33:15.868
starting worker pid=89762 on localhost:11385 at 10:33:15.869
starting worker pid=89760 on localhost:11385 at 10:33:15.869
starting worker pid=89761 on localhost:11385 at 10:33:15.870
Attaching SeuratObject
Attaching SeuratObject
Attaching SeuratObject
Attaching SeuratObject
Warning in irlba(A = t(x = object), nv = npcs, ...) :
  You're computing too large a percentage of total singular values, use a standard svd instead.
[1] "Clustering completed!"
[1] "Find marker genes of cell types in reference:"
starting worker pid=166623 on localhost:11385 at 10:38:38.117
starting worker pid=166620 on localhost:11385 at 10:38:38.117
starting worker pid=166621 on localhost:11385 at 10:38:38.118
starting worker pid=166622 on localhost:11385 at 10:38:38.119
Attaching SeuratObject
Attaching SeuratObject
Attaching SeuratObject
Error in checkForRemoteErrors(val) : 
  3 nodes produced errors; first error: Cell group 2 is empty - no cells with identity class 
Calls: scMAGIC_Seurat ... clusterApply -> staticClusterApply -> checkForRemoteErrors
In addition: Warning messages:
1: In eval(predvars, data, env) : NaNs produced
2: In hvf.info$variance.expected[not.const] <- 10^fit$fitted :
  number of items to replace is not a multiple of replacement length
Execution halted
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Execution halted
Execution halted
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Execution halted
Execution halted

I googled a little and found a few possible causes:

  1. Only one cluster was identified (https://github.com/satijalab/seurat/issues/13). But I checked the source code: the clustering resolution is 3, so it should return many more than one cluster.
  2. The cell names contain numbers (https://github.com/satijalab/seurat/issues/1200). But I checked the names and they do not look like what that issue describes.
    > colnames(seurat.query)[1:5]
    [1] "AAACCTGCAAAGCAAT-1_1" "AAACCTGCAATCTGCA-1_1" "AAACCTGCACGGATAG-1_1"
    [4] "AAACCTGCAGCATACT-1_1" "AAACCTGCATCGATGT-1_1"
  3. Maybe the query and ref datasets overlap too much, so I also downsampled the ref dataset. But the error still showed up.

The only solution I can think of is to find a different dataset to use as ref. Do you have any other advice?

Drizzle-Zhang commented 2 years ago

Please check whether all of three cell types exist in the new dataset after downsampling.

Tianqi-Ma commented 2 years ago

Yes, all three types are present in both the ref and query datasets:

> seurat.ref <- seurat.ref[,sample(colnames(seurat.ref),size=5000,replace=F)]
> table(seurat.ref$celltype)

GM12878   HEPG2   SKBR3
   1338    2990     672
> table(seurat.query$orig.ident)

GM12878   HEPG2   SKBR3
    800   16623    3548

BTW, my strategy was to keep two of the three cell types at a fixed number and downsample the remaining one to 8/20/40/80/800 cells to determine the limits of scMAGIC (for example, downsample GM12878 to 8/20/40/80/800 cells and keep HEPG2/SKBR3 unchanged). Initially, I used the dataset in which one cell type was downsampled to only 20 cells and got the error I described. I thought it might be because the cell number was too small, so I tested the dataset downsampled to 800 cells, but the error is still there. 800 cells should be enough for annotation, I suppose.
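
The downsampling strategy described above can be sketched in base R as follows. This is only a minimal illustration: `downsample_one_type` is a hypothetical helper (not part of scMAGIC or Seurat), and the toy label vector with made-up counts stands in for a metadata column such as `seurat.query$orig.ident`.

```r
# Hypothetical helper: keep every cell of the other types and randomly
# keep n_keep cells of the target type. Returns sorted column indices.
downsample_one_type <- function(labels, target, n_keep, seed = 1) {
  set.seed(seed)
  target_idx <- which(labels == target)
  if (length(target_idx) < n_keep) {
    stop("fewer than n_keep cells of type ", target)
  }
  # sample.int avoids the sample(x, n) pitfall when length(x) == 1
  kept_target <- target_idx[sample.int(length(target_idx), n_keep)]
  sort(c(which(labels != target), kept_target))
}

# Toy labels with made-up counts, for illustration only
labels <- rep(c("GM12878", "HEPG2", "SKBR3"), times = c(800, 1600, 350))

for (n in c(8, 20, 40, 80, 800)) {
  keep <- downsample_one_type(labels, "GM12878", n)
  print(table(labels[keep]))
}
```

With a real object, the indices could then be used along the lines of `seurat.query[, colnames(seurat.query)[keep]]`, mirroring the subsetting shown in the console output of this thread.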

Tianqi-Ma commented 2 years ago

Besides, it is not related to the data size. I tried 3000 cells (containing all 3 cell types) and it still reported the error.

> seurat.query <- seurat.query[,sample(colnames(seurat.query),size=3000,replace=F)]
> table(seurat.query$orig.ident)

GM12878   HEPG2   SKBR3
   1849     195     956
> table(seurat.ref$celltype)

GM12878   HEPG2   SKBR3
   1369    3011     620
> seurat.ref <- seurat.ref[,sample(colnames(seurat.ref),size=3000,replace=F)]
> table(seurat.ref$celltype)

GM12878   HEPG2   SKBR3
    832    1789     379
> seurat.query <- scMAGIC_Seurat(seurat.query, seurat.ref, atlas = 'HCL', corr_use_HVGene = 3000)
[1] "Sum single cell counts matrix:"
Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')
[1] "Number of overlapped genes:"
[1] 27626
[1] "Start clustering :"
starting worker pid=189147 on localhost:11713 at 08:20:10.650
starting worker pid=189148 on localhost:11713 at 08:20:10.650
starting worker pid=189146 on localhost:11713 at 08:20:10.650
starting worker pid=189149 on localhost:11713 at 08:20:10.652
Attaching SeuratObject
Attaching SeuratObject
Attaching SeuratObject
Attaching SeuratObject
[1] "Clustering completed!"
[1] "Find marker genes of cell types in reference:"
starting worker pid=200399 on localhost:11713 at 08:20:57.471
starting worker pid=200400 on localhost:11713 at 08:20:57.478
starting worker pid=200398 on localhost:11713 at 08:20:57.504
starting worker pid=200401 on localhost:11713 at 08:20:57.507
Attaching SeuratObject
Attaching SeuratObject
Attaching SeuratObject
Error in checkForRemoteErrors(val) :
  3 nodes produced errors; first error: Cell group 2 is empty - no cells with identity class
In addition: Warning messages:
1: In eval(predvars, data, env) : NaNs produced
2: In hvf.info$variance.expected[not.const] <- 10^fit$fitted :
  number of items to replace is not a multiple of replacement length

Drizzle-Zhang commented 2 years ago

I have tried to make some changes to solve the problem. Please reinstall scMAGIC and check whether it can run successfully.

Tianqi-Ma commented 2 years ago

Thank you so much for updating. And yes, this error has been fixed. Could you please tell me what you've done?

BTW, I encountered another error, which is related to RAM usage. It may occur when dealing with a large dataset; it goes away when I use a smaller dataset.

Error in checkForRemoteErrors(val) : 
  one node produced an error: The total size of the 3 globals exported for future expression ('FUN()') is 557.49 MiB.. This exceeds the maximum allowed size of 500.00 MiB (option 'future.globals.maxSize'). There are three globals: 'data.use' (557.41 MiB of class 'S4'), 'j' (64.98 KiB of class 'numeric') and 'FUN' (14.94 KiB of class 'function')

I found the solution at https://satijalab.org/seurat/articles/future_vignette.html: add the option options(future.globals.maxSize = 1000 * 1024^2). Maybe you can add this setting to the package. But be careful: my job got killed on the server, as this option may take too many resources.
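
A minimal base-R sketch of that setting, together with a rough way to estimate how large an object is before raising the limit. The toy matrix and the names `limit_mib`, `x` and `size_mib` are illustrative assumptions; in the error above the exported object was 'data.use'.

```r
# Raise the limit on exported globals to 1000 MiB (the value is in bytes)
limit_mib <- 1000
options(future.globals.maxSize = limit_mib * 1024^2)

# Estimate the size of an object that would be sent to each worker.
# A toy 1000 x 1000 numeric matrix stands in for the real data here.
x <- matrix(0, nrow = 1000, ncol = 1000)
size_mib <- as.numeric(object.size(x)) / 1024^2

cat(sprintf("object: %.2f MiB, limit: %d MiB\n", size_mib, limit_mib))
if (size_mib > limit_mib) {
  warning("object exceeds future.globals.maxSize; consider a smaller dataset")
}
```

Each parallel worker presumably receives its own copy of the exported object, so total memory use can grow with the number of workers. That may be why a job is killed on a shared server after the limit is raised.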

Tianqi-Ma commented 2 years ago

BTW, I also tested on a small dataset (~200 cells), and an error was returned:


[1] "Sum single cell counts matrix:"
Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')
[1] "Number of overlapped genes:"
[1] 27626
[1] "Start clustering :"
starting worker pid=76173 on localhost:11302 at 09:38:10.134
starting worker pid=76171 on localhost:11302 at 09:38:10.134
starting worker pid=76172 on localhost:11302 at 09:38:10.135
starting worker pid=76170 on localhost:11302 at 09:38:10.136
Attaching SeuratObject
Attaching SeuratObject
Attaching SeuratObject
Attaching SeuratObject
Warning in irlba(A = t(x = object), nv = npcs, ...) :
 Warning in irlba(A = t(x = object), nv = npcs, ...) : You're computing too large a percentage of total singular values, use a standard svd instead.

  You're computing too large a percentage of total singular values, use a standard svd instead.
Warning in irlba(A = t(x = object), nv = npcs, ...) :
  You're computing too large a percentage of total singular values, use a standard svd instead.
Warning in irlba(A = t(x = object), nv = npcs, ...) :
  You're computing too large a percentage of total singular values, use a standard svd instead.
Warning in irlba(A = t(x = object), nv = npcs, ...) :
  You're computing too large a percentage of total singular values, use a standard svd instead.
Warning in irlba(A = t(x = object), nv = npcs, ...) :
  You're computing too large a percentage of total singular values, use a standard svd instead.
Warning in irlba(A = t(x = object), nv = npcs, ...) :
  You're computing too large a percentage of total singular values, use a standard svd instead.
Warning in irlba(A = t(x = object), nv = npcs, ...) :
  You're computing too large a percentage of total singular values, use a standard svd instead.
Error in checkForRemoteErrors(val) : 
  2 nodes produced errors; first error: Error: You should provide a smaller resolution!
Calls: scMAGIC_Seurat ... clusterApply -> staticClusterApply -> checkForRemoteErrors
In addition: Warning messages:
1: In eval(predvars, data, env) : NaNs produced
2: In hvf.info$variance.expected[not.const] <- 10^fit$fitted :
  number of items to replace is not a multiple of replacement length
Execution halted
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Execution halted
Execution halted
Execution halted
Execution halted
Warning message:

Maybe it would be better if important parameters (like resolution) could be passed to the functions.

Drizzle-Zhang commented 2 years ago

The earlier error was a parameter-setting problem: I hadn't considered the situation where there are fewer than four cell types.

Drizzle-Zhang commented 2 years ago

Thanks very much for your suggestion!

Tianqi-Ma commented 2 years ago

Hi,

I found another tricky issue. I was subsetting the three-cell-type dataset to 3000/2000/1000/500/300/100 cells of each type to see how many cells in the ref are enough for annotation. I initially used those subsets as ref and a randomly sampled 5000 cells as query. For example, ref (3000 cells of each type):

GM12878   HEPG2   SKBR3 
   3000    3000    3000 

query (randomly sampled 5000 cells):

GM12878   HEPG2   SKBR3
   1333    3042     625

It returns an error:

[1] "Find marker genes of cell types in reference:"
starting worker pid=55718 on localhost:11689 at 07:18:43.576
starting worker pid=55716 on localhost:11689 at 07:18:43.577
starting worker pid=55717 on localhost:11689 at 07:18:43.579
starting worker pid=55715 on localhost:11689 at 07:18:43.582
Attaching SeuratObject
Attaching SeuratObject
Attaching SeuratObject
Error in checkForRemoteErrors(val) : 
  3 nodes produced errors; first error: Cell group 2 is empty - no cells with identity class 
Calls: scMAGIC_Seurat ... clusterApply -> staticClusterApply -> checkForRemoteErrors
In addition: Warning messages:
1: In eval(predvars, data, env) : NaNs produced
2: In hvf.info$variance.expected[not.const] <- 10^fit$fitted :
  number of items to replace is not a multiple of replacement length
Execution halted
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Execution halted
Execution halted
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Execution halted
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize

The same error appears when the 100-cell ref is used.

But when I do it in reverse (3000 cells of each type as query and the randomly sampled 5000 cells as ref), no error occurs.

Actually, I always get the error when using the 3000-cells-per-type dataset as ref, while the randomly sampled 5000 cells work almost every time (well, it sometimes fails when there are fewer than 4 cells of a cell type in the query). Very tricky, isn't it?

Are there any special requirements for the ref or query dataset (the number of cell types, as you mentioned, or the number of cells)?
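
A pre-flight check along these lines might help narrow such failures down before calling scMAGIC. It is only a sketch: `check_cell_types` is a hypothetical helper, and the threshold of 4 cells is taken from the observation above, not from any documented scMAGIC requirement.

```r
# Hypothetical helper: count cells per expected type and flag types
# that are missing or fall below a minimum number of cells.
check_cell_types <- function(labels, expected, min_cells = 4) {
  # factor() with explicit levels keeps absent types as zero counts
  counts <- table(factor(labels, levels = expected))
  data.frame(
    celltype = names(counts),
    n = as.integer(counts),
    ok = as.integer(counts) >= min_cells,
    stringsAsFactors = FALSE
  )
}

# Toy labels: one type is too rare and one is missing entirely
labels <- c(rep("GM12878", 3), rep("HEPG2", 10))
res <- check_cell_types(labels, c("GM12878", "HEPG2", "SKBR3"))
print(res)
```

Running the same check on both the ref labels (for example `seurat.ref$celltype`) and the query labels would show whether a cell type was lost or nearly lost during random sampling.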

Tianqi-Ma commented 2 years ago

BTW, a new error occurred when using the 1000-cells-per-type dataset as query:

[1] "Build local reference"
starting worker pid=36169 on localhost:11257 at 08:23:52.549
starting worker pid=36172 on localhost:11257 at 08:23:52.549
starting worker pid=36171 on localhost:11257 at 08:23:52.553
starting worker pid=36170 on localhost:11257 at 08:23:52.559
Package 'mclust' version 5.4.8
Type 'citation("mclust")' for citing this R package in publications.
Package 'mclust' version 5.4.8
Type 'citation("mclust")' for citing this R package in publications.
Package 'mclust' version 5.4.8
Type 'citation("mclust")' for citing this R package in publications.
Error in checkForRemoteErrors(val) : 
  one node produced an error: missing value where TRUE/FALSE needed
Calls: scMAGIC_Seurat ... clusterApply -> staticClusterApply -> checkForRemoteErrors
In addition: There were 16 warnings (use warnings() to see them)
Execution halted
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Execution halted
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Execution halted
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Execution halted
Execution halted

Drizzle-Zhang commented 2 years ago

I guess the problem may be related to the number of cell types. Because there are usually more than 3 cell types in a real single-cell annotation scenario, I hadn't encountered this error before. I will test this case to make scMAGIC more robust.

Drizzle-Zhang commented 2 years ago

I tested some examples with a 3-cell-type reference, but I didn't encounter the error. Could you please send me the download link for these data?

Tianqi-Ma commented 2 years ago

Sure, here are the GEO IDs: GSM5709379, GSM4471657, GSM3596321

Please check the corresponding cell type when downloading them.

Drizzle-Zhang commented 2 years ago

I found that the GM12878 and SKBR3 data are raw counts while the HepG2 data is normalized, which produces NAs and then leads to the error. Although NAs are omitted in the latest scMAGIC, I suggest keeping the input format consistent.
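
One rough way to spot such a mismatch before merging datasets is to test whether every value in an expression matrix is a non-negative whole number. The sketch below uses toy base-R matrices; `looks_like_counts` is a hypothetical helper, not a scMAGIC or Seurat function.

```r
# Hypothetical helper: TRUE only if every value is finite, non-negative
# and a whole number, which is what raw count data should look like.
looks_like_counts <- function(mat) {
  vals <- as.numeric(mat)
  all(is.finite(vals)) && all(vals >= 0) && all(vals == round(vals))
}

# Toy matrices: raw counts and a log-normalized version of them
raw <- matrix(c(0, 3, 12, 0, 1, 7), nrow = 2)
normalized <- log1p(raw / 10)

looks_like_counts(raw)         # TRUE
looks_like_counts(normalized)  # FALSE
```

For a large sparse matrix it is probably cheaper to run the same test on the stored non-zero values only, rather than converting the whole matrix to a dense vector.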

Tianqi-Ma commented 2 years ago

Thanks for the reply. I will give it another shot soon.