SONGDONGYUAN1994 / scDesign3

scDesign3 generates realistic in silico data for multimodal single-cell and spatial omics
https://songdongyuan1994.github.io/scDesign3/docs/index.html
MIT License

Issue with adjusting for library size #7

Closed: emucaki closed this issue 1 year ago

emucaki commented 1 year ago

I was following the tutorial page describing how to adjust for library size: https://songdongyuan1994.github.io/scDesign3/docs/articles/scDesign3-librarySize-vignette.html

I formatted my 'sce' object identically to the sample data from the DuoClustering2018 library, where "cell_type" is a factor and "library" is numeric (no NAs). However, when I attempted to include this information in the 'mu' formula:

mu_formula = "cell_type + offset(log(library))"

I got the following error message:

Error in log(library) : non-numeric argument to mathematical function

Is there anything you could suggest to resolve this issue? I was able to avoid the error by specifying the 'sce' variable:

mu_formula="cell_type + offset(log(colData(sce)$library))"

However, I'm concerned that might cause issues in the background that I'm not able to see.

SONGDONGYUAN1994 commented 1 year ago

Hi, thank you for your interest in our work. Could you provide demo data that reproduces the error? In addition, can you run this vignette correctly on your machine?

Best, Dongyuan

emucaki commented 1 year ago

Thank you for the quick response.

My system can indeed recapitulate that vignette using the DuoClustering2018 library.

I noticed another oddity with this bug when using "mu=cell_type + offset(log(library))". With "n_cores" set to 1, I receive the "non-numeric argument" error immediately after "Start marginal fitting". However, if I bump up "n_cores" from 1 to 2, it goes past the "marginal fitting" step very quickly and then gives the following error at the "Convert Residuals" step:

Convert Residuals to Multivariate Gaussian
Error in `colnames<-`(`*tmp*`, value = rownames(sce)) : 
  attempt to set 'colnames' on an object with less than two dimensions
In addition: Warning messages:
1: In mclapply(seq_len(n), do_one, mc.preschedule = mc.preschedule,  :
  all scheduled cores encountered errors in user code
2: In mclapply(seq_len(n), do_one, mc.preschedule = mc.preschedule,  :
  all scheduled cores encountered errors in user code

Again, this does not happen when I reference the library variable explicitly as "colData(sce)$library" rather than just "library" (however, there is a warning at the end of the run; see my question below). I will send you example data directly soon.

I had two other related questions.

  1. In 'scDesign2', there was an "auto-choose" option where the software itself selected the distribution for each gene. That option has been removed from scDesign3. Is there any reason for this?

  2. When my run completes (creating simulated data adjusted for cell type and library size, with an NB distribution and a Gaussian copula), the simulated data looks good, but the run always ends with the following warning:

Warning message:
In chol.default(sigma, pivot = TRUE) :
  the matrix is either rank-deficient or indefinite

Is this anything we should be worried about? Is there a way to avoid this warning?

Thank you!

SONGDONGYUAN1994 commented 1 year ago

Hi, as we worked out over email, the bug comes from the following: any covariate you would like to use in a formula (e.g., library) must also be specified in the parameter other_covariates.
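One likely reason the message points at log(library) specifically: library is also the name of a base R function, so when a formula refers to a column that is absent from the data used for fitting, R falls back to that function. The following is a minimal base-R sketch of that mechanism, not scDesign3 code. The toy values and the names df_missing, df_present, and lib_size are hypothetical, and glm only stands in for whatever fitting routine the package uses internally.

```r
# Toy data for a single gene across six cells (all values are made up).
cell_type <- factor(c("a", "a", "b", "b", "a", "b"))
counts    <- c(3, 5, 12, 9, 4, 11)
lib_size  <- c(1000, 1500, 2000, 1800, 1200, 2100)

# Case 1: the data frame has no column called "library".
# R then resolves the symbol to the base function library(),
# and log() of a function is not defined.
df_missing <- data.frame(y = counts, cell_type = cell_type)
err <- tryCatch(
  {
    glm(y ~ cell_type + offset(log(library)),
        data = df_missing, family = poisson())
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
print(err)

# Case 2: the column is present, so the same formula fits without error.
df_present <- data.frame(y = counts, cell_type = cell_type, library = lib_size)
fit <- glm(y ~ cell_type + offset(log(library)),
           data = df_present, family = poisson())
print(coef(fit))
```

Under this reading, listing the covariate in other_covariates is what makes the column available to the formula, which would also explain why referring to colData(sce)$library directly avoids the error.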

For the other two questions:

  1. For "auto-choose", there are two main reasons: a. Computational time. Auto-choose increases the computational time roughly fourfold, since each distribution has to be fitted separately. b. For current single-cell data, there are already some conclusions about the optimal distribution. For example, multiple papers (including ours) point out that NB is good enough for UMI data. Therefore, I think users can decide on the distribution based on existing knowledge.

  2. For the warning "the matrix is either rank-deficient or indefinite": it means your cell number is smaller than your gene number, so the estimated correlation matrix is rank-deficient. This may make the estimation inaccurate; the best solution is to increase the cell number.
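The link between cell number and this warning can be reproduced in base R. The following is a minimal sketch on random toy data (5 cells and 20 genes, both sizes chosen arbitrarily), not on scDesign3 output: a correlation matrix estimated from fewer cells than genes cannot have full rank, and a pivoted Cholesky factorization of it raises a warning of the same kind. The exact wording of the warning may differ between R versions.

```r
set.seed(1)
n_cells <- 5
n_genes <- 20

# Rows are cells, columns are genes (random toy data).
x <- matrix(rnorm(n_cells * n_genes), nrow = n_cells, ncol = n_genes)

# Gene-by-gene correlation matrix estimated from only 5 cells.
sigma <- cor(x)

# Numerical rank of the 20 x 20 matrix; it is lower than the number of genes.
rank_sigma <- qr(sigma)$rank
print(rank_sigma)

# Pivoted Cholesky on the rank-deficient matrix: capture the warning text.
warn_msg <- tryCatch(
  {
    chol(sigma, pivot = TRUE)
    NA_character_
  },
  warning = function(w) conditionMessage(w)
)
print(warn_msg)

# With many more cells than genes, the same call completes without a warning.
x_big <- matrix(rnorm(200 * n_genes), nrow = 200, ncol = n_genes)
no_warning <- tryCatch(
  {
    chol(cor(x_big), pivot = TRUE)
    TRUE
  },
  warning = function(w) FALSE
)
print(no_warning)
```

In this toy setting, the warning disappears once the number of cells exceeds the number of genes, in line with the advice to increase the cell number.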