JINJINT / ESCO

Single cell simulator with gene co-expression
GNU General Public License v3.0

Providing correlation matrix from dataset #7

Open ljschumacher opened 1 year ago

ljschumacher commented 1 year ago

Reading the ESCO paper I was under the impression I could provide a real dataset and simulate data with correlation (in the copula) as estimated from the dataset. However it is not clear from the documentation how to do that, could you clarify?

Looking into the code, I can also see that when `type == "traj"` (https://github.com/JINJINT/ESCO/blob/bf6d78c653dd06a38e092611265a76286ba6dfef/R/esco-simulate.R#LL1083C10-L1083C22), the parameter `corr` is either expected to be a scalar (not a correlation matrix), or estimated using `randcor` for the differentially expressed genes. Two things that are different to my expectations:

1) One can only specify one correlation matrix, not one per cell type.
2) In the `randcor` function (https://github.com/JINJINT/ESCO/blob/bf6d78c653dd06a38e092611265a76286ba6dfef/R/utils.R#L150), a "purified gene expression dataset" is used (https://www.eurekalert.org/pub_releases/2017-11/sfn-nwa111417.php), rather than a user-defined dataset. Is there a way to change this to a dataset of my choice?

JINJINT commented 1 year ago

Hi Linus,

Thanks for reaching out! ESCO can simulate data with correlation estimated from a dataset, using the parameter `corr`. This parameter is a list of correlation matrices, which you can define yourself as follows.

If simulating one group:

    sim <- escoSimulateSingle(nGenes = 100, nCells = 50, lib.loc = 7,
                              withcorr = TRUE, verbose = FALSE,
                              corr = list(cormat))

If simulating two groups:

    sim <- escoSimulateGroups(nGenes = 200, nCells = 100,
                              group.prob = c(0.6, 0.4), deall.prob = 0.3,
                              de.prob = c(0.3, 0.7), de.facLoc = c(1.9, 2.5),
                              withcorr = TRUE,
                              corr = list(cormat_housekeep, cormat_1, cormat_2),
                              trials = 1, verbose = FALSE)

One just needs to make sure that:

nrow(cormat_housekeep) = number of housekeeping genes;

nrow(cormat_1) = number of marker genes for group 1; nrow(cormat_2) = number of marker genes for group 2.

The easiest way to ensure this is to simulate data first without specifying `corr`, and then check the dimensions of the automatically generated `corr`: slot(metadata(sim)$Params, "corr")
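As a rough illustration of where such matrices could come from, here is a minimal sketch that estimates gene-gene correlations from an expression matrix using base R only. The matrix `expr`, the gene names, and the three gene sets are hypothetical placeholders, not ESCO objects; the sketch assumes genes in rows and cells in columns, and uses plain Pearson correlation on the raw values, which may not be the best choice for real count data.

```r
# Hypothetical stand-in for a real dataset: genes in rows, cells in columns.
set.seed(1)
expr <- matrix(rpois(20 * 200, lambda = 5), nrow = 20,
               dimnames = list(paste0("gene", 1:20), NULL))

# Hypothetical gene sets. In practice their sizes must match the dimensions
# reported by slot(metadata(sim)$Params, "corr") from a run without corr.
housekeep_genes <- paste0("gene", 1:10)
marker_genes_1  <- paste0("gene", 11:15)
marker_genes_2  <- paste0("gene", 16:20)

# cor() correlates columns, so transpose to obtain gene-by-gene correlations.
gene_cor <- function(genes) cor(t(expr[genes, , drop = FALSE]))

# Same order as in the two-group call: housekeeping, group 1, group 2.
corr_list <- list(gene_cor(housekeep_genes),
                  gene_cor(marker_genes_1),
                  gene_cor(marker_genes_2))

# Dimensions to compare against the automatically generated ones.
sapply(corr_list, nrow)
```

The resulting `corr_list` would then be passed as `corr = corr_list`, provided the sizes agree with what the simulation expects.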

The `randcor` function is just a convenience: it uses a realistic dataset to generate a correlation structure automatically for users who do not want to specify the correlation structure themselves.
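When supplying your own matrices instead of relying on `randcor`, it may be worth checking that each one is a valid correlation matrix (square, symmetric, unit diagonal, no negative eigenvalues), since matrices estimated from few cells or patched together by hand can fail this. The helper below, `is_valid_cor`, is a hypothetical base-R sketch and not part of ESCO; whether ESCO performs a similar check internally is not stated here.

```r
# Hypothetical helper (not an ESCO function): TRUE if m looks like a valid
# correlation matrix up to a numerical tolerance.
is_valid_cor <- function(m, tol = 1e-8) {
  is.matrix(m) &&
    nrow(m) == ncol(m) &&
    isSymmetric(unname(m), tol = tol) &&
    all(abs(diag(m) - 1) < tol) &&
    min(eigen(m, symmetric = TRUE, only.values = TRUE)$values) > -tol
}

good <- matrix(c(1, 0.5,
                 0.5, 1), nrow = 2)

# Symmetric with unit diagonal, but the pairwise values are inconsistent,
# so the smallest eigenvalue is negative.
bad <- matrix(c( 1.0, 0.9, -0.9,
                 0.9, 1.0,  0.9,
                -0.9, 0.9,  1.0), nrow = 3)

is_valid_cor(good)  # TRUE
is_valid_cor(bad)   # FALSE
```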

The vignettes here https://github.com/JINJINT/ESCO/blob/bf6d78c653dd06a38e092611265a76286ba6dfef/vignettes/esco.Rmd contain more examples.

Best, Jinjin
