Winnie09 / Lamian

39 stars 9 forks source link

Can Lamian handle the reduced one-sample case? #19

Open szcf-weiya opened 8 months ago

szcf-weiya commented 8 months ago

Hi Dr. Hou,

Thanks for your wonderful work! I am aware that Lamian is designed for multiple scRNA samples, but quite curious about whether it can handle the reduced one-sample case. In some scenarios, although there are multiple scRNA samples, due to some reasons, only the final integrated data is accessible, which needs to be treated as one sample.

I just did some experiment by creating dummy cellanno and design variables,

cellanno1 = data.frame(Cell = names(pseudotime1), Sample = paste0("s", rep(1, length(pseudotime1))))
design = data.frame(intercept = 1, group = 1)
rownames(design) = "s1"

and then perform TDE test via

lamian_test(expr = expr,
            pseudotime = pseudotime1,
            cellanno = cellanno1, 
            test.type = "Time",
            permuiter = 5,
            design = design, ncores = 1)

It stops with an error

non-finite value supplied by optim

After digging into the code, it should be due to the initialization of alpha

https://github.com/Winnie09/Lamian/blob/7d5a8ff053fa7b489e1f55d64d0175f3dfa98b75/R/fitpt.R#L146

when there is only one sample, the var returns NA.

A natural adaption might be to assign the variance of one sample as some fixed value. Do you think it is reasonable? Or if Lamian can handle the reduced one-sample case, what's the proper way? Thanks!

Winnie09 commented 8 months ago

Hi @szcf-weiya Thank you for your interest in our work. When you mentioned "integrated data", did you refer to the integration done by methods such as Seurat or Harmony? Is so, please refer to the following texts in our supplementary material:

"The difference between any two samples can be due to true biological differences or unwanted technical differences (e.g. systematic batch effects or random measurement noises). The goal of differential expression analysis is to identify true biological differences. In order to achieve this, there are two main steps in the multiple-sample scRNA-seq differential gene expression analysis. STEP 1 (sample integration): In this step, one integrates cells from different samples together so that cells of the same type are aligned across samples. This is to make sure that when we compare different samples, we are comparing oranges with oranges (e.g. T cells from patient 1 vs. T cells from patient 2) and apples with apples (e.g. B cells vs. B cells), rather than comparing oranges with apples (e.g. T cells from patient 1 vs. B cells from patient 2). This step is accomplished using existing approaches such as Seurat(CCA) and Harmony. These approaches aim to retain biological differences across cell types, but they remove both biological differences across samples and systematic technical differences such as batch effects. Even though some people call these methods “batch correction” methods, they are actually methods that remove both batch effects and true biological differences across samples to facilitate sample integration. For this reason, we prefer to call them “sample integration” methods rather than “batch correction” methods. These methods may report corrected gene expression values after sample integration (i.e. after removing sample-level differences), but these corrected expression values cannot be directly used to study differential expression across samples (e.g. healthy vs. COVID difference) because all true biological differences between samples are removed together with batch effects. STEP 2 (differential expression analysis with batch effect correction): After matching cells of the same type across samples, STEP 2 then analyses differential expression using the matched cells. Since one cannot use the corrected expression values from STEP 1 for the reasons explained above, one has to use the original uncorrected gene expression values to perform this analysis. This is also what the authors of Seurat recommended (see the "Identify conserved cell type markers" section: https://satijalab.org/seurat/articles/integration_introduction.html, first code chunk: "For performing differential expression after integration, we switch back to the original data"). This is not saying that STEP 1 is not useful. That step is indeed crucial since after STEP 1 we now know which cells are of the same type that we can meaningfully compare across samples."

szcf-weiya commented 8 months ago

Hi Dr. Hou, thanks for your prompt and kind explanation!

Actually, I have no information about the integration step of the data. What I obtained are the expression matrix and pseudotime vector, but I want to apply the Lamian approach to this data if it can handle the reduced one-sample case. In your paper, I noticed that you have compared with one-sample methods such as tradeSeq by pooling multi-samples into a single sample,

While there currently exist pseudotime analysis methods to detect changes in gene expression along pseudotime (e.g. Monocle20–22, TSCAN23, Slingshot24 ), in cell abundance along pseudotime (e.g. milo25, DAseq26 ), and in trajectory lineages (e.g. tradeSeq27 ), none of these methods investigate changes across conditions.

In contrast, tradeSeq (which is used by Slingshot) is a method originally developed for comparing different branches of a pseudotemporal trajectory within a single sample. Here, we tailored the function to compare the same branch in a pseudotemporal trajectory between two samples. Since tradeSeq does not consider cross-sample variability, cells from replicate samples are pooled and treated as if they come from a single sample.

In contrast, I am thinking in another way. Given a single sample, can/how Lamian work? Intuitively, one sample can be viewed as multi-samples under the same conditions if we assume that those multi-samples are replications of the one-sample.

Technically, the only obstacle seems to be the calculation of the variance of one sample,

https://github.com/Winnie09/Lamian/blob/7d5a8ff053fa7b489e1f55d64d0175f3dfa98b75/R/fitpt.R#L146

Mathematically, given one observation, the question is how to specify both alpha and eta for the inverse gamma distribution.

image

Firstly we can write their mean,

eta/(alpha-1) = sigma2_s

no higher order information, so both eta and alpha are not distinguishable. But if we fix one of them, or suppose the variance is a small constant value, say 1e-5, it seems ok.

Is that proper? Am I missing something? Much appreciate your comments.

clee700 commented 3 months ago

Hello, Just wanted to follow up on this comment and see if there were any updates? I am also interested in applying Lamian in a one sample case.

Thank you! Christy