HelenaLC / muscat

Multi-sample multi-group scRNA-seq analysis tools
165 stars 33 forks source link

DP simulated data has inverted logFC and only one bimodal group #125

Open ChristophH opened 1 year ago

ChristophH commented 1 year ago

Hi,

Thanks for creating and maintaining muscat, and all your hard work that goes into method development and benchmarking!

I think there is a bug/inconsistency in the sign of the logFC of simulated data. When simulated with the 'dp' category, the logFC is applied in the opposite direction as, for example, with the 'de' or 'dm' category.

In .sim() you can see how either '-lfc', or 'lfc' is used with group 1

https://github.com/HelenaLC/muscat/blob/10cd3262bd5bc45930d62dc8f7d8b97629c3bbe4/R/utils-simData.R#L329-L352

But maybe I'm missing something.

Cheers, Christoph

HelenaLC commented 1 year ago

Great question. Yeah, so it’s admittedly been a while since writing this, but looking at the code & comments, I think the reasoning is at follows…

ChristophH commented 1 year ago

Thanks for the explanation. However, I was just trying to validate that behavior using the .sim function with some toy parameters, but could not reproduce the DP density curves. image [Crowell et al., Nat Com, 2020]

For DE and DM the curves look as expected. But for DP, only group A shows the bimodal distribution. As a result, the observed logFC is always negative, even when the input logFC is positive.

image

(For details see this R notebook)

I understand that .sim is an internal function, but I'm going down this rabbit hole as I'm trying to understand why my muscat generated data for DP genes does not look as expected.

Edit: Now I've used muscat::simData to generate data and illustrate the problem. The R notebook is here. The figure below shows one high and one low logFC gene per differential category. DP shows opposite logFC sign compared to DE and DM. Also, DP looks like DM with just one bimodal distribution.

image

Am I doing something wrong, or missing something?

ChristophH commented 1 year ago

Short recap:

  1. Genes generated under the DP model show unexpected logFC (observed logFC is almost always opposite direction of the logFC of the gene_info table in the simulation metadata)
  2. Genes generated under the DP model do not show bimodal distribution in both groups (just in one group, this explains the issue above)

To me, this seems like unexpected behavior (a bug?), or am I doing it wrong? Please advice.