Oshlack / splatter

Simple simulation of single-cell RNA sequencing data
http://oshlacklab.com/splatter/
GNU General Public License v3.0
217 stars 57 forks source link

Simulation of specific patterns of genes along the "Step" when simulating paths. #170

Closed spriyansh closed 7 months ago

spriyansh commented 7 months ago

Hello Splatter team,

Thank you for creating an excellent tool and actively maintaining it. I want to simulate genes which change their expression as a function of "Step" when performing the simulation using the "Path" method. The goal is to generate sets of genes with specific trends with "Step". Considering the following setup:

graph LR;
    A-->|Path-1|B;
    B-->|Path-2|C;
    C-->|Path-3|D;
    D-->|Path-4|E;

Based on what is described in Issue #57, I expect a gene with an overall increasing trend to exhibit an increasing trend across all paths and to have an increasing fold-change in successive paths.

I calculated the change in expression for each path and their "localDEFac" as follows:

## Extract the expression truth
ground_truth_all <- rowData(sim_sce) %>% as.data.frame()

# Rename the columns
ground_truth_local <- ground_truth_all[, c("Gene", "BaseGeneMean", "LocalDEFacPath1", "LocalDEFacPath2", "LocalDEFacPath3", "LocalDEFacPath4")]
colnames(ground_truth_local) <- c("Gene", "Base", "Path1", "Path2", "Path3", "Path4")

# Calculate  expression
ground_truth_local$Path1LocalExp <- ground_truth_local$Base * ground_truth_local$Path1
ground_truth_local$Path2LocalExp <- ground_truth_local$Path1LocalExp * ground_truth_local$Path2
ground_truth_local$Path3LocalExp <- ground_truth_local$Path2LocalExp * ground_truth_local$Path3
ground_truth_local$Path4LocalExp <- ground_truth_local$Path3LocalExp * ground_truth_local$Path4
ground_truth_local <- ground_truth_local[, c("Gene", "Base", "Path1LocalExp", "Path2LocalExp", "Path3LocalExp", "Path4LocalExp"), drop = FALSE]

Now for the following genes

ground_truth_local[c("Gene79", "Gene7", "Gene6", "Gene38"),]
>         Gene       Base Path1LocalExp Path2LocalExp Path3LocalExp Path4LocalExp
Gene79 Gene79 1.07266298   2.028732808  26.686113586  4.938131e+01  1.969886e+02
Gene7   Gene7 0.06713981   0.007306483   0.002736674  1.389706e-03  3.796306e-04
Gene6   Gene6 1.27780572  19.856999074  41.111765669  1.365681e+02  6.363990e+02
Gene38 Gene38 2.28456347   4.237341176   4.994231232  6.597174e+01  1.058036e+02

Based on their expression, I would expect different patterns. For example:

However, with such calculations, it's challenging to confidently identify the exact trend for all genes. I think this approach only holds for genes with a decreasing trend along the "Steps." The calculated spline does not accurately follow the expected trend. image

Do you think it's possible to simulate such data in Splatter? I am interested in simulating/filtering genes that follow steps like increasing, decreasing, convex, and concave patterns. I would appreciate your guidance in the right direction. Thank you in advance.

Additional Info:

  1. I set "path.nonlinearProb" and "path.sigmaFac" to 0 to avoid generating any non-linear patterns within a path. Instead, I generate 4 paths, one after the other, to observe different patterns.
  2. The Step is modified, and the steps of each successive path are added to the previous path's steps to behave like pseudotime.
  3. I have attached a script with the plotting function and the simulation parameters example_script.txt.
lazappi commented 7 months ago

Hi @spriyansh. Thanks for the interesting analysis. I think you might be pushing the limits of what the simulation is designed to do but I can't see any obvious problems with what you have done and it should produce something like the effect you are looking for. Can you please trying plotting the BaseCellMeans or CellMeans assay instead? That will remove some of the other steps which might be confusing things. Setting lib.scale = 0 might also help.

Is there a reason you close to do it this way rather than using the variation built into the paths? You should be able to get something pretty similar that way.

spriyansh commented 7 months ago

Hi @lazappi, thank you for the prompt response.

I want to keep track of the changes in the paths along the steps. That's why I put one path after another, so I can know precisely when the change happens in a path. Correct me if I'm wrong, but based on the documentation, I believe setting the path.nonlinearProb and path.sigmaFac will surely introduce local variations in a path. However, the record of where the change is introduced is not kept; only the magnitude and probability of the change are. For my specific analysis, I want to know the locations of the change. The goal is to simulate and annotate trends like increases, decreases, convex and concave shapes, etc.

Also, the parameters that I am using for the simulation are learned from the 10x V2 dataset published by Setty et al., 2019. This dataset is produced using a snapshot experiment of differentiating cells. Here is a quick summary of the non-default parameters learned from a real dataset:

Batches: 
    [BATCHES]  [BATCH CELLS]     [Location]        [Scale]       [REMOVE] 
            1            200            0.1            0.1           TRUE 

Mean: 
          (RATE)           (SHAPE) 
2.85866981975563  3.33448187687623 

Library size: 
       (LOCATION)            (SCALE)             (Norm) 
 9.03025915193446  0.364613373259578              FALSE 

Exprs outliers: 
    (PROBABILITY)         (LOCATION)            (SCALE) 
                0   2.19848115217974  0.966580327002644 

Groups: 
     [GROUPS]  [Group Probs] 
            1              1 

Diff expr: 
[PROBABILITY]    [Down Prob]     [LOCATION]        [SCALE] 
            1            0.5              1              1 

BCV: 
    (COMMON DISP)              (DOF) 
0.153925836596722   39.7270706976494 

Dropout: 
            [TYPE]          (MIDPOINT)             (SHAPE) 
        experiment  -0.314888109220462   -1.29948251874868 

Paths: 
        [From]         [STEPS]          [Skew]    [NON-LINEAR]  [SIGMA FACTOR] 
             0              50             0.5               0               0

Since the idea is to produce a simulation as close as possible to real data, is it a good idea to set lib.scale = 0?

Using the above parameters, I create plots for a gene expressed in around 5000 cells (script attached). As you suggested, I plotted the CellsMeans; indeed, the patterns are more evident now.

 > Gene    Base Path1LocalExp Path2LocalExp Path3LocalExp Path4LocalExp
Gene50 Gene50 1.72121     0.7448674     0.1353784     0.2917269      1.667929

image

Do you think this approach is robust enough to be able to annotate patterns like this, or is there something else I can manage with path.nonlinearProb and path.sigmaFac to introduce and track changes in a path? As stated above, the goal is to simulate and annotate trends like increases, decreases, convex and concave shapes, etc. example_script.txt

lazappi commented 7 months ago

I want to keep track of the changes in the paths along the steps. That's why I put one path after another, so I can know precisely when the change happens in a path. Correct me if I'm wrong, but based on the documentation, I believe setting the path.nonlinearProb and path.sigmaFac will surely introduce local variations in a path. However, the record of where the change is introduced is not kept; only the magnitude and probability of the change are. For my specific analysis, I want to know the locations of the change. The goal is to simulate and annotate trends like increases, decreases, convex and concave shapes, etc.

Yes, I think you are right. The information about the base expression at each step along a path currently isn't stored in the file object which makes it difficult to check exactly how individual genes are behaving.

Since the idea is to produce a simulation as close as possible to real data, is it a good idea to set lib.scale = 0?

This was just a suggestion to make it easier to see what was happening with genes along the path. For a real simulation you would definitely want variation in the library size.

Do you think this approach is robust enough to be able to annotate patterns like this, or is there something else I can manage with path.nonlinearProb and path.sigmaFac to introduce and track changes in a path? As stated above, the goal is to simulate and annotate trends like increases, decreases, convex and concave shapes, etc. example_script.txt

If you are only simulating linear changes it should be enough to look at the recorded path DE factors as the ground truth of how a gene changes (like you did in the original post). However, these changes might not be visible in the final counts due to noise introduced by other parts of the simulation.

If it would be helpful, it would be a relatively minor change to add the factors for every step along the paths to the exported object. This only really provides more information for non-linear changes though.

spriyansh commented 7 months ago

Thank you for the response; it answers all of my questions.