simulation_type_knockdown adds more cells to KO

HectorRDB commented 3 years ago

Hi, I have been following the knockout vignette from the devel branch and I noticed a weird behavior If I run, with no effect (multiplier is 1)

  backbone <- backbone_bifurcating()
  model_common <-
    initialise_model(
      backbone = backbone,
      num_cells = 100,
      num_tfs = nrow(backbone$module_info),
      num_targets = 250,
      num_hks = 250,
      simulation_params = simulation_default(
        census_interval = 10, 
        ssa_algorithm = ssa_etl(tau = 300 / 3600),
        experiment_params = simulation_type_wild_type(num_simulations = 2)
      )
    ) %>%
    generate_tf_network() %>%
    generate_feature_network() %>% 
    generate_kinetics() %>%
    generate_gold_standard()

  model_wt <- model_common %>%
    generate_cells()
  b3_genes <- model_common$feature_info %>% filter(module_id == "B3") %>% pull(feature_id)
  model_ko <- model_common
  model_ko$simulation_params$experiment_params <- simulation_type_knockdown(
    num_simulations = 2,
    timepoint = 0, 
    genes = b3_genes,
    num_genes = length(b3_genes),
    multiplier = 1
  )
  model_ko <- model_ko %>%
    generate_cells()
  model_comb <-
    combine_models(list(WT = model_wt, KO = model_ko)) %>% 
    generate_experiment()

I don't get identical datasets:

table(table(str_detect(model_comb$simulations$meta$from, "WT")))

FALSE  TRUE 
  186   184

ggplot(df, aes(x = time, fill = str_detect(from, "WT"))) + geom_density()
ggplot(df, aes(x = sim_time, fill = str_detect(from, "WT"))) + geom_density(alpha = .5)

I am missing something? Best

rcannood commented 3 years ago

Hey Hector!

What do you expect to happen? That you get identical cell profiles as output? Or that you get the same number of cells from both WT and KO?

I assume you set the num_simulations to 2 for the purpose of the reproducible example. If not, you should probably increase the number of simulations.

Do the sections below answer your question?

Robrecht

Identical cell profiles

If you're expecting identical cell profiles: The simulations are still random. For instance, if I run the code above and visualise the model with plot_gold_mappings(model_comb), I get the following plot.

Code

```r library(tidyverse) library(dyngen) set.seed(1) backbone <- backbone_bifurcating() model_common <- initialise_model( backbone = backbone, num_cells = 100, num_tfs = nrow(backbone$module_info), num_targets = 250, num_hks = 250, simulation_params = simulation_default( census_interval = 10, ssa_algorithm = ssa_etl(tau = 300 / 3600), experiment_params = simulation_type_wild_type(num_simulations = 2) ) ) %>% generate_tf_network() %>% generate_feature_network() %>% generate_kinetics() %>% generate_gold_standard() model_wt <- model_common %>% generate_cells() b3_genes <- model_common$feature_info %>% filter(module_id == "B3") %>% pull(feature_id) model_ko <- model_common model_ko$simulation_params$experiment_params <- simulation_type_knockdown( num_simulations = 2, timepoint = 0, genes = b3_genes, num_genes = length(b3_genes), multiplier = 1 ) model_ko <- model_ko %>% generate_cells() model_comb <- combine_models(list(WT = model_wt, KO = model_ko)) %>% generate_experiment() plot_gold_mappings(model_comb) ```

Rplot

The first frame is the gold standard for both WT and KO, so there's no difference between them. The second and third frames are the two WT simulations, whereas the fourth and fifth are the KO simulations. With this specific seed, one WT simulation went to the end state on the left and one went to the right, whereas both KO simulations went to the right. If you increase the number of simulations, on average you will find the same kinds of cell profiles, but they will still be different, of course.

Same cell numbers

If you're simply expecting the same number of cells from KO as from WT, I've just pushed a commit to dyngen@devel that allows you to run the cell sampling experiment before combining the models. However, this will yield twice the number of cells specified by the num_cells parameter.

Example

```r set.seed(1) backbone <- backbone_bifurcating() model_common <- initialise_model( backbone = backbone, num_cells = 50, num_tfs = nrow(backbone$module_info), num_targets = 250, num_hks = 250, simulation_params = simulation_default( census_interval = 10, ssa_algorithm = ssa_etl(tau = 300 / 3600), experiment_params = simulation_type_wild_type(num_simulations = 2) ) ) %>% generate_tf_network() %>% generate_feature_network() %>% generate_kinetics() %>% generate_gold_standard() model_wt <- model_common %>% generate_cells() %>% generate_experiment() b3_genes <- model_common$feature_info %>% filter(module_id == "B3") %>% pull(feature_id) model_ko <- model_common model_ko$simulation_params$experiment_params <- simulation_type_knockdown( num_simulations = 2, timepoint = 0, genes = b3_genes, num_genes = length(b3_genes), multiplier = 1 ) model_ko <- model_ko %>% generate_cells() %>% generate_experiment() model_comb <- combine_models(list(WT = model_wt, KO = model_ko)) model_comb$experiment$cell_info$model %>% table ```

HectorRDB commented 3 years ago

Hey Robrecht, Thanks for the quick answer. Sorry, I should have been clearer on my first issue.

Yes indeed, I set 2 here so the code can run faster, I set it to higher values in practice.
I indeed have two issues but i think they are related:

The KO experiment always has more cells. Indeed, here, each category of model_comb$simulations$meta$simulation_i will have exactly 92 cells (if its' in WT) and 93 (if it's in KO). Moreover, this number seem to be a function of the census_interval interval argument, not num_cells.

This "extra" cell in KO always has a sim_time variable of 0, i.e. it's not at all random: the second plot I showed is identical if I set n_simulations to 100 or even 500. This means that the two datasets are not fully comparable. I've done checks comparing the two distributions of sim_time (and time since I am unsure which of those two variables should reflect the truth) when multiplier is 1 and the distribution of p-values is very much skewed toward zero.

I might be misunderstanding the various outputs from the simulation but this all seems weird to me. Thanks for the help

HectorRDB commented 3 years ago

Ok, I think I found the issue and it was on my end... I set the seed before all simulations, which explain why the time distribution was unmoving. Just changing the seed fixes the issue. This solves the imbalance issue. I still have the question of the number of cells. It seems to be a function of the census_interval interval argument, not num_cells. Thanks

rcannood commented 3 years ago

Good to hear you solved the issue about the time distribution.

With regard to the number of cells: the number of samples that are being generated throughout the simulation is equal to:

total_simtime <- simtime_from_backbone(model$backbone)
sample_interval <- model$simulation_params$census_interval
num_simulations <- nrow(model$simulation_params$experiment_params)

num_samples <- floor(total_simtime / sample_interval) * num_simulations

If num_samples < model$numbers$num_cells, then there will be only num_samples cells, otherwise the full amount will be sampled. Maybe I should let dyngen show a warning when this occurs. If it does, simply decrease the census interval a bit.

Can you confirm that this was indeed the case?

Robrecht

HectorRDB commented 3 years ago

Thanks!! So with the parameters from above, I get num_samples is 138, which is higher than model$numbers$num_cells (100).

I then generate the wild type (model_wt <- model_common %>% generate_cells() ) but this object has 178 cells, not 100 or even 138

nrow(model_wt$simulations$dimred)
[1] 178

Am I missing something?

rcannood commented 3 years ago

Hey @HectorRDB,

Sorry for the delay, I finally got around to taking a look at this problem.

I think dyngen was actually generating fewer cells than expected (in the first code example, we expect model$simulations$meta to have 200 rows since generate_experiment() is being called twice).

I solved a bug fix and pushed the fix to devel. I'll publish the fix on CRAN asap.

Does this solve your problem?

Robrecht

rcannood commented 3 years ago

Also, nrow(model_wt$simulations$dimred) should be much greater than num_cells (ideally, >×10). If it isn't, then the census_interval needs to be decreased or the number of simulations should be increased (See ?simulation_default).

HectorRDB commented 3 years ago

Hi @rcannood I won't have time in the next few days to look at it but I'll keep you updated when I get around to it. Thanks for checking it out though. best

rcannood commented 3 years ago

Hey @HectorRDB,

Do you still want to get back to this issue or shall I just close it?

Kind regards, Robrecht

HectorRDB commented 3 years ago

You can close it, I took a quick look and it indeed seem to work. Thanks a lot for the help

rcannood commented 3 years ago

:+1:

dynverse / dyngen

simulation_type_knockdown adds more cells to KO #26

Identical cell profiles

Same cell numbers