dynverse / dyngen

Simulating single-cell data using gene regulatory networks 📠
https://dyngen.dynverse.org
Other
73 stars 6 forks source link

documentation #17

Closed koenvandenberge closed 4 years ago

koenvandenberge commented 4 years ago

The framework is extremely useful for evaluating methods in the context of trajectory inference and gene regulatory networks. The default plots as part of the generate_dataset and related functions are interesting and provide a good summary of the simulated data.

There is a lot of information on those plots where I do not find documentation about, while they would be useful for accurate benchmarking. I apologize if I have overlooked where the documentation resides, but I did not find it.

A few examples, say dataset is a simulated object generated using generate_dataset, and model is the model used.

Thanks!

rcannood commented 4 years ago

Hi Koen!

There are indeed still a lot of undocumented features. I'll leave this issue open until I get around to going over all of the functions and outputs to see if the documentation is sufficient.

The plots show progression of submodules (e.g., A1 to A6). These relate to sets of transcription factors and may be retrieved using model$feature_network. It would be useful if we could also know for which cells these systems are most active. What do the variables strength, cooperativity relate to in model$feature_network?

In the devel branch, cooperativity has been renamed to hill. The strength parameter will determine the degree to which that regulator regulates a target in comparison to other regulators. The hill coefficient determines the steepness of the regulatory effect of a regulator w.r.t. its expression.

dataset$milestone_percentages provides a coarser resolution of that (i.e. assigns cells to module A/B/C/..). But it is unclear what exactly the percentage column is referring to.

dataset$progressions has columns cell_id, from, to, percentage. If one of its rows contains the values [C1, A, B, 0.6], then that would mean cell C1 has transitioned 60% from milestone A to B. There is a 1-to-N relation between a row from dataset$progressions and dataset$milestone_network. The corresponding values in dataset$milestone_percentages with columns cell_id, milestone_id and percentage would be [C1, A, 0.4] and [C1, B, 0.6].

For benchmarking purposes in the context of network inference, it would be useful to have a measure that summarizes the contribution of each previous module/TF, e.g. the expression in cell10 is mainly driven by modules X and Y, or TFs J and K. I assume that there should be some sort of decay function (e.g., cells in module A5 will be affected by modules A4, but also A3 and A2; can we summarize this in an appropriate measure?)

With simulation_default(store_grn = TRUE) you can calculate the effect of a regulator on a target. This was used to generate Figure 5 in the bioRxiv paper. Is this what you are looking for?

Kind regards, Robrecht

koenvandenberge commented 4 years ago

Many thanks, Robrecht! This has already cleared up a lot. I realize that writing documentation is a long, tedious, and not a very exciting process...

With simulation_default(store_grn = TRUE) you can calculate the effect of a regulator on a target. This was used to generate Figure 5 in the bioRxiv paper. Is this what you are looking for?

This might be what I need, let me check with you to be sure. It provides the GRN for each cell (I assume that this is the model$experiment$regulation object?). In this regulation object, rows are cells and columns are TF-target interactions. I assume that the elements in the matrix are strengths of the interaction. Would it be fair to say that the general activity/contribution of a particular TF (say, A1_TF1) in a particular cell (say, cell1) is the sum of all the columns of regulations that point from A1_TF1 to any target? For example, if the regulation matrix is as follows

      tf1_tox tf1_toy tf2_toz tf2_toy
cell1   0   3   6   9
cell2   1   4   7  10
cell3   2   5   8  11

The contribution of tf1 in cell1 would be 0+3=3, in cell2 it would be 1+4=5, etc.

If this makes it easier to look at, given a simulated dataset (small_linear in my case), this would be sample R code to achieve this:

# assume that the regulation object is the GRN in each cell.
regulation <- model$experiment$regulation
regulation_from <- strsplit(colnames(regulation), split="_")
# only look at TFs
keep <- unlist(lapply(regulation_from, "[[", 2)) %in% unique(model$feature_info$module_id)
regulation_from <- regulation_from[keep]
id <- unlist(lapply(regulation_from, function(x) paste(x[2:3], collapse="_")))
regulationFiltered <- regulation[,keep]

# calculate TF activity in each cell
R <- matrix(0, nrow=200, ncol=ncol(counts),
            dimnames=list(model$feature_info$feature_id, colnames(counts)))
uniqID <- unique(id)
for(ii in 1:length(uniqID)){
  curTF <- uniqID[ii]
  curCols <- which(id == curTF)
  if(length(curCols) < 1) next
  R[curTF,] <- rowSums(regulationFiltered[,curCols,drop=FALSE])
}
koenvandenberge commented 4 years ago

Hi,

Just checking in whether you've been able to look into the question above yet.

Thanks in advance and hope you're doing well. Best, Koen

rcannood commented 4 years ago

Hello Koen,

I'm really sorry, I must have missed the notification of your issue from Github!

In the meantime, you might have noticed on the dyngen@devel branch that store_grn was renamed to compute_cellwise_grn, and model$experiment$regulation was renamed to model$experiment$cellwise_grn. Sorry about that (but at least this should already answer your next question ;)). I've been trying to get dyngen and the dyngen manuscript prepped for submission, which is taking longer than it should, given the current situation.

This might be what I need, let me check with you to be sure. It provides the GRN for each cell (I assume that this is the model$experiment$regulation object?). In this regulation object, rows are cells and columns are TF-target interactions. I assume that the elements in the matrix are strengths of the interaction.

Exactly. As part of the simulation, the propensity of each reaction needs to be calculated; that is, the likelihood that the reaction will occur during an infinitesimal period of time. The propensity of the transcription reaction of a given gene depends on the abundance levels of the proteins of the gene's regulators. What is being stored in the model$experiment$regulation object is the difference in propensity if a TF is at its current abundance level (e.g. 10000) versus if there is none (0). To put it a little more formally, given that a cell with a particular state, the regulatory effect of tf1 on a target gene is:

regulation(target, tf1) = P(transcription_{target} | protein_{tf1} = 1000) - P(transcription_{target} | protein_{tf1} = 0)

If tf1 inhibits the target, this value should be negative, whereas otherwise it is positive.

Would it be fair to say that the general activity/contribution of a particular TF (say, A1_TF1) in a particular cell (say, cell1) is the sum of all the columns of regulations that point from A1_TF1 to any target?

Hm. That's an interesting question. It's a funny way of looking at the data that I had not foreseen. I guess this should indeed work. Take into account that if (for example) A1_TF1 upregulates B1_TF1, and if A1_TF1 eventually stops being expressed, and B1_TF1 as well, then over the whole history of the cell A1_TF1 will have resulted in a lot of B1_TF1 even though neither A1_TF1 or B1_TF1 are expressed anymore.

Cool idea! What are you planning on doing with it?

If this makes it easier to look at, given a simulated dataset (small_linear in my case), this would be sample R code to achieve this:

Since you posted this quite a while back, your code doesn't work anymore on my current installation of dyngen :(

I tried to interpret your code. Does this produce the result that you expect?

Expand code ```r library(tidyverse) library(dyngen) # generate a dataset backbone <- backbone_linear_simple() model <- initialise_model( backbone = backbone, num_tfs = 10, num_targets = 10, num_hks = 10, download_cache_dir = "~/.cache/dyngen", num_cores = 8, simulation_params = simulation_default( census_interval = 1, compute_cellwise_grn = TRUE ) ) %>% generate_tf_network() %>% generate_feature_network() %>% generate_kinetics() %>% generate_gold_standard() %>% generate_cells() %>% generate_experiment() # only look at TFs feature_info <- model$feature_info tf_info <- feature_info %>% filter(is_tf) # retain only the interactions between TFs feature_network <- model$feature_network %>% mutate(name = paste0(from, "->", to)) tf_network <- feature_network %>% filter(to %in% tf_info$feature_id) # fetch cellwise grn cellwise_grn <- model$experiment$cellwise_grn # this should be true assertthat::assert_that(all(colnames(cellwise_grn) == feature_network$name)) # retain only the interactions between TFs cellwise_grn_filtered <- cellwise_grn[, tf_network$name] # calculate TF activity in each cell trafo <- reshape2::acast(tf_network %>% mutate(value = 1), name~from, value.var = "value", fill = 0) %>% dynutils::expand_matrix(rownames = tf_network$name, colnames = tf_info$feature_id) assertthat::assert_that(all(colnames(cellwise_grn_filtered) == rownames(trafo))) R <- abs(cellwise_grn_filtered) %*% trafo R[1:5,] ``` Returns: ``` > R[1:5,] 5 x 10 Matrix of class "dgeMatrix" M1_TF1 M2_TF1 M2_TF2 M3_TF1 M3_TF2 M3_TF3 M4_TF1 M5_TF1 M5_TF2 M5_TF3 cell1 1.852856 0.2507472 0.1464237 0.09954539 0 0.02314177 0.57480947 0 0 0 cell2 1.870044 0.3174448 0.3349856 0.36279096 0 0.14165272 0.06419336 0 0 0 cell3 1.880757 0.4781971 1.0549135 0.23378798 0 0.13312466 0.08439904 0 0 0 cell4 1.900759 0.4253304 0.7792275 0.26639360 0 0.13186285 0.04116135 0 0 0 cell5 1.898433 0.2954491 0.2720817 0.26360398 0 0.16543618 0.12319928 0 0 0 ```

The code above will not make a distinction between activating and repressing activity, because otherwise (removing the abs(...) around cellwise_grn_filtered), if M1_TF1 upregulates one gene and downregulates another, these values will cancel each other out in R.

Kind regards, Robrecht

koenvandenberge commented 4 years ago

Hi Robrecht,

No worries, I completely understand. Your answer and accompanying code has provided exactly what I needed, thanks a lot for taking the time to do this! I will send an e-mail regarding how we are planning to use it.

Best, Koen

rcannood commented 4 years ago

I believe this issue to be solved, closing this issue.