janursa opened this issue 4 weeks ago
No problem. If y is observed expression of all genes under a perturbation, yhat is the prediction for all genes under that perturbation, and x is expression of all genes with no perturbation, we are computing the correlation between yhat-x and y-x. The code doing that is here.
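The per-perturbation delta correlation described above can be sketched as follows (a minimal illustration with synthetic data, not the repo's actual code; all variable names are hypothetical):

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical setup: 8 perturbations, 50 genes.
rng = np.random.default_rng(0)
x = rng.normal(size=50)                                # control expression, one value per gene
delta_true = rng.normal(scale=0.5, size=(8, 50))       # true per-perturbation changes
y = x + delta_true                                     # observed post-perturbation expression
yhat = x + 0.9 * delta_true + rng.normal(scale=0.1, size=(8, 50))  # a good prediction

# Correlation between predicted change (yhat - x) and observed change (y - x),
# computed separately for each perturbation (each row).
per_pert_corr = np.array([pearsonr(yh - x, yo - x)[0] for yh, yo in zip(yhat, y)])
```

One correlation per sample then gets summarized (e.g. averaged) across perturbations.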
Thanks for the prompt answer, Eric. So, you calculate the correlation for each sample (perturbation) separately. Have you tried metrics designed for multi-target predictions, for example an R^2 score that receives the full n_samples * n_genes matrix? This is important because in the datasets you have used, the variance within samples is more dominant than the variance between samples. Consequently, if you take the mean across samples and calculate the correlation within samples, you will always get a high correlation. It's almost trivial that the mean approach would beat the others, as it gets the strongest information in your dataset, which the other models are not getting. No?
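The multi-target metric being suggested could be computed like this (a sketch with synthetic data; the `r2_score` usage is standard scikit-learn, but the shapes and names are illustrative):

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
n_samples, n_genes = 20, 100
y_true = rng.normal(size=(n_samples, n_genes))
y_pred = y_true + rng.normal(scale=0.5, size=(n_samples, n_genes))

# Treat every (sample, gene) entry as one observation...
r2_flat = r2_score(y_true.ravel(), y_pred.ravel())
# ...or compute one R^2 per gene (column) and average across targets.
r2_multi = r2_score(y_true, y_pred, multioutput="uniform_average")
```

The two variants penalize a constant-per-gene baseline differently from a within-sample correlation, which is the point of the suggestion.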
I see your point that correlation has a very different baseline expectation depending on whether it's computed within each sample or within each gene. We haven't tried your exact suggestion yet, but we have used squared error and absolute error. Those come out the same whether they are computed row-first or column-first.
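The orientation-invariance of squared error is easy to verify numerically (a toy check; with a full matrix and equal group sizes, averaging rows first or columns first gives the same number):

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(size=(10, 30))      # 10 samples x 30 genes, synthetic
yhat = rng.normal(size=(10, 30))

# Average per-sample squared errors (row-first)...
mse_row_first = np.mean([np.mean((y[i] - yhat[i]) ** 2) for i in range(y.shape[0])])
# ...or average per-gene squared errors (column-first): identical result.
mse_col_first = np.mean([np.mean((y[:, j] - yhat[:, j]) ** 2) for j in range(y.shape[1])])
```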
Regarding the info that the methods get:
It's almost trivial that the mean approach would beat the others, as it gets the strongest information in your dataset, which the other models are not getting. No?
All of the methods receive access to the full training data, plus the post-perturbation expression of the gene or genes that were perturbed in each sample. The mean is implemented as just another regression method in the code too; it receives the data in the same type of function call as does e.g. LASSO or ridge regression. The code is not as clean as I'd like but that happens here.
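The "mean as just another regression method" setup can be sketched with scikit-learn's `DummyRegressor`, which exposes the same fit/predict interface as any other estimator (an illustration of the design, not the repo's actual wrapper):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
X_train = rng.normal(size=(40, 5))    # hypothetical features
Y_train = rng.normal(size=(40, 10))   # hypothetical multi-gene targets

# The mean baseline and a "real" model go through identical function calls.
baseline = DummyRegressor(strategy="mean").fit(X_train, Y_train)
ridge = Ridge(alpha=1.0).fit(X_train, Y_train)

baseline_pred = baseline.predict(X_train[:3])  # every row is the column means
ridge_pred = ridge.predict(X_train[:3])
```

Keeping the baseline behind the same interface guarantees it sees exactly the same data as the competing regressors.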
Thank you, Eric, for the thoughtful response. If you calculate the R^2 scores for the mean baseline across n_samples * n_genes, you’d likely get a result near zero. Other methods might perform worse, but the mean approach isn’t particularly effective either; it’s simply a matter of the metric used and the specific datasets that make it seem particularly strong.
Regarding your regression setup, please let me know if I’ve understood correctly. My understanding is that you use the mean expression data from the controls and adjust it to reflect the perturbed gene (either setting it to zero for knockouts or to the observed value for knockdowns and chemical perturbations). You then use this as the feature space for regression to predict the perturbed expressions (the log fold change). The issue here is that your control expressions are constant across the feature space, with only the perturbed genes as the varying elements. Is that right?
To verify, it might help to try an approach that discards the control expressions, setting only the perturbed genes as the feature space—like a one-hot encoding (if you have tried this already, I’d be interested to know the outcome). So, if my understanding is correct, you’re expecting the model to predict the perturbation outcomes based on the source of perturbation alone. This would indeed be a tough challenge for a regression model unless there are enough combinatorial perturbation settings for the model to learn effectively.
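The two feature spaces discussed above (control expression with only the perturbed gene overwritten, versus a pure one-hot encoding) might look like this (a toy sketch; gene indices and values are made up):

```python
import numpy as np

n_genes = 5
control_mean = np.array([1.0, 2.0, 0.5, 3.0, 1.5])  # hypothetical control expression
perturbed = [0, 3]       # index of the gene targeted in each of two samples
observed = [0.0, 0.8]    # post-perturbation value of the target (0.0 = knockout)

# Option A: control expression, with only the targeted gene overwritten.
X_control = np.tile(control_mean, (len(perturbed), 1))
for row, (g, v) in enumerate(zip(perturbed, observed)):
    X_control[row, g] = v

# Option B: discard control expression; encode only which gene was perturbed.
X_onehot = np.zeros((len(perturbed), n_genes))
X_onehot[np.arange(len(perturbed)), perturbed] = 1.0
```

In option A, all columns except the perturbed one are constant across samples, which is the concern raised above.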
Thanks for taking an interest in our work! I am really happy to be having these discussions. It's a complicated area and I am sure our work will benefit from more perspectives.
We have not tried this one-hot encoding idea exactly. The closest we have come is including certain methods whose predictions do not depend on the starting expression state: GEARS and the mean/median/empty network baselines.
There is one complication, which is always hard for me to express properly. What you describe is always the case when we go to make predictions: features all match the control expression, except for entries where that gene was directly targeted in that sample. But during training, we offer the user a choice for feature construction. One option is to pick a control sample and modify only the perturbed gene, similar to how predictions are made. The other option is to use the training data as-is under a steady state assumption, which would potentially have a lot more informative variation in the feature matrix. The user makes this choice via the parameter we call matching_method.
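The two training-time feature constructions could be sketched as one function (hypothetical names and signature, chosen for illustration; the repo's actual `matching_method` handling differs):

```python
import numpy as np

def build_features(train_expr, control_mean, perturbed_idx, perturbed_val,
                   matching="steady_state"):
    """Sketch of the two feature-construction choices.

    matching="control": start every row from a control profile and overwrite
    only the perturbed gene, mirroring how predictions are made.
    matching="steady_state": use each training sample's own expression as-is,
    keeping the training data's variation in the feature matrix.
    """
    if matching == "control":
        X = np.tile(control_mean, (len(perturbed_idx), 1))
    else:
        X = train_expr.copy()
    # In both cases, the directly targeted gene is set to its perturbed value.
    X[np.arange(len(perturbed_idx)), perturbed_idx] = perturbed_val
    return X

# Tiny example: two samples, three genes, gene 0 and gene 1 knocked out.
control = np.array([1.0, 2.0, 3.0])
train = np.array([[1.1, 2.2, 2.9], [0.9, 1.8, 3.1]])
X_ctrl = build_features(train, control, np.array([0, 1]), np.array([0.0, 0.0]),
                        matching="control")
X_ss = build_features(train, control, np.array([0, 1]), np.array([0.0, 0.0]),
                      matching="steady_state")
```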
My intuition is that these choices estimate different parameters: option one estimates total or long-term effects, and option two estimates direct or instantaneous effects. Since the predictions require total or long-term effects, we also follow CellOracle in exposing a parameter we call prediction_timescale
that allows the user to predict F(F(... F(X)...)) instead of just F(X). This is what we experiment with in figure 3's rightmost column. It's still not totally clear to me how this should be handled and I would definitely consider experimenting with it further.
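The iterated prediction F(F(... F(X)...)) amounts to repeatedly applying a one-step model (a generic sketch with a toy model standing in for the trained predictor; `prediction_timescale` here is just an iteration count, an assumption about the interface):

```python
import numpy as np

def iterate_predictions(predict, x0, prediction_timescale=3):
    """Apply a one-step model repeatedly: F(F(... F(x0)...))."""
    x = x0
    for _ in range(prediction_timescale):
        x = predict(x)
    return x

# Toy one-step model: expression decays halfway toward zero each step.
x_final = iterate_predictions(lambda x: 0.5 * x, np.ones(4), prediction_timescale=3)
```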
Re The other option is to use the training data as-is under a steady state assumption, which would potentially have a lot more informative variation in the feature matrix.
: can you elaborate on this? Do you use the training data in feature construction with some kind of target encoding to take into account the targeted gene, or how? By the way, would it be possible for you to share the single-cell data of replogle2? If so, please kindly write me at jalil.nourisa@gmail.com
RE: target encoding for the "steady state" matching scheme: we train a model to predict each gene's expression from the other genes' expression, with additional procedures to account for the perturbations themselves.
RE: replogle2, I'm sorry to say I do not have the single cell data for replogle2. In this project we only worked with the pseudo-bulk profiles. This prevented us from running GEARS on replogle2, which is unfortunate. But, even though you cannot train GEARS without scRNA data, the predictions from GEARS are the same across any group of cells with the same set of genes perturbed. So no essential information is lost.
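The "predict each gene from the other genes" scheme mentioned above can be sketched as one regression per gene (an illustration with synthetic data and an arbitrary choice of ridge regression; not the repo's actual training loop):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)
expr = rng.normal(size=(30, 6))   # hypothetical samples x genes matrix

# One model per gene j: features are all other genes' expression.
models = []
for j in range(expr.shape[1]):
    others = np.delete(expr, j, axis=1)
    models.append(Ridge(alpha=1.0).fit(others, expr[:, j]))

# Predicting gene 0 from the remaining genes:
pred_gene0 = models[0].predict(np.delete(expr, 0, axis=1))
```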
Thanks, Eric, for the clear explanation. I think the models should be more successful at making predictions in this version, compared to just taking the control expression and adjusting the perturbed genes alone, no? This should be especially true for single-cell predictions, as not every cell is expected to show comparable expression under perturbation.
I was not sure what to expect a priori, except that I would expect the steady-state matching to go better with longer prediction timescales and the non-steady-state to go better with short prediction timescales. Our experiments showed that a non steady-state matching method worked a little better overall, and the short timescales were always better than long timescales, even for experiments using steady state matching. I would be interested to try out more models built around an explicit timescale, so you can train it while saying "here's the outcome from 6 days of continuous perturbation". I think Bicycle is like this but I have not gotten a chance to test it. The experiments were very limited and I am not sure there is really any convincing signal of a biologically useful network model. We may end up learning that single-gene perturbation effects are mostly too weak to train and validate this type of model, and we may need training data with more dramatic biological variation.
I’ve also observed that the datasets you’ve integrated so far generally show a relatively low perturbation effect, making it challenging for a model to capture effectively. This issue becomes even more pronounced with pseudobulking. If you take single-cell data and compare the within-group versus between-group variation before and after pseudobulking, you’ll likely see a substantial decline. This suggests that the perturbation effect is not uniformly strong across all cells but rather concentrated in a subset of them, which gets smoothed out during pseudobulking.
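The within-group versus between-group comparison suggested above can be made concrete with a simple variance decomposition (a sketch on synthetic data with a deliberately weak perturbation effect; group labels and effect sizes are made up):

```python
import numpy as np

def within_between_variance(values, groups):
    """Split the variance of a 1-D measurement into within-group and
    between-group components (one group per perturbation)."""
    grand = values.mean()
    labels = np.unique(groups)
    within = np.mean([values[groups == g].var() for g in labels])
    between = np.mean([(values[groups == g].mean() - grand) ** 2 for g in labels])
    return within, between

rng = np.random.default_rng(5)
groups = np.repeat(np.arange(4), 25)               # 4 perturbations, 25 cells each
cells = rng.normal(loc=groups * 0.1, scale=1.0)    # weak perturbation effect per cell
within, between = within_between_variance(cells, groups)
```

With a weak effect, the within-group component dominates at the single-cell level; pseudobulking averages it away, leaving only the (small) between-group signal.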
On another note, would you be interested in joining us for our GRN benchmarking project? If so, please feel free to send me a direct message so we can keep this conversation on-topic here.
Regarding the pseudobulking: that does make sense. I have had trouble with gRNA quantification in the past when analyzing perturb-seq all the way from raw reads, and I suppose this is why there are methods like MIMOSCA. I have not studied this issue as much as I would like.
Regarding the collaboration, thank you and I will send you a message!
Hi Eric, I couldn't manage to send you a direct message on your GitHub profile, so I am writing my question here; please feel free to move it somewhere else. I observe that in your paper you have shown that the mean approach dominantly outperforms the other methods, and I am curious about this. I see that you have shown a strong correlation between the baseline prediction (mean of the training data) and the actual fold change (Spearman > 0.6). I wonder, considering that your baseline prediction has the same values for all samples of a given gene, how do you calculate the Spearman correlation on this 2D space? Is it something like this?
from scipy.stats import spearmanr
corr, p_value = spearmanr(y_true.flatten(), y_pred.flatten())
It would be kind of you to direct me to the actual code doing this.