janursa opened this issue 4 weeks ago
No problem. If y is observed expression of all genes under a perturbation, yhat is the prediction for all genes under that perturbation, and x is expression of all genes with no perturbation, we are computing the correlation between yhat-x and y-x. The code doing that is here.
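The per-perturbation delta correlation described above can be sketched as follows (a minimal illustration with synthetic data, not the repo's actual code; all variable names are hypothetical):

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical setup: 8 perturbations, 50 genes.
rng = np.random.default_rng(0)
x = rng.normal(size=50)                                # control expression, one value per gene
delta_true = rng.normal(scale=0.5, size=(8, 50))       # true per-perturbation changes
y = x + delta_true                                     # observed post-perturbation expression
yhat = x + 0.9 * delta_true + rng.normal(scale=0.1, size=(8, 50))  # a good prediction

# Correlation between predicted change (yhat - x) and observed change (y - x),
# computed separately for each perturbation (each row).
per_pert_corr = np.array([pearsonr(yh - x, yo - x)[0] for yh, yo in zip(yhat, y)])
```

One correlation per sample then gets summarized (e.g. averaged) across perturbations.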
Thanks for the prompt answer, Eric. So, you calculate the correlation for each sample (perturbation) separately. Have you tried metrics designed for multi-target predictions, for example an R^2 score that receives the full n_samples * n_genes matrix? This is important because in the datasets you have used, the variance within samples is more dominant than the variance between samples. Consequently, if you take the mean across samples and calculate the correlation within samples, you will always get a high correlation. It's almost trivial that the mean approach would beat the others, as it gets the strongest information in your dataset, which the other models are not getting. No?
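The multi-target metric being suggested could be computed like this (a sketch with synthetic data; the `r2_score` usage is standard scikit-learn, but the shapes and names are illustrative):

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
n_samples, n_genes = 20, 100
y_true = rng.normal(size=(n_samples, n_genes))
y_pred = y_true + rng.normal(scale=0.5, size=(n_samples, n_genes))

# Treat every (sample, gene) entry as one observation...
r2_flat = r2_score(y_true.ravel(), y_pred.ravel())
# ...or compute one R^2 per gene (column) and average across targets.
r2_multi = r2_score(y_true, y_pred, multioutput="uniform_average")
```

The two variants penalize a constant-per-gene baseline differently from a within-sample correlation, which is the point of the suggestion.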
I see your point that correlation has a very different baseline expectation depending on whether it's computed within each sample or within each gene. We haven't tried your exact suggestion yet, but we have used squared error and absolute error. Those come out the same whether they are computed row-first or column-first.
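The orientation-invariance of squared error is easy to verify numerically (a toy check; with a full matrix and equal group sizes, averaging rows first or columns first gives the same number):

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(size=(10, 30))      # 10 samples x 30 genes, synthetic
yhat = rng.normal(size=(10, 30))

# Average per-sample squared errors (row-first)...
mse_row_first = np.mean([np.mean((y[i] - yhat[i]) ** 2) for i in range(y.shape[0])])
# ...or average per-gene squared errors (column-first): identical result.
mse_col_first = np.mean([np.mean((y[:, j] - yhat[:, j]) ** 2) for j in range(y.shape[1])])
```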
Regarding the info that the methods get:
It's almost trivial that the mean approach would beat the others, as it gets the strongest information in your dataset, which the other models are not getting. No?
All of the methods receive access to the full training data, plus the post-perturbation expression of the gene or genes that were perturbed in each sample. The mean is implemented as just another regression method in the code too; it receives the data in the same type of function call as does e.g. LASSO or ridge regression. The code is not as clean as I'd like but that happens here.
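The "mean as just another regression method" setup can be sketched with scikit-learn's `DummyRegressor`, which exposes the same fit/predict interface as any other estimator (an illustration of the design, not the repo's actual wrapper):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
X_train = rng.normal(size=(40, 5))    # hypothetical features
Y_train = rng.normal(size=(40, 10))   # hypothetical multi-gene targets

# The mean baseline and a "real" model go through identical function calls.
baseline = DummyRegressor(strategy="mean").fit(X_train, Y_train)
ridge = Ridge(alpha=1.0).fit(X_train, Y_train)

baseline_pred = baseline.predict(X_train[:3])  # every row is the column means
ridge_pred = ridge.predict(X_train[:3])
```

Keeping the baseline behind the same interface guarantees it sees exactly the same data as the competing regressors.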
Thank you, Eric, for the thoughtful response. If you calculate the R^2 scores for the mean baseline across n_samples * n_genes, you’d likely get a result near zero. Other methods might perform worse, but the mean approach isn’t particularly effective either; it’s simply a matter of the metric used and the specific datasets that make it seem particularly strong.
Regarding your regression setup, please let me know if I’ve understood correctly. My understanding is that you use the mean expression data from the controls and adjust it to reflect the perturbed gene (either setting it to zero for knockouts or to the observed value for knockdowns and chemical perturbations). You then use this as the feature space for regression to predict the perturbed expressions (the log fold change). The issue here is that your control expressions are constant across the feature space, with only the perturbed genes as the varying elements. Is that right?
To verify, it might help to try an approach that discards the control expressions, setting only the perturbed genes as the feature space—like a one-hot encoding (if you have tried this already, I’d be interested to know the outcome). So, if my understanding is correct, you’re expecting the model to predict the perturbation outcomes based on the source of perturbation alone. This would indeed be a tough challenge for a regression model unless there are enough combinatorial perturbation settings for the model to learn effectively.
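The two feature spaces discussed above (control expression with only the perturbed gene overwritten, versus a pure one-hot encoding) might look like this (a toy sketch; gene indices and values are made up):

```python
import numpy as np

n_genes = 5
control_mean = np.array([1.0, 2.0, 0.5, 3.0, 1.5])  # hypothetical control expression
perturbed = [0, 3]       # index of the gene targeted in each of two samples
observed = [0.0, 0.8]    # post-perturbation value of the target (0.0 = knockout)

# Option A: control expression, with only the targeted gene overwritten.
X_control = np.tile(control_mean, (len(perturbed), 1))
for row, (g, v) in enumerate(zip(perturbed, observed)):
    X_control[row, g] = v

# Option B: discard control expression; encode only which gene was perturbed.
X_onehot = np.zeros((len(perturbed), n_genes))
X_onehot[np.arange(len(perturbed)), perturbed] = 1.0
```

In option A, all columns except the perturbed one are constant across samples, which is the concern raised above.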
Thanks for taking an interest in our work! I am really happy to be having these discussions. It's a complicated area and I am sure our work will benefit from more perspectives.
We have not tried this one-hot encoding idea exactly. The closest we have come is including certain methods whose predictions do not depend on the starting expression state: GEARS and the mean/median/empty network baselines.
There is one complication, which is always hard for me to express properly. What you describe is always the case when we go to make predictions: features all match the control expression, except for entries where that gene was directly targeted in that sample. But during training, we offer the user a choice for feature construction. One option is to pick a control sample and modify only the perturbed gene, similar to how predictions are made. The other option is to use the training data as-is under a steady state assumption, which would potentially have a lot more informative variation in the feature matrix. The user makes this choice via the parameter we call matching_method.
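The two training-time feature constructions could be sketched as one function (hypothetical names and signature, chosen for illustration; the repo's actual `matching_method` handling differs):

```python
import numpy as np

def build_features(train_expr, control_mean, perturbed_idx, perturbed_val,
                   matching="steady_state"):
    """Sketch of the two feature-construction choices.

    matching="control": start every row from a control profile and overwrite
    only the perturbed gene, mirroring how predictions are made.
    matching="steady_state": use each training sample's own expression as-is,
    keeping the training data's variation in the feature matrix.
    """
    if matching == "control":
        X = np.tile(control_mean, (len(perturbed_idx), 1))
    else:
        X = train_expr.copy()
    # In both cases, the directly targeted gene is set to its perturbed value.
    X[np.arange(len(perturbed_idx)), perturbed_idx] = perturbed_val
    return X

# Tiny example: two samples, three genes, gene 0 and gene 1 knocked out.
control = np.array([1.0, 2.0, 3.0])
train = np.array([[1.1, 2.2, 2.9], [0.9, 1.8, 3.1]])
X_ctrl = build_features(train, control, np.array([0, 1]), np.array([0.0, 0.0]),
                        matching="control")
X_ss = build_features(train, control, np.array([0, 1]), np.array([0.0, 0.0]),
                      matching="steady_state")
```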
My intuition is that these choices estimate different parameters: option one estimates total or long-term effects, and option two estimates direct or instantaneous effects. Since the predictions require total or long-term effects, we also follow CellOracle in exposing a parameter we call prediction_timescale
that allows the user to predict F(F(... F(X)...)) instead of just F(X). This is what we experiment with in figure 3's rightmost column. It's still not totally clear to me how this should be handled and I would definitely consider experimenting with it further.
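The iterated prediction F(F(... F(X)...)) amounts to repeatedly applying a one-step model (a generic sketch with a toy model standing in for the trained predictor; `prediction_timescale` here is just an iteration count, an assumption about the interface):

```python
import numpy as np

def iterate_predictions(predict, x0, prediction_timescale=3):
    """Apply a one-step model repeatedly: F(F(... F(x0)...))."""
    x = x0
    for _ in range(prediction_timescale):
        x = predict(x)
    return x

# Toy one-step model: expression decays halfway toward zero each step.
x_final = iterate_predictions(lambda x: 0.5 * x, np.ones(4), prediction_timescale=3)
```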
Re The other option is to use the training data as-is under a steady state assumption, which would potentially have a lot more informative variation in the feature matrix.
: can you elaborate on this? Do you use the training data in feature construction with some kind of target encoding to take into account the targeted gene, or how? By the way, would it be possible for you to share the single-cell data of replogle2? If so, please kindly write me at jalil.nourisa@gmail.com
RE: target encoding for the "steady state" matching scheme: we train a model to predict each gene's expression from the other genes' expression, with additional procedures to account for the perturbations themselves.
RE: replogle2, I'm sorry to say I do not have the single cell data for replogle2. In this project we only worked with the pseudo-bulk profiles. This prevented us from running GEARS on replogle2, which is unfortunate. But, even though you cannot train GEARS without scRNA data, the predictions from GEARS are the same across any group of cells with the same set of genes perturbed. So no essential information is lost.
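The "predict each gene from the other genes" scheme mentioned above can be sketched as one regression per gene (an illustration with synthetic data and an arbitrary choice of ridge regression; not the repo's actual training loop):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)
expr = rng.normal(size=(30, 6))   # hypothetical samples x genes matrix

# One model per gene j: features are all other genes' expression.
models = []
for j in range(expr.shape[1]):
    others = np.delete(expr, j, axis=1)
    models.append(Ridge(alpha=1.0).fit(others, expr[:, j]))

# Predicting gene 0 from the remaining genes:
pred_gene0 = models[0].predict(np.delete(expr, 0, axis=1))
```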
Thanks, Eric, for the clear explanation. I think the models should be more successful at making predictions in this version, compared to just taking the control expression and adjusting the perturbed genes alone, no? This should be especially true for single-cell predictions, as not every cell is expected to show comparable expression under perturbation.
I was not sure what to expect a priori, except that I would expect the steady-state matching to go better with longer prediction timescales and the non-steady-state to go better with short prediction timescales. Our experiments showed that a non steady-state matching method worked a little better overall, and the short timescales were always better than long timescales, even for experiments using steady state matching. I would be interested to try out more models built around an explicit timescale, so you can train it while saying "here's the outcome from 6 days of continuous perturbation". I think Bicycle is like this but I have not gotten a chance to test it. The experiments were very limited and I am not sure there is really any convincing signal of a biologically useful network model. We may end up learning that single-gene perturbation effects are mostly too weak to train and validate this type of model, and we may need training data with more dramatic biological variation.
I’ve also observed that the datasets you’ve integrated so far generally show a relatively low perturbation effect, making it challenging for a model to capture effectively. This issue becomes even more pronounced with pseudobulking. If you take single-cell data and compare the within-group versus between-group variation before and after pseudobulking, you’ll likely see a substantial decline. This suggests that the perturbation effect is not uniformly strong across all cells but rather concentrated in a subset of them, which gets smoothed out during pseudobulking.
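The within-group versus between-group comparison suggested above can be made concrete with a simple variance decomposition (a sketch on synthetic data with a deliberately weak perturbation effect; group labels and effect sizes are made up):

```python
import numpy as np

def within_between_variance(values, groups):
    """Split the variance of a 1-D measurement into within-group and
    between-group components (one group per perturbation)."""
    grand = values.mean()
    labels = np.unique(groups)
    within = np.mean([values[groups == g].var() for g in labels])
    between = np.mean([(values[groups == g].mean() - grand) ** 2 for g in labels])
    return within, between

rng = np.random.default_rng(5)
groups = np.repeat(np.arange(4), 25)               # 4 perturbations, 25 cells each
cells = rng.normal(loc=groups * 0.1, scale=1.0)    # weak perturbation effect per cell
within, between = within_between_variance(cells, groups)
```

With a weak effect, the within-group component dominates at the single-cell level; pseudobulking averages it away, leaving only the (small) between-group signal.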
On another note, would you be interested in joining us for our GRN benchmarking project? If so, please feel free to send me a direct message so we can keep this conversation on-topic here.
Regarding the pseudobulking: that does make sense. I have had trouble with gRNA quantification in the past when analyzing perturb-seq all the way from raw reads, and I suppose this is why there are methods like MIMOSCA. I have not studied this issue as much as I would like.
Regarding the collaboration, thank you and I will send you a message!
Hi Eric, I couldn't manage to send you a direct message on your GitHub profile, so I am writing my question here; please feel free to move it somewhere else. I observe that in your paper you have shown that the mean approach dominantly outperforms the other methods, and I am curious about this. I see that you have shown a strong correlation between the baseline prediction (mean of the training data) and the actual fold change (Spearman > 0.6). I wonder, considering that your baseline prediction has the same values for all samples of a given gene, how do you calculate the Spearman correlation on this 2D space? Is it something like this?
from scipy.stats import spearmanr
corr, p_value = spearmanr(y_true.flatten(), y_pred.flatten())
It would be kind of you to direct me to the actual code doing this.