flatironinstitute / inferelator

Task-based gene regulatory network inference using single-cell or bulk gene expression data conditioned on a prior network.
BSD 2-Clause "Simplified" License

shuffle_prior_axis #56

Closed torgeihv closed 1 year ago

torgeihv commented 2 years ago

What does the parameter shuffle_prior_axis actually do? I expected it to help me decide whether the precision-recall curve I get from "real data" is better than what I could expect by chance. But setting shuffle_prior_axis to 0 or 1 always seems to give me a very low area under the precision-recall curve (AUPR), which doesn't fit with tests I ran shuffling the labels of the prior outside of the Inferelator. I also tried high prior_weights, which I expected to increase the AUPR when shuffling, but it didn't.

Also, wouldn't you expect shuffling the regulator labels to do nothing? The algorithm is not using regulator expression data (I think?), so shuffling the regulator labels would just rename the regulators with no consequence (?).

Off topic: I installed the new version of the Inferelator (I think). How can I check which version I'm running?

asistradition commented 2 years ago

shuffle_prior_axis reorders the labels on the prior knowledge matrix - the functional effect is that it disconnects the calculated TF activity from the underlying biology, and you can then expect to recover edges only by chance. Your expectation is therefore correct, and the very low AUPR is an expected outcome.

I am not sure why you'd get a different outcome from shuffling labels outside the inferelator, unless you then used the same shuffled connectivity matrix as your gold standard for scoring. When you set shuffle_prior_axis, only the prior knowledge matrix is shuffled, and it is scored against an unshuffled gold standard. You would expect the same low AUPR as a negative control from shuffling regulators as well - again, because they're shuffled in the prior knowledge matrix and not in the gold standard.
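
For reference, a negative-control run can be set up something like this (a sketch; I'm assuming shuffle_prior_axis is passed through set_shuffle_parameters, so check the API of the version you have installed):

from inferelator import inferelator_workflow

# Shuffle the gene labels (axis 0) of the prior knowledge matrix, so any
# recovered edges should occur only by chance; the gold standard is untouched
worker = inferelator_workflow(regression="bbsr", workflow="tfa")
worker.set_shuffle_parameters(shuffle_prior_axis=0)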

You should be able to find your current version with inferelator.__version__, or you can check your environment with pip or conda.
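
For example (assuming the package is importable as inferelator):

import inferelator

# Prints the installed version string
print(inferelator.__version__)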

torgeihv commented 2 years ago

Thanks. Yes, I use the prior also as the gold standard. That explains my observations.

asistradition commented 2 years ago

If you're looking to benchmark, I would strongly suggest using holdouts. In general, it's difficult to benchmark a method without scoring it on some piece of information that was withheld from the algorithm in the first place.

This is how I usually do holdouts for cross-validation - it'll use 20% of the genes in the gold standard for scoring, and remove those genes' information from the prior knowledge that goes in (so you're only scoring on genes which are novel to the algorithm's results). I find that this is the only reliable way to get an estimate of how well (any) inference method is performing.

from inferelator import inferelator_workflow, CrossValidationManager

...

worker = inferelator_workflow(regression="bbsr", workflow="tfa")

# Hold out 20% of the gold standard genes for scoring and remove them from the prior
worker.set_crossvalidation_parameters(split_gold_standard_for_crossvalidation=True, cv_split_ratio=0.2)

...

# Repeat the holdout split across ten random seeds
cv_wrap = CrossValidationManager(worker)
cv_wrap.add_gridsearch_parameter('random_seed', list(range(1, 11)))
cv_wrap.run()

torgeihv commented 2 years ago

Thanks! Related to "overfitting": I've played around with the parameter "bsr_feature_num (int) – The number of features to include in best subset regression. Defaults to 10."

In the MSB paper it says: "For a given target gene i, potential predictors are those TFs that have a known regulatory effect on i, and the ten TFs with highest time-lagged CLR"

So does bsr_feature_num only limit the number of TFs added from CLR? Are all prior TFs always included in the regression model?

asistradition commented 2 years ago

bsr_feature_num affects the number of features used in total - it will only add high-scoring CLR edges to the prior knowledge edges until you reach this maximum. You can alter this behavior with worker.set_regression_parameters(clr_only=True), which will ignore the prior knowledge network and only use CLR to select features.

This is an unfortunate limitation, as best subset regression is O(2^n), so runtime gets out of hand fast. All prior TFs do influence the design matrix that expression is regressed against, though, through the calculation of transcription factor activity.

This is not a parameter to change if you're trying to control regularization - regularization uses a weighted model selection metric (it's a g-prior where the null model is zero coefficients). If you'd like a sparser output network you should decrease the weights, and if you'd like a less sparse output network you should increase the weights (which will push the model toward maximum likelihood and away from the null model). You can do this with worker.set_regression_parameters(prior_weight=1, no_prior_weight=1), where 1 is the default weight. The functionality exists to weight edges in the prior knowledge network more or less heavily than edges not in the prior knowledge network, but I wouldn't personally advise weighting known edges differently than non-edges - the prior knowledge network already influences the model enough without putting a finger on the scale.
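
As a sketch, using the parameter names above on the same worker object (double-check set_regression_parameters against the documentation for your installed version):

# Default weighting: edges in and out of the prior knowledge network count equally
worker.set_regression_parameters(prior_weight=1, no_prior_weight=1)

# Lower weights give a sparser network; higher weights push toward maximum
# likelihood and a denser network, e.g.:
# worker.set_regression_parameters(prior_weight=2, no_prior_weight=2)

# Ignore the prior knowledge network for feature selection and use CLR only:
# worker.set_regression_parameters(clr_only=True)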

If you have a large enough number of observations to make a stability metric viable, workflow.inferelator_workflow(regression="stars") is an option that should have reasonable performance without having to set model hyperparameters (it only really has one, and it is set internally based on a stability metric).
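
For example (same setup as the cross-validation sketch above, just swapping the regression module):

from inferelator import inferelator_workflow

# StARS selects its single regularization hyperparameter internally via a stability metric
worker = inferelator_workflow(regression="stars", workflow="tfa")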

torgeihv commented 2 years ago

Thanks! Another concept I struggle with is the confidence score: "The final confidence score of an interaction is defined as the mean of the normalized rank across all bootstraps, where at each bootstrap, interactions are ranked by variance explained." I cannot find a definition of the "normalized rank".

Would I be correct to assume that a score of 0.6 means that the variance explained across bootstrap samples rank on average in the top 40% of all interactions?

asistradition commented 2 years ago

You're mostly correct in your interpretation. The variance explained gets rank-summed for all of the bootstraps & then the resulting sum is linearly normalized so that 1 is the highest possible rank (this edge is at the top in all of the bootstraps) and 0 is the lowest possible rank (this edge has no evidence in any bootstrap).

The end result is that you're largely correct, except that A) it's the rank, on average, in the top 40% of all interactions for which there is any evidence that they might be regulatory edges, and B) you need to have enough bootstraps to smooth out the noise in the bootstrap rankings.

The reason this linear normalization was chosen over a percentile-based normalization is that, with enough bootstrapping, they mostly come out the same, but a percentile-based normalization can hide technical issues (like not having enough bootstraps to get good estimates of rank position), while the linear normalization should make them pretty obvious.
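
As an illustration only (this is not the inferelator's internal code, and the array names are made up), the rank-and-normalize scheme works roughly like this:

import numpy as np

def combine_confidences(var_explained_per_bootstrap):
    # var_explained_per_bootstrap: one (genes x TFs) array per bootstrap
    n_edges = var_explained_per_bootstrap[0].size
    rank_sum = np.zeros(n_edges)

    for ve in var_explained_per_bootstrap:
        # Rank edges within this bootstrap: 0 = least variance explained,
        # n_edges - 1 = most variance explained (ties broken arbitrarily here)
        rank_sum += ve.ravel().argsort().argsort()

    # Linear normalization: 1 = top-ranked in every bootstrap,
    # 0 = bottom-ranked in every bootstrap
    max_possible = (n_edges - 1) * len(var_explained_per_bootstrap)
    return (rank_sum / max_possible).reshape(var_explained_per_bootstrap[0].shape)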