egr95 / R-codacore

An R package for learning log-ratio biomarkers from high-throughput sequencing data.

Methods for increasing stability of selected balances? #13

Open nick-youngblut opened 2 years ago

nick-youngblut commented 2 years ago

I'm running codacore on a dataset of ~300 samples. I'm conducting binary classification with default parameters, and if I just run the codacore() function multiple times (with nothing else changing), I get very different balances and sometimes no balances at all. The AUC is generally ~0.87 on train and ~0.8 on test, so the model does seem to be predictive. Still, I'm worried about the instability of the balances with regard to both reproducibility of the analysis and biological interpretation. Any suggestions?

egr95 commented 2 years ago

Did you set a seed (see Section 3 of the Guide)? Doing so should ensure you get the exact same log-ratio; if it does not, please do share a snippet and I will investigate the issue. (Note that internally codacore does some cross-validation, involving random subsampling, to determine the complexity of the model, i.e., the number of log-ratios and the number of covariates included in each.)
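For what it's worth, a minimal sketch of pinning the seed before fitting. Here `x` (a samples-by-taxa count matrix) and `y` (a binary outcome) are placeholders for your data; since codacore runs on a TensorFlow backend, the backend's own seed may also need fixing:

```r
library(codacore)

set.seed(0)                       # fixes R's RNG (used for the CV subsampling)
# tensorflow::set_random_seed(0)  # may additionally be needed for the TF backend

model <- codacore(x, y)           # x, y: your data (placeholders here)
print(model)
```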

That said, you do raise a good point in the sense that I should probably have exposed some kind of seed argument in the main codacore function to make this explicit to the user. Unless you have a better suggestion, I will add a randomState argument (akin to scikit-learn).

nick-youngblut commented 2 years ago

I did not set a seed, since I wanted to assess the output variability of codacore(). I'm glad I did so, since I can get similar AUCs with very different balances. This is really important for biological interpretation, especially with regard to designing wetlab experiments based on that data. It would be tragic to focus months or years worth of experiments on one set of balances, just to find out that a different seed would have put the focus on one or more completely different balances.

Of course, researchers shouldn't rely on just one analysis, but as seen in the literature, this is often the case.

I'm considering running codacore() many times (e.g., n=100) with no seed set, and then compiling the set of non-identical balances that produce "acceptable" test AUCs. This would help reveal all possible balances that are comparably predictive of the target variable. Does that make sense?
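A sketch of that idea: refit codacore B times without a fixed seed, keep the runs whose held-out AUC clears a threshold, then tabulate which taxa enter the first learned balance. The names `x_train`/`x_test`/`y_train`/`y_test` are placeholders for a pre-made split, pROC supplies the test AUC, and I'm assuming `getNumeratorParts()`/`getDenominatorParts()` return logical masks over the input columns, as in the package README:

```r
library(codacore)
library(pROC)

B <- 100
auc_min <- 0.75   # "acceptable" test AUC cutoff (adjust to taste)
kept <- list()

for (b in seq_len(B)) {
  fit <- codacore(x_train, y_train)
  if (length(fit$ensemble) == 0) next               # no balance found this run

  auc_test <- auc(y_test, predict(fit, x_test))
  if (auc_test >= auc_min) {
    # record the taxa appearing in the first balance (numerator or denominator)
    parts <- getNumeratorParts(fit, 1) | getDenominatorParts(fit, 1)
    kept[[length(kept) + 1]] <- colnames(x_train)[parts]
  }
}

# How often each taxon enters an "acceptable" balance:
sort(table(unlist(kept)), decreasing = TRUE)
```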

nick-youngblut commented 2 years ago

Attached is an example of running codacore on the same data 100 times. 31 of the 100 replicates generated predictive balances (length(model$ensemble) > 0). The plot shows how taxonomically variable the resulting balances are.

codacore_results

nick-youngblut commented 2 years ago

Regarding running codacore() repeatedly: the function "stalls" when I try to run it in parallel. I've tried doParallel and clustermq, and both result in the codacore() jobs never completing. I can run tensorflow functions in parallel, so that doesn't seem to be the issue.

egr95 commented 2 years ago

Thanks for the comments and for sharing the cool graphics! A few things to unpack here:

  1. With regard to the repeated runs, this is very similar to the analysis in Section 4.2 of our manuscript. In general, for datasets with (a) a high number of samples, (b) a low number of input variables, and (c) low multicollinearity between those inputs, the learned balances will be more stable, whereas datasets with (a) a low number of samples, (b) a high number of input variables, and (c) high multicollinearity between the inputs will typically yield more variability in the learned balances. Note that the example I referred to in our manuscript is run on a dataset with large n (~1,000) and small p (~50), which is partly why the balances found were very stable in that case. For high-dimensional data with significant multicollinearity, however, there will inevitably be a larger set of potential balances with good predictive accuracy. Note that this difficulty applies not just to codacore, but more generally across predictive models of high-throughput sequencing data.
  2. Perhaps the most important hyperparameter in codacore is lambda, which sets the regularization strength. The default lambda=1 is quite conservative, meaning it will only return a balance if there is strong evidence that this balance will be predictive out-of-sample. If you find that codacore does not return any balances at all, this may mean you are over-regularizing the model, and smaller values of lambda (say lambda=0) could be worth trying out.
  3. Regarding parallelization, I have only used codacore by running Slurm jobs from a bash script. I have not tried codacore with clustermq or doParallel, so I would have to see a code sample and the error messages to be able to help here.
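On point 2, lambda is just an argument to codacore(). A minimal sketch (`x` and `y` are placeholders), assuming, as in the manuscript, that lambda counts the number of standard errors in the cross-validated selection rule:

```r
library(codacore)

# lambda = 1 (default): a balance is kept only if it beats the baseline by a
# full standard error of the CV score; conservative, and may return no balances.
fit_conservative <- codacore(x, y, lambda = 1)

# lambda = 0: keep the balance that minimizes CV error; returns log-ratios
# more readily, at some risk of overfitting.
fit_permissive <- codacore(x, y, lambda = 0)
```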
nick-youngblut commented 2 years ago

A general question about which balances are learned: I don't see where the balance selection takes place in the code, so can you please point it out? I'm interested in forking codacore to modify how balances/amalgamations are selected.

egr95 commented 2 years ago

Hi Nick, my apologies for the very late reply here; I somehow missed the GitHub notification completely. In answer to your question (in case you haven't figured it out already), codacore first learns a "soft" balance, which is then discretized by cross-validation. This discretization step produces the eventual balances that are returned by codacore. In the R package, the "soft" balance is learned in the function trainRelaxation.CoDaBaseLearner here, and the discretization happens in harden.CoDaBaseLearner here. In the Python package, the "soft" balance is learned in the function train_relaxation here, and the discretization happens in discretize here.
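To make the two-step picture concrete, here is a purely illustrative sketch (not codacore's actual code, and with made-up taxon names and weights) of what discretizing a soft balance looks like: each taxon gets a continuous weight, and thresholding those weights produces the hard numerator/denominator/excluded assignment, with the cutoff itself chosen by cross-validation in codacore.

```r
# Soft balance: continuous per-taxon weights (hypothetical values)
soft_weights <- c(taxonA = 0.9, taxonB = 0.1, taxonC = -0.7, taxonD = -0.05)

# Discretization: threshold into numerator (+1), denominator (-1), excluded (0).
# In codacore the cutoff is chosen by comparing candidate cutoffs via CV.
cutoff <- 0.5
hard <- ifelse(soft_weights >  cutoff,  1,
        ifelse(soft_weights < -cutoff, -1, 0))
hard
# taxonA lands in the numerator, taxonC in the denominator, B and D drop out
```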