egr95 / R-codacore

An R package for learning log-ratio biomarkers from high-throughput sequencing data.
Other
21 stars 3 forks source link

Access results when multiple balances are found #7

Closed johannesbjork closed 1 year ago

johannesbjork commented 3 years ago

When CoDaCoRe identifies several balances, I'm not certain how to access the different results, e.g. numerator/denominator of the last balance (which has the highest AUC).

If there is one single balance identified, I can simply

numerator_species <- colnames(otu_table(ps_codacore))[codacore::getNumeratorParts(model)]
denominator_species <- colnames(otu_table(ps_codacore))[codacore::getDenominatorParts(model)]

Thanks

egr95 commented 3 years ago

Thanks @johannesbjork!

The functions getNumeratorParts and getDenominatorParts take a second argument that can be used to select which of the (potentially multiple) balances learned you want to query. By default, it refers to the first balance, but if you wanted to access the second balance, you could run getNumeratorParts(model, 2), or in the context of your code: numerator_species <- colnames(otu_table(ps_codacore))[codacore::getNumeratorParts(model, 2)]

Alternatively, you can always run print(model) and this will show the components in all balances learned (though I appreciate that wouldn't combine easily with the expression you wrote above).

Note also that Codacore is a stagewise-additive model, also known as an ensemble. This is important in terms of its interpretation (section 4.2 from the manuscript offers an explanation). So for example, if Codacore identifies a total of 2 predictive balances, then the first balance can be thought of as the most predictive (typically with the highest AUC). The second balance won't have a higher AUC per se, but rather will allow you to obtain a higher AUC when combined with the first balance. Mathematically, the first balance is learned by trying to optimize the quality of the fit: y ~ B1. Then, once the first balance has been determined, the second balance is learned by optimizing the quality of the joint fit: y ~ B1 + B2, where B1 is now fixed (i.e., we only optimize over B2). Thus, the second balance is designed to identify the additional discriminatory power in the data after accounting for the first balance. In other words, the first balance can be interpreted in isolation, but the second should be interpreted in addition to the first, the third in addition to the first two, and so forth.

Hope this helps, let me know if any parts need more clarification!

johannesbjork commented 3 years ago

Thanks for a great response @egr95!

A follow up question: If the model finds two predictive balances, how do I can calculate (1) the AUC achieved by the 1st balance on the test data, and (2) the again in AUC on the test data by the inclusion of also the second balance?

Thanks, Johannes

egr95 commented 3 years ago

Thanks @johannesbjork , this is a good point! It wasn't directly obvious from the Guide how one might go about doing what you describe. So I have just added a parameter to the predict function that lets you specify the number of log-ratios (starting from the first log-ratio) that you want to include in forming the prediction (see Guide Section 3 as of writing). For example: yHat <- predict(model, xTest, logits=F, numLogRatios=1) pROC::auc(pROC::roc(yTest, yHat)) yHat <- predict(model, xTest, logits=F, numLogRatios=2) pROC::auc(pROC::roc(yTest, yHat))

Another (more manual) way you could go about doing this is using the getLogRatios function on your test data, which will return a matrix where the first column is the output of the top log-ratio (computed on the test data), the second column is the output of the second log-ratio, and so forth. The prediction is then a linear combination of these.

Let me know if there's anything else I can help with!

nick-youngblut commented 2 years ago

take a second argument that can be used to select which of the (potentially multiple) balances learned you want to query

How does one programatically determine the number of balances? I don't see that info in the object returned by codacore().

If this is known, the one could make a tidy table for the output. Example:

tidy_balance = function(baseLearnerIndex, model, xTrain){
    x = colnames(xTrain)[getNumeratorParts(model, baseLearnerIndex)]
    df = data.frame(X = 'Numerator', Y = x)
    x = colnames(xTrain)[getDenominatorParts(model, baseLearnerIndex)]
    df = rbind(df, data.frame(X = 'Denominator', Y = x))
    df$baseLearnerIndex = baseLearnerIndex
    return(df)
}

n_balances = 2       # <-- it would help to extract this info from `model`
1:n_balances %>%
    lapply(tidy_balance, model=model, xTrain=xTrain) %>%
    do.call(rbind, .)
egr95 commented 2 years ago

Thanks for the comment! The individual log-ratios (aka base learners) found by codacore are stored under model$ensemble, so the easiest way to do this would be: n_balances = length(model$ensemble)

Happy to add a function like getNumBalances(), and perhaps also a makeTidyTable() as you suggest.

nick-youngblut commented 2 years ago

Thanks for the info on how to get n_balances! It would be great to have both of the convenience functions, but n_balances = length(model$ensemble) could just be described in the docs as an alternative.

egr95 commented 2 years ago

Sure thing, I have gone ahead and added in the functions getNumLogRatios() as well as getTidyTable(). Thanks!