egr95 / R-codacore

An R package for learning log-ratio biomarkers from high-throughput sequencing data.

Non-robust results and non-convergence #20

Closed Glfrey closed 1 year ago

Glfrey commented 1 year ago

Hello,

I noticed a curious result: the same model, when simply re-run with no changes to inputs or parameters, returns a different set of log-ratios. There is some commonality among the results, but at no point am I able to reproduce the same set of outputs.

When running the model, a number of warnings are produced each time, such as

Warning messages:
1: glm.fit: fitted probabilities numerically 0 or 1 occurred
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
3: glm.fit: fitted probabilities numerically 0 or 1 occurred
4: glm.fit: fitted probabilities numerically 0 or 1 occurred
5: glm.fit: fitted probabilities numerically 0 or 1 occurred
6: glm.fit: fitted probabilities numerically 0 or 1 occurred
7: glm.fit: algorithm did not converge
8: glm.fit: fitted probabilities numerically 0 or 1 occurred
9: glm.fit: algorithm did not converge
10: glm.fit: fitted probabilities numerically 0 or 1 occurred
11: glm.fit: algorithm did not converge
12: glm.fit: fitted probabilities numerically 0 or 1 occurred
13: glm.fit: algorithm did not converge

So is the lack of robustness a result of this non-convergence? If so, should there be some kind of warning about the trustworthiness of the results? Or are the two things unrelated?

Thank you!

Gill

Glfrey commented 1 year ago

Apologies, I just noticed this was also raised in #13. However, I wonder if you'd also comment on these errors and whether there's anything I can do to prevent them (or indeed, what might be causing them). Could it be a case of overfitting? I am using a very small dataset (8 positive, 12 negative) as a test run for a larger amount of data coming my way soon.

I'm working on familiarising myself with how codacore works, but I'm not quite there yet so I'd be really grateful for your help.

Thank you!

egr95 commented 1 year ago

Hi Gill, thank you for the question. Regarding reproducibility, you are right to point out #13; note this is also discussed in the Guide (Section 3). Basically, if you want identical results on repeated runs, you need to set the random seed (including for TensorFlow).
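A minimal sketch of what that looks like, assuming the `tensorflow` R package's `set_random_seed()` helper is available in your installed version; `x` and `y` stand in for your own training data:

```r
# Hedged sketch: fix both sources of randomness before fitting.
library(codacore)

set.seed(0)                      # R-level seed (e.g. cross-validation folds)
tensorflow::set_random_seed(0)   # TensorFlow seed (gradient descent)

model <- codacore(x, y)          # repeated runs should now match
```

With both seeds fixed, repeated calls with identical inputs should return the same set of log-ratios.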

Secondly, regarding the warnings: these typically arise in classification problems when the model perfectly separates your data. In this case, the iterative process used to fit a GLM (I believe Newton-Raphson is used for the binomial glm()) can drive the likelihood toward infinity once it finds the right decision boundary. This seems likely in your problem, given the low number of data points. While you can still use the model to classify new test points, bear in mind that you are particularly at risk of overfitting with such a small training set.
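To see that these warnings come from base R's glm() rather than from CoDaCoRe itself, here is a tiny standalone illustration with perfectly separated toy data (nothing CoDaCoRe-specific):

```r
# Toy data: y == 1 exactly when x > 3, so the classes are perfectly
# separable by a single threshold on x.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(0, 0, 0, 1, 1, 1)

# Fitting a logistic regression on separable data typically emits:
#   "glm.fit: fitted probabilities numerically 0 or 1 occurred"
# and the coefficient estimate for x diverges toward infinity.
fit <- glm(y ~ x, family = binomial)
summary(fit)
```

The fitted probabilities pile up at 0 and 1 because no finite coefficient maximizes the likelihood, which is exactly the situation the warnings describe.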

Glfrey commented 1 year ago

Hi @egr95 ,

Thank you for your help. I'm not planning to use this dataset to classify, I'm just getting to grips with CoDaCoRe ready for when my larger dataset comes through.

Just out of interest, what processes require a seed in CoDaCoRe? Is it just the gradient descent?

egr95 commented 1 year ago

You're welcome. I wouldn't say any random seed is "required", but seeding is available if you want to reproduce the sources of randomness in the algorithm. The TensorFlow seed governs the gradient-descent step, whereas the R seed governs the "discretization" step of the algorithm (see Section 3.3 of the paper). Basically, one can think of the gradient-descent procedure as finding a "continuous" set of candidate log-ratios, from which we then select the specific log-ratio that minimizes cross-validation error. The cross-validation folds are selected at random, which adds a further source of randomness to the algorithm.

Glfrey commented 1 year ago

Hi @egr95 ,

Thanks again, that makes perfect sense.