biolab / orange3-survival-analysis

🍊 ➕ Survival Analysis add-on for Orange3 data mining suite.
GNU General Public License v3.0

Cox coefficients documentation #54

Open leonardosegurat opened 1 year ago

leonardosegurat commented 1 year ago

Is your feature request related to a problem? Please describe.

When sending Cox regression coefficients to a data table, only the beta terms are included and the h0 term is omitted, so the output does not contain enough information to reproduce the results.

Since this was my first Cox regression, I did not know that the Cox risk score is calculated with a specific expression, and I naively calculated it as in multiple linear regression. The discrepancy led me to investigate. I found general documentation for Cox regression, but nothing in the widget's docs or by searching for Orange Data Mining, and the process was time-consuming. Importantly, I found there is an h0 term that is not included in the coefficients output, at least not that I could find, and I had to calculate it "by hand" on a spreadsheet by comparing my results with the widget's data output. Far from ideal.

Describe the solution you'd like

  1. Extend the Cox regression widget documentation to include the equation used to calculate the Cox risk score, or a reference to an explanation.
  2. Include the h0 term in the coefficients output, either as a separate output (Data / Coefficients / Constant) or within the coefficients output.

Describe alternatives you've considered

Additional context

The Cox risk score can be computed with the following equation:

h(t) = h0(t) * exp(x1*b1 + x2*b2 + x3*b3 + ... + xn*bn)

where h0 indicates a "base risk" term, the x's correspond to predictor features, and the b's correspond to their coefficients.
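To show how this differs from a multiple linear regression prediction, here is a minimal sketch in plain Python. The function names, the coefficient values, and the h0 value of 0.1 are hypothetical and chosen only for illustration; they do not come from the widget.

```python
import math

def linear_predictor(x, beta):
    """Return x1*b1 + x2*b2 + ... + xn*bn (what a linear regression would output)."""
    return sum(xi * bi for xi, bi in zip(x, beta))

def partial_hazard(x, beta):
    """Return exp(x1*b1 + ... + xn*bn), the time-independent part of h(t)."""
    return math.exp(linear_predictor(x, beta))

def hazard(h0_t, x, beta):
    """Return h(t) = h0(t) * exp(x . beta) for a given baseline value h0(t)."""
    return h0_t * partial_hazard(x, beta)

# Hypothetical coefficients and one hypothetical subject.
beta = [0.5, -0.2]
x = [2.0, 1.0]

lp = linear_predictor(x, beta)   # the naive "linear regression" value
ph = partial_hazard(x, beta)     # the exponentiated value
h = hazard(0.1, x, beta)         # scaled by an assumed baseline value of 0.1
```

The linear sum and the exponentiated value rank subjects in the same order, but they are on different scales, which is one way a hand calculation can disagree with a reported risk score.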

PS: If you're open to contributions, I'm willing to dedicate some time to researching and helping with documentation. I have no experience working with Open-Source Projects, and minimal coding experience. On the other hand, I do have a strong background in statistics as a Lean Six Sigma Black Belt, a taste for technology and a lot of admiration towards the Open-Source community.

JakaKokosar commented 1 year ago

Hey @leonardosegurat, I apologise for the late reply.

Orange uses survival models implemented in the lifelines package. Just recently we updated the Cox regression widget to output not only regression coefficients but also other statistics, look here.

Screenshot 2022-11-23 at 12 29 49

Indeed there is no reason for not having an additional output channel for estimated baseline hazard. I would imagine this would be a table with two columns; the first column is time and the second is the estimated baseline at that time point. Your thoughts?
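To make the proposed two-column table concrete, here is a minimal sketch of a Breslow-type baseline hazard estimate in plain Python. The function name, the toy data, and the choice of the Breslow estimator are assumptions for illustration; this is not the widget's or lifelines' actual code. In lifelines itself, a fitted `CoxPHFitter` exposes a `baseline_hazard_` table, which would presumably be the source for such an output.

```python
import math
from collections import Counter

def breslow_baseline_hazard(durations, events, covariates, beta):
    """Return [(time, baseline_hazard), ...] with one row per distinct event time.

    At each event time t, the estimate is
    (number of events at t) / (sum of exp(x . beta) over subjects still at risk at t).
    """
    risk = [math.exp(sum(x * b for x, b in zip(row, beta))) for row in covariates]
    deaths = Counter(t for t, e in zip(durations, events) if e)
    rows = []
    for t in sorted(deaths):
        # A subject is at risk at t if its duration is at least t.
        at_risk = sum(r for r, d in zip(risk, durations) if d >= t)
        rows.append((t, deaths[t] / at_risk))
    return rows

# Toy data (hypothetical): 5 subjects, 1 covariate, events flagged with 1.
durations = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
covariates = [[0.0], [1.0], [0.0], [1.0], [0.0]]

# With all coefficients at zero, every subject has weight 1.
table = breslow_baseline_hazard(durations, events, covariates, beta=[0.0])
```

With the coefficients set to zero, each row reduces to events divided by the number of subjects at risk, which is an easy sanity check on the table layout.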

As you noticed, the documentation is lacking and could use improvement. In Orange, the risk scores (sometimes referred to as the prognostic index) are the predicted partial hazards (the second part of the equation).
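One detail that may account for a constant factor when comparing a hand calculation with the output: as far as I can tell from the lifelines documentation, `predict_partial_hazard` is computed as exp((x - mean(x)) . b), that is, with covariates centered on their training means. Whether the widget's scores follow exactly that convention is an assumption here. The plain-Python sketch below uses hypothetical data and function names to show that centered and uncentered scores differ by the same constant for every row.

```python
import math

def uncentered_score(x, beta):
    """Return exp(x . beta) computed on the raw covariates."""
    return math.exp(sum(xi * bi for xi, bi in zip(x, beta)))

def centered_score(x, beta, means):
    """Return exp((x - mean) . beta) computed on mean-centered covariates."""
    return math.exp(sum((xi - mi) * bi for xi, bi, mi in zip(x, beta, means)))

# Hypothetical data: 3 subjects, 2 covariates, and made-up coefficients.
rows = [[1.0, 10.0], [2.0, 12.0], [3.0, 8.0]]
beta = [0.4, -0.1]

# Column means of the covariates.
means = [sum(col) / len(col) for col in zip(*rows)]

# Ratio of centered to uncentered score for each subject.
ratios = [centered_score(r, beta, means) / uncentered_score(r, beta) for r in rows]

# The ratio is the same for everyone: exp(-(mean . beta)).
constant = math.exp(-sum(m * b for m, b in zip(means, beta)))
```

A single multiplier recovered on a spreadsheet could therefore be this centering constant rather than the baseline hazard itself.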

leonardosegurat commented 1 year ago

Sounds good! Apologies for the late reply, and thanks for pointing me to the lifelines docs!

I would imagine this would be a table with two columns; the first column is time and the second is the estimated baseline at that time point. Your thoughts?

As I understand it, risk scores / prognostic indexes are constant over time (at least in basic Cox regression), so the output h0 would be a single value that predicts survival over time, and is scaled proportionally by whatever the right side of the formula resolves to.

I've got a few more suggestions, but I'd like to refine them a bit before opening a suggestion thread. For example, it'd be nice to have some sort of log-rank matrix when comparing cohorts in Kaplan-Meier plots, so that we can make individual pairwise comparisons rather than comparing all of the curves at once, or compare against the baseline curve (I'm using Select Rows for now).

It would also be useful to have an option to plot the baseline survival curve along with the cohorts, to make comparisons. This can be accomplished with Edit Domain and concatenation, but it took me a while to get it working. Perhaps this image will make the idea a bit clearer:

[image]

That's two low-risk curves (training and validation), the baseline curve (made with a 5th category and a copy of the whole dataset), and two high-risk curves (training and validation, again).
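To illustrate the log-rank matrix idea, here is a rough sketch in plain Python of a two-group log-rank test applied to every pair of cohorts. The function names and the toy cohorts are hypothetical, and this is only an illustration of the concept; lifelines already provides `pairwise_logrank_test` in `lifelines.statistics`, which would be the natural thing for a widget to build on.

```python
import math
from itertools import combinations

def logrank_p_value(times_a, events_a, times_b, events_b):
    """Return the two-group log-rank p-value (chi-square with 1 degree of freedom)."""
    all_times = times_a + times_b
    all_events = events_a + events_b
    event_times = sorted({t for t, e in zip(all_times, all_events) if e})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)   # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)   # at risk in group B
        d_a = sum(1 for x, e in zip(times_a, events_a) if e and x == t)
        d_b = sum(1 for x, e in zip(times_b, events_b) if e and x == t)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 1.0
    statistic = obs_minus_exp ** 2 / variance
    # Survival function of chi-square(1) at x equals erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(statistic / 2))

def pairwise_logrank(groups):
    """Return {name: {other_name: p_value}} for every pair of cohorts."""
    matrix = {name: {} for name in groups}
    for a, b in combinations(groups, 2):
        p = logrank_p_value(*groups[a], *groups[b])
        matrix[a][b] = matrix[b][a] = p
    return matrix

# Hypothetical cohorts: name -> (durations, event flags).
groups = {
    "high_risk": ([1, 2, 3, 4, 5, 6], [1, 1, 1, 1, 1, 1]),
    "low_risk": ([10, 12, 14, 15, 16, 18], [1, 1, 1, 1, 1, 1]),
    "low_risk_copy": ([10, 12, 14, 15, 16, 18], [1, 1, 1, 1, 1, 1]),
}
matrix = pairwise_logrank(groups)
```

Each off-diagonal entry is the p-value for one pair of curves. With many cohorts, some multiple-comparison correction would probably be needed before reading too much into any single cell.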

I'll be sure to open issues for these ideas once refined! (If they aren't being worked on already)

Thanks again, and keep up the good work!!