Closed plpxsk closed 6 years ago
Hey @pavopax - just so I'm clear, can you show an example of what this might look like in R code?
sure:
library(dplyr)
library(forcats)
library(survival)
N <- 20
df <- data_frame(
time=rexp(N),
event=as.integer(rbinom(N, 1, .5)),
treatment_=c(rep("control", N/2), rep("active", N/2)))
cox <- coxph(Surv(time, event) ~ treatment_, data=df)
In the model summary below, exp(coef)
for "treatment_control" is the hazard ratio of control:active , where active is the "reference" since it is alphabetically before "control"
> cox
Call:
coxph(formula = Surv(time, event) ~ treatment_, data = df)
coef exp(coef) se(coef) z p
treatment_control -0.373 0.689 0.917 -0.41 0.68
And I'd like to see the reverse HR: active to control, where control is the reference ((common in pharma/biotech).
So I reverse it like so:
df$treatment_ <- df$treatment_ %>% as.factor() %>% forcats::fct_rev()
And the result:
cox <- coxph(Surv(time, event) ~ treatment_, data=df)
> cox
Call:
coxph(formula = Surv(time, event) ~ treatment_, data = df)
coef exp(coef) se(coef) z p
treatment_active 0.373 1.452 0.917 0.41 0.68
Now, this shows me the desired hazard ratio, which I obtained by reversing the "treatment_" factor variable via fct_rev()
(It is actually easy to get this hazard ratio: exp(coef)
is just 1/exp(coef)
from the first cox model )
Let me know if this solves your question: in lifelines, all the processing & encoding is done outside of lifelines. That is, before the dataframe even hits lifelines, it should be encoded as you wish. So it's up to you whether the baseline is control or treatment in the dataframe column. The inference should proceed as normal, and the resulting cph.print_summary
should represent the choice chosen.
Popular encoding functions are available in the patsy library (which has a R-style formula system) and Pandas get_dummies
function (but this doesn't allow you wish to chose as the baseline.
df = pd.DataFrame.from_records([
{'t': 1, 'e': 0, 'cata': 'control'},
{'t': 2, 'e': 0, 'cata': 'treatment'},
{'t': 3, 'e': 0, 'cata': 'treatment'},
{'t': 4, 'e': 0, 'cata': 'control'},
{'t': 5, 'e': 0, 'cata': 'treatment'},
{'t': 6, 'e': 0, 'cata': 'control'},
])
df = pd.get_dummies(df, columns='cata')
# choice is now yours to make either baseline, so either:
del df['cata_control']
# or
del df['cata_treatment']
cph = CoxPHFitter()
cpf.fit(df, 't', 'e')
...
Yes, this makes complete sense 💯
As suggested, I will make dummies with pd.get_dummies(drop_first=False)
and then drop my "reference" level column from the data frame.
And I fully agree with your philosophy of "do all processing/encoding outside of lifelines."
So the modification could instead be made in pd.get_dummies()
as discussed at https://github.com/pandas-dev/pandas/issues/12042#issuecomment-343294545. For future viewers, a more general relevant thread is at https://github.com/pandas-dev/pandas/issues/8918.
Issue can be closed.
With a categorical covariate with 2 levels: "A" and "B", it looks like "A" is kept as reference level, so that cox regression output shows hazard ratio of B to A. What if I'd prefer the reversed hazard ratio of A to B (so that "B" is reference level")
To clarify, in R, this is done via "factor" and then you can set a reference factor of "B", for example.
Is this possible in
lifelines
?