CamDavidsonPilon / lifelines

Survival analysis in Python
lifelines.readthedocs.org
MIT License

Cox Regression with a categorical variable #1505

Open anyang-kevin opened 1 year ago

anyang-kevin commented 1 year ago

Hi, I have a problem similar to #1203. I want to choose a specific reference category for a categorical variable in Cox regression, in both univariate and multivariate analyses.
1. How do I choose the category in a multivariate Cox model? Would the formula be something like formula='C(a,Treatment=(x))+C(b,Treatment=(y))'?
2. I tried to choose a specific category in a univariate model, but the error message says that Treatment is not defined. This is my code: cph.fit(for_df,duration_col='OS.time',event_col='OS',formula="C(gender,Treatment('female'))") and this is the error message: formulaic.errors.FactorEvaluationError: Unable to evaluate factor C(gender,Treatment('female')). [NameError: name 'Treatment' is not defined]

CamDavidsonPilon commented 1 year ago

Hi @anyang-kevin, try "C(gender, contr.treatment(base='female'))"

anyang-kevin commented 1 year ago

C(gender, contr.treatment(base='female'))

Thanks for your reply, but it doesn't work. My code is: cph.fit(for_df,duration_col='OS.time',event_col='OS',formula="C('gender', contr.treatment(base='female'))") and the error is: formulaic.errors.FactorEvaluationError: Unable to evaluate factor C('gender', contr.treatment(base='female')). [NameError: name 'contr' is not defined] My lifelines version is 0.27.1.

I also hope you can answer my first question. I'm sorry, I don't know enough about the formulaic package. Thank you!

CamDavidsonPilon commented 1 year ago

What version of formulaic do you have? You can use import formulaic; print(formulaic.__version__) to check.
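A slightly broader version check can be sketched as follows. It reports the Python, lifelines, formulaic, and pandas versions in one go, since the thread turns on how they interact. The helper name `installed_version` is made up for this example, and `importlib.metadata` is only in the standard library from Python 3.8 onward, so on 3.7 the helper simply returns None unless the backport is used instead.

```python
import sys

try:
    from importlib import metadata  # standard library on Python 3.8+
except ImportError:
    metadata = None  # Python 3.7 would need the importlib_metadata backport


def installed_version(package):
    """Return the installed version string of `package`, or None if unavailable."""
    if metadata is None:
        return None
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("python   ", sys.version.split()[0])
for name in ("lifelines", "formulaic", "pandas"):
    print(name.ljust(9), installed_version(name))
```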

CamDavidsonPilon commented 1 year ago

Also, don't put quotes around gender, it should be:

formula="C(gender, contr.treatment(base='female'))"
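For the multivariate case asked about at the top of the thread, the same term can presumably be repeated per column and joined with +. The sketch below only builds the formula string. The helpers `treatment_term` and `build_formula` and the `stage` column are hypothetical, and the `contr.treatment(base=...)` syntax assumes a formulaic release recent enough to provide `contr`.

```python
def treatment_term(column, base):
    """Build a formula term that treatment-codes `column` against `base`."""
    # repr() quotes string levels ('female') and leaves numeric levels bare (0).
    return f"C({column}, contr.treatment(base={base!r}))"


def build_formula(reference_levels):
    """Join one treatment-coded term per column with ' + '."""
    return " + ".join(
        treatment_term(col, base) for col, base in reference_levels.items()
    )


# Hypothetical columns and reference levels, for illustration only.
formula = build_formula({"gender": "female", "stage": "I"})
print(formula)
```

The resulting string would then be passed as formula=formula in the cph.fit(...) call shown earlier.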

anyang-kevin commented 1 year ago
  1. My formulaic version is 0.2.4.
  2. gender is a column name in the dataframe, not a variable. So if I use C(gender,contr.treatment(...)), I receive an error message: formulaic.errors.FactorEvaluationError: Unable to evaluate factor C(gender, contr.treatment(base='female')). [NameError: name 'gender' is not defined]. I guess those aren't the core issues.

If you need to check my package files, tell me which package you need and your email address; maybe that will help you find the cause directly?

CamDavidsonPilon commented 1 year ago

Try upgrading formulaic to 0.5.2: pip install formulaic==0.5.2

anyang-kevin commented 1 year ago

Try upgrading formulaic to 0.5.2: pip install formulaic==0.5.2

Sorry, my Python version is 3.7.0, so I can't update to 0.5.2; it needs Python >= 3.7.2, and I have to keep my Python version until my project is finished. Although I know how to fix it, I think it's high-risk to change versions or to keep two Python versions on a Windows system.

CamDavidsonPilon commented 1 year ago

You'll have to use pandas to manipulate the dataframe prior to providing it to .fit then. Ex: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
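A minimal sketch of that workaround: one-hot encode the column with pandas.get_dummies, then drop the indicator for the level that should act as the reference. The data values are made up, and `dummy_code` and `fit_cox` are hypothetical helper names. The lifelines import sits inside `fit_cox` so the encoding step runs with pandas alone.

```python
import pandas as pd

# Toy data; column names follow the thread, values are made up.
df = pd.DataFrame({
    "OS.time": [5, 8, 12, 3, 9, 7],
    "OS": [1, 0, 1, 1, 0, 1],
    "gender": ["male", "female", "female", "male", "female", "male"],
})


def dummy_code(frame, column, reference):
    """One-hot encode `column`, then drop the indicator for `reference`."""
    dummies = pd.get_dummies(frame[column], prefix=column).astype(int)
    dummies = dummies.drop(columns=f"{column}_{reference}")
    return pd.concat([frame.drop(columns=column), dummies], axis=1)


encoded = dummy_code(df, "gender", reference="female")
print(encoded.columns.tolist())


def fit_cox(frame):
    """Fit a Cox model on the encoded frame (needs lifelines installed)."""
    from lifelines import CoxPHFitter
    return CoxPHFitter().fit(frame, duration_col="OS.time", event_col="OS")
```

With female dropped, the coefficient on gender_male is the log hazard ratio of male relative to female. No formula is needed, so the formulaic version should not matter for this route.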

anyang-kevin commented 1 year ago

You'll have to use pandas to manipulate the dataframe prior to providing it to .fit then. Ex: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html

Actually, that's what I wanted to know once this problem is solved. If I use one-hot encoding or another method to process a categorical variable, will the Cox result lose the control group? For example, for gender with male as 0 and female as 1, the Cox model gets an HR for both male and female, but with a categorical variable, male is the control and only female has an HR. Some literature uses the one-hot method, others the control-treatment method. Which is the best way to process a categorical variable? Or do they give the same result? Another problem with one-hot coding is that the values 0, 1, 2, ... are treated as different magnitudes in Cox. If a categorical variable has more than two levels, might level 5 mean 5 times the effect of level 1, even though they may have the same weight in the model? Is such an influence acceptable? Or does it not create a problem?
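The three encodings being compared here can be laid side by side in a small pandas sketch (the `stage` variable and its levels are made up). The "level 5 means 5 times the effect" concern applies to integer codes, where one coefficient is shared across levels. It does not apply to indicator columns, where each level gets its own coefficient. Full one-hot indicators always sum to 1 per row, and a constant shift of the linear predictor cancels in the Cox partial likelihood, so one level has to be dropped. Dropping it is exactly treatment (control vs. others) coding.

```python
import pandas as pd

stage = pd.Series(["I", "II", "III", "II", "I", "III"], name="stage")  # made-up levels

# (a) Integer codes: a single column, so one coefficient b is shared and the
#     model implies log-hazard shifts of 0, b, 2b for levels I, II, III.
codes = stage.astype("category").cat.codes

# (b) Full one-hot: one indicator per level; every row sums to exactly 1,
#     so the three coefficients are not separately identifiable in a Cox model.
one_hot = pd.get_dummies(stage, prefix="stage").astype(int)

# (c) Treatment coding: full one-hot minus the reference level (here "I").
#     Each remaining coefficient is a log hazard ratio versus stage I.
treatment = one_hot.drop(columns="stage_I")

print(codes.tolist())
print(one_hot.sum(axis=1).tolist())
print(treatment.columns.tolist())
```

Under these assumptions, "one-hot with one level dropped" and "control-treatment" are the same design. The choice of dropped level changes which comparison each HR expresses, not the fitted model.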

Cryptojoyz commented 1 year ago

You'll have to use pandas to manipulate the dataframe prior to providing it to .fit then. Ex: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html

Could you tell me: if I am using a formula, do I no longer need pandas for dummy-variable conversion, and only need to convert the pandas column type to 'category'? Will lifelines handle the categorical variables automatically?

dcstang commented 2 months ago

@Cryptojoyz It should work. I am using the pandas method .astype() to specify some columns as categories. Currently doing a CoxPH regression with a mix of continuous and categorical variables, which works nicely.
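A sketch of that approach, which also controls the reference level by ordering the categories explicitly. This assumes formulaic treats the first category as the reference under its default treatment coding, which is worth confirming against the fitted model's summary. The data values are made up and far too few for a meaningful fit, and `fit_with_formula` is a hypothetical helper that needs lifelines installed.

```python
import pandas as pd

# Toy data; column names follow the thread, values are made up.
df = pd.DataFrame({
    "OS.time": [5, 8, 12, 3, 9, 7],
    "OS": [1, 0, 1, 1, 0, 1],
    "gender": ["male", "female", "female", "male", "female", "male"],
})

# Declare the column as categorical with an explicit level order, putting
# the intended reference level ("female") first.
df["gender"] = pd.Categorical(df["gender"], categories=["female", "male"])
print(df["gender"].cat.categories.tolist())


def fit_with_formula(frame):
    """Fit a Cox model, letting the formula machinery encode `gender`."""
    from lifelines import CoxPHFitter
    return CoxPHFitter().fit(
        frame, duration_col="OS.time", event_col="OS", formula="gender"
    )
```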