Closed jaadeoye closed 2 years ago
Pycox code - Python
import numpy as np import matplotlib.pyplot as plt from sklearn.preprocessing import StandardScaler from sklearnpandas import DataFrameMapper import torch import torchtuples as tt from pycox.models import CoxPH from pycox.evaluation import EvalSurv np.random.seed(1234) = torch.manual_seed(123) import os import pandas as pd df_train = pd.read_csv('/Users/jaadeoye/Desktop/MTP Prediction/Time-to-event/MTP_surv.csv') df_val = df_train.sample(frac=0.2) df_test = pd.read_csv('/Users/jaadeoye/Desktop/MTP Prediction/Time-to-event/MTP_ext.csv') cols_leave = ['P2', 'P3', 'P4','P5','P7','P8','P9','P10','P11','P12','P13','P14', 'P15','P16','P17','P18','P19','P20','P21','P22','P23','P24', 'P25','P26',] leave = [(col, None) for col in cols_leave] x_mapper = DataFrameMapper(leave) x_train = x_mapper.fit_transform(df_train).astype('float32') x_val = x_mapper.transform(df_val).astype('float32') x_test = x_mapper.transform(df_test).astype('float32') get_target = lambda df: (df['duration'].values, df['event'].values) y_train = get_target(df_train) y_val = get_target(df_val) durations_test, events_test = get_target(df_test) val = x_val, y_val in_features = x_train.shape[1] num_nodes = [64, 64, 64] out_features = 1 batch_norm = True dropout = 0.4 output_bias = False net = tt.practical.MLPVanilla(in_features, num_nodes, out_features, batch_norm,dropout, output_bias=output_bias) model = CoxPH(net, tt.optim.Adam) batch_size = 64 lrfinder = model.lr_finder(x_train, y_train, batchsize, tolerance=10) = lrfinder.plot() lrfinder.get_best_lr() model.optimizer.set_lr(0.01) epochs = 512 callbacks = [tt.callbacks.EarlyStopping()] verbose = True log = model.fit(x_train, y_train, batch_size, epochs, callbacks, verbose, val_data=val, val_batch_size=batchsize) = log.plot() model.partial_loglikelihood(*val).mean() = model.compute_baseline_hazards() model.predict_surv_df(x_test)
Survivalmodels code in R
trainset = read.csv("/Users/jaadeoye/Desktop/MTP Prediction/Time-to-event/MTPsurv.csv") testset = read.csv("/Users/jaadeoye/Desktop/MTP Prediction/Time-to-event/MTPext.csv")
if (requireNamespaces("reticulate")) {
set_seed(1234) data = trainset fit = deepsurv(data = trainset)
mtp = deepsurv(data = trainset, activation = "relu", frac = 0.2, num_nodes = c(64L, 64L, 64L), x=NULL, y=NULL, dropout = 0.4, early_stopping = TRUE, epochs = 512L, batch_size = 64L, batch_norm = TRUE, verbose = TRUE, optimizer = "adam", learning_rate = 0.01, shuffle = FALSE) d_pred = predict(fit, newdata = testset, type = "survival") d_pred1 = predict(fit, newdata = testset, type = "all") cindex(d_pred1$risk, testset$time) }
Hey can you possible post your results so I can see what you mean by 'very different'? I've also just noticed a difference between seeds. If you use
set_seed(1234)
in {survivalmodels} then you're running:
In R:
set.seed(1234)
and in Python:
np.random.seed(1234)
torch.manual_seed(1234)
But in your code:
np.random.seed(1234)
_ = torch.manual_seed(123)
Thanks a lot for your response.
I mean the predicted survival probabilities for the test data are very different for the pycox and survivalmodels implementation for my data. Such that those that were predicted (correctly) as having lower probabilities of survival in {pycox} had a higher probability of survival in {survivalmodels}.
For instance, predicted survival probabilities at 60 months for both models on the same patients in test data
Patient pycox survivalmodels 44 0.433 0.921 107 0.424 0.945 110 0.442 0.899
Also, the c-indices are different. For pycox - 0.95, for survival models 0.45
Kindly advise.
Thanks again.
Ah that's very interesting! Unlikely to be a seed problem then. Happy to help debug but unfortunately your example isn't reproducible for me. What I'd need is either the data you're using (if you can share it) or an open-source dataset with equivalent code in Python and R - I don't have time to create matching scripts myself but will happily debug if you can provide these to me
Thanks, this will take me a while to debug and I'll post updates here as I go along. First updates:
net
is identical in both casesPositive update. I ran the example in the pycox package (https://nbviewer.jupyter.org/github/havakv/pycox/blob/master/examples/cox-ph.ipynb) in Python, in {survivalmodels} directly and then with {mlr3proba}. Even without setting seeds, the results are the same:
From Python
>>> ev.concordance_td()
0.6539455839972876
>>> ev.integrated_brier_score(time_grid)
0.16515071994696498
From R
> p$score(msrs(c("surv.cindex", "surv.graf")))
surv.harrell_c surv.graf
0.6164485 0.1639498
This makes me strongly suspect that there is not a bug in coercion between Python/R but that there may be missing parameters or parameters not being set properly etc., that caused your differences above. Will take a look at this later in the week but until then would appreciate if you can triple check all params are actually the same
Thanks a lot for your help Raphael.
Will look at the parameters again thoroughly to be sure they're one and the same.
Best regards
@jaadeoye I'm closing this for now but please reopen if you think it's still a problem
Hello!
Thank you again for the R implementation of the pycox models.
I am trying to reproduce an existing deepsurv model in the pycox on python in R with survivalmodels because it will be easier to use this with R Shiny but unfortunately the prediction results are very different even with the same set.seed. Also, I do not want to use the source_python function on reticulate as this makes the process more taxing?
Please is there any suggestion on a way around this for now?
PS: Please note that I used the exact parameters as the pycox deepsurv model in python.
Thanks again as I await your suggestion