Novartis / torchsurv

Deep survival analysis made easy
https://opensource.nibr.com/torchsurv/
MIT License

Deep learning under log-likelihood loss does not converge easily. #40

Open Minxiangliu opened 2 days ago

Minxiangliu commented 2 days ago

Hi there, thank you for the tool. I'm currently training on 3D brain medical images using a DenseNet121 model with the neg_partial_log_likelihood loss function. Training itself runs normally, but when I look at the training and validation loss values, the validation loss does not converge effectively and even trends upward. I have tried many ways to reduce overfitting, including data augmentation, dropout, and oversampling, but the validation loss still will not converge. Is this a problem with the data itself, or have I overlooked some detail?

A total of 189 3D images

[Screenshots: training and validation loss curves]

tcoroller commented 2 days ago

Dear @Minxiangliu, thank you for your question. At first glance, this looks like an experimental issue (overfitting) rather than a TorchSurv loss function issue.

A couple of comments that may help you:

  1. Pick the right loss: with a Cox model, the network is optimized by ranking samples within a batch, so the larger the batch, the more reliable the loss estimate. With medical images, however, GPU memory limits can make it hard to use a batch size greater than 8 samples. Two suggestions: you can instead try the Weibull model, which does not rank samples but fits a distribution to each one, removing the batch-size dependency, which is good for you; or, if you want to keep the Cox model, try our Momentum loss, which uses two networks (online and target) alongside a dynamic memory bank (like MoCo for contrastive learning). We have a tutorial notebook here. Minimal sketches of both options follow this list.
  2. Data / inputs: you are using 189 3D images to predict a time-to-event target, which may be too few samples for the task. Try to reduce the covariate size (e.g., reduce the image dimensions or use embeddings; see the sketch after this list). What is your target? How many censored/non-censored patients do you have? Are there other covariates (e.g., clinical) you could use to help the model? Is there literature on the topic that has shown signal?
  3. Modeling parameters: what is your training/validation split? What is your batch size? And your learning rate (the loss profile looks "jumpy")? Your training loss is decreasing close to 0, so the model is learning something, but it is indeed very much overfitting. There are plenty of external resources on overfitting that you can read and learn from.
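
To make the Weibull option concrete, here is a minimal sketch. The weibull.neg_log_likelihood call and the two-column log_params layout follow the TorchSurv docs; the tiny backbone and the random batch are placeholders for your DenseNet121 and data loader:

```python
import torch
from torchsurv.loss import weibull

torch.manual_seed(0)

# Placeholder backbone: in practice, replace DenseNet121's classification
# head with a 2-unit linear layer producing (log scale, log shape) per sample.
backbone = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(4 * 4 * 4, 2),
)
optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-4)

# Dummy batch standing in for your volumes, event indicators, and times.
x = torch.randn(8, 1, 4, 4, 4)
event = torch.randint(0, 2, (8,)).bool()
time = torch.rand(8) * 100

log_params = backbone(x)  # shape (batch, 2): log scale and log shape
# Each sample is fit to its own distribution, so the loss estimate does not
# depend on within-batch ranking the way the Cox partial likelihood does.
loss = weibull.neg_log_likelihood(log_params, event, time)
loss.backward()
optimizer.step()
```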
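
For the Momentum option, a rough sketch of the wrapper is below. The constructor arguments follow my reading of the Momentum docstring (batchsize, n, rate), so please verify the exact signature against the tutorial notebook; the backbone and hyperparameter values are illustrative:

```python
import torch
from torchsurv.loss import cox
from torchsurv.loss.momentum import Momentum

torch.manual_seed(0)

# Placeholder backbone mapping a flattened volume to a single log hazard.
backbone = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(4 * 4 * 4, 1),
)

# Momentum keeps an online network, an EMA target network, and a memory bank
# of past estimates, so the Cox partial likelihood is evaluated over a much
# larger pool than the GPU batch. Hyperparameter values here are illustrative.
model = Momentum(
    backbone=backbone,
    loss=cox.neg_partial_log_likelihood,
    batchsize=8,   # your actual GPU batch size
    n=4,           # number of batches kept in the memory bank
    rate=0.999,    # EMA decay for the target network
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

x = torch.randn(8, 1, 4, 4, 4)
event = torch.randint(0, 2, (8,)).bool()
time = torch.rand(8) * 100

loss = model(x, event, time)  # partial likelihood against the memory bank
loss.backward()
optimizer.step()
```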
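
On reducing the covariate size (point 2), one common pattern, sketched below with placeholder names and dimensions rather than anything TorchSurv prescribes, is to freeze a pretrained encoder, cache its embeddings once, and train only a small survival head on top:

```python
import torch
from torchsurv.loss import cox

torch.manual_seed(0)

# Placeholder for a frozen pretrained 3D encoder (e.g., a network with its
# classifier removed); any fixed feature extractor fits this pattern.
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(4 * 4 * 4, 32))
encoder.requires_grad_(False)
encoder.eval()

# Cache the embeddings once: 189 volumes easily fit in memory, and the
# problem becomes fitting a tiny head on 32 features instead of a 3D CNN.
volumes = torch.randn(189, 1, 4, 4, 4)  # stand-in for your images
with torch.no_grad():
    embeddings = encoder(volumes)  # shape (189, 32)

event = torch.randint(0, 2, (189,)).bool()
time = torch.rand(189) * 100

head = torch.nn.Linear(32, 1)  # log-hazard head: the only trained part
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3, weight_decay=1e-4)

log_hz = head(embeddings).squeeze(-1)  # shape (189,)
loss = cox.neg_partial_log_likelihood(log_hz, event, time)
loss.backward()
optimizer.step()
```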

Good luck with your project, and thank you for using TorchSurv!