LAMDA-NJU / Deep-Forest

An Efficient, Scalable and Optimized Python Framework for Deep Forest (2021.2.1)
https://deep-forest.readthedocs.io

Survival models #71

Closed yunwezhang closed 3 years ago

yunwezhang commented 3 years ago

Hi maintainer,

I am wondering whether it is possible to cascade a random survival forest (perhaps an sksurv model) instead of RF in your deep forest model. As in #48, it seems that the supported model types are classification and regression. (Or did I miss some part of the tutorial docs?)

Thanks.

xuyxu commented 3 years ago

Hi @yunwezhang, after walking through the random survival forest example in sksurv, I think the biggest problem with using deep forest in survival analysis tasks is how to design good augmented features. In survival analysis, our main concern is the survival prediction function that takes a time step t as the input, right? For now, I cannot figure out how to fit this into the cascade structure of deep forest.

Since we are not very familiar with survival analysis, your suggestions would be very welcome ;-)

EDIT: We are happy to work on this feature request if this is achievable.

yunwezhang commented 3 years ago

Hi Yixuan,

Yes, you are right about the time steps. The training input of a survival model requires a 2-dimensional outcome (time plus a binary status, where the status indicates whether the observation is censored or not), but the output is usually a 1-dimensional vector, either a "risk" or a "probability" (as in binary classification).
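The outcome layout described above can be sketched as a NumPy structured array with one boolean event field and one float time field. This is only an illustration: the helper name `make_survival_target` and the field names `event` and `time` are assumptions made here, and the exact requirements of scikit-survival should be checked in its own documentation.

```python
import numpy as np


def make_survival_target(event, time):
    """Pack event indicators and observed times into one structured array.

    Hypothetical helper for illustration; field names are assumptions.
    """
    event = np.asarray(event, dtype=bool)
    time = np.asarray(time, dtype=float)
    if event.shape != time.shape:
        raise ValueError("event and time must have the same shape")
    # One record per subject: (was the event observed?, observed time)
    y = np.empty(event.shape[0], dtype=[("event", bool), ("time", float)])
    y["event"] = event
    y["time"] = time
    return y


# Three subjects; the second one is censored (event not observed) at t = 8.0.
y_train = make_survival_target([True, False, True], [5.0, 8.0, 2.5])
```

The feature matrix X stays an ordinary (n by p) array; only the response changes from a single column to this two-field record.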

As for the augmented feature steps, I assume you are talking about this part of the model structure? [image] Does it correspond to this part of the paper? [image] If I understand correctly, in the cascade forest part, the augmented features (in-model feature transformation) obtained from each forest are the predicted vectors, which can also be obtained from a survival forest (the output survival probability). However, I am not clear about the part shown in the attached picture. (I think the 2019 paper has it because it works better for image data...)

Thank you for looking into it. I am not sure how hard it is to add the random survival model, but I am happy to chat with you to see how it goes. In summary, the input data needs to change to X (n by p) and y (both time and status), and the output is a probability vector (could be survival risk, 1-year survival probability, 2-year survival probability, etc.) 😊
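To make the "1-year, 2-year survival probability" output concrete, here is a rough sketch that reads fixed-horizon probabilities off step-shaped survival curves. It assumes the curves are already available as an (n, m) array on a shared, increasing time grid; the function name `survival_at_horizons` and the toy numbers are hypothetical and do not come from any library.

```python
import numpy as np


def survival_at_horizons(times, surv, horizons):
    """Evaluate right-continuous step survival curves at given horizons.

    times    : (m,) increasing time grid where the curves change value
    surv     : (n, m) survival probabilities, surv[i, j] = S_i(times[j])
    horizons : (k,) time points of interest, e.g. 1 and 2 years
    returns  : (n, k) array; S(t) is taken as 1 before the first grid point
    """
    times = np.asarray(times, dtype=float)
    surv = np.asarray(surv, dtype=float)
    horizons = np.asarray(horizons, dtype=float)
    # Index of the last grid point <= horizon; -1 means "before the grid".
    idx = np.searchsorted(times, horizons, side="right") - 1
    # Prepend a column of ones so that index -1 maps to S = 1.
    padded = np.hstack([np.ones((surv.shape[0], 1)), surv])
    return padded[:, idx + 1]


times = [0.5, 1.0, 1.5, 2.0]            # years
surv = [[0.9, 0.8, 0.6, 0.5],           # subject 0
        [1.0, 0.7, 0.7, 0.2]]           # subject 1
features = survival_at_horizons(times, surv, horizons=[1.0, 2.0])
```

Each row of `features` is then a short probability vector per subject, which is the kind of 1-dimensional-per-horizon output mentioned in the summary.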

xuyxu commented 3 years ago

Thanks for your kind explanations @yunwezhang.

> As for the augmented feature steps, I assume you are talking about this part in the model structure? Is this part corresponding to this part in the paper?

No, the second figure you posted shows the multi-grained scanning part, which is not included in this package, since tree ensembles are typically not the best choice for structured data such as images or audio. Augmented features are part of the input to the hidden cascade layers: for classification, they are the predicted class probabilities; for regression, they are the predicted target values.
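As a rough illustration of augmented features, the sketch below joins the raw feature matrix with the outputs of the previous layer's estimators to form the input of the next cascade layer. It is not the package's internal code: the function name `build_layer_input` is hypothetical, the arrays are placeholders, and concatenating with the raw features follows the cascade description in the deep forest paper.

```python
import numpy as np


def build_layer_input(X, estimator_outputs):
    """Concatenate raw features with the previous layer's predictions.

    X                 : (n, p) raw feature matrix
    estimator_outputs : list of arrays, each (n, c) class probabilities
                        or (n,) regression predictions
    """
    X = np.asarray(X, dtype=float)
    columns = [X]
    for out in estimator_outputs:
        out = np.asarray(out, dtype=float)
        if out.ndim == 1:
            # Regression: one predicted value per sample becomes one column.
            out = out.reshape(-1, 1)
        if out.shape[0] != X.shape[0]:
            raise ValueError("each output needs one row per sample")
        columns.append(out)
    return np.hstack(columns)


X = np.zeros((4, 3))                    # 4 samples, 3 raw features
proba_forest_a = np.full((4, 2), 0.5)   # placeholder class probabilities
proba_forest_b = np.full((4, 2), 0.5)
X_next = build_layer_input(X, [proba_forest_a, proba_forest_b])
```

For a survival variant, the same slot could in principle hold risk scores or survival probabilities at fixed time points instead of class probabilities.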

Here are three questions that I would like to ask further.

yunwezhang commented 3 years ago

Hi Yixuan,

Thanks for the fast reply. I am aware that multi-grained scanning is not included, and that is why I asked why you have that part (first figure) in your model structure instead of starting from the cascade forest.

Answers to the further questions:

  1. To me, yes, RSF. Some DNN-based survival models are also considered state-of-the-art. (But RSF has had the best performance in general across the several datasets I tried.)
  2. Yes, only the input part needs to be modified (not the feature matrix X_train but the response y_train), because the output of the RSF is 1-dimensional: the probability.
  3. I think both the risk score and the predicted probability can be used as augmented features. (My experience with RSF is in R, but both packages are based on the same paper, so in theory there should be no difference.) From my reading of that package, though, the predict function does not return the survival probability.

xuyxu commented 3 years ago

> Thanks for the fast reply. I am aware that the multi-grain scanning is not included and that's why I asked why do you have the part (first figure) in your model structure instead of starting from the cascade forest.

The binner in that figure is used to reduce the number of splitting candidates for the sake of acceleration (not used in the original deep forest model). The entire architecture does correspond to the cascade forest structure.
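To show why binning cuts down the number of splitting candidates, here is a generic quantile-binning sketch. It is an assumption-laden illustration rather than the package's actual binner: the function names, the quantile-based thresholds, and the choice of `uint8` bin indices are all made up for this example.

```python
import numpy as np


def fit_bin_edges(X, n_bins=255):
    """Compute per-feature interior quantile thresholds (at most n_bins - 1)."""
    X = np.asarray(X, dtype=float)
    quantiles = np.linspace(0, 100, n_bins + 1)[1:-1]  # interior cut points
    # np.unique drops repeated thresholds when a feature has few distinct values.
    return [np.unique(np.percentile(X[:, j], quantiles))
            for j in range(X.shape[1])]


def transform_to_bins(X, edges):
    """Replace each value by the index of the bin it falls into."""
    X = np.asarray(X, dtype=float)
    binned = np.empty(X.shape, dtype=np.uint8)  # enough for up to 256 bins
    for j, feature_edges in enumerate(edges):
        binned[:, j] = np.searchsorted(feature_edges, X[:, j], side="left")
    return binned


rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
edges = fit_bin_edges(X, n_bins=16)
X_binned = transform_to_bins(X, edges)
```

After this transform a tree only has to consider the bin boundaries (15 per feature in this toy setup) as split points, instead of up to 999 midpoints between sorted raw values.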

Besides, I have opened a feature request in sksurv (link); deep forest could benefit from using a mixture of RandomSurvivalForest and ExtraSurvivalTrees in cascade layers. Let's wait for the response from the sksurv maintainers before formally working on this feature request ;-)

yunwezhang commented 3 years ago

Got it! Yes, let's wait for the reply. For that extra injection of randomness, it would be better to have ExtraSurvivalTrees.

xuyxu commented 3 years ago

Realizing that we can implement ExtraSurvivalTrees by importing sksurv as a soft dependency, I think we could work on this feature request without extra help from that community.
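A soft dependency is typically handled by importing the optional package only when the feature is used, and failing with a clear message otherwise. The helper below is a hypothetical sketch of that pattern using only the standard library; the name `optional_import` and the error wording are assumptions, not part of deep forest or sksurv.

```python
import importlib


def optional_import(module_name, feature):
    """Import an optional module lazily, with a helpful error if missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} requires the optional package '{module_name}'. "
            "Install it to enable this feature."
        ) from exc


# Hypothetical usage inside a survival-specific code path:
# sksurv_ensemble = optional_import("sksurv.ensemble", "Survival cascade layers")
```

With this pattern, users who only need classification or regression never have to install the survival package.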

> Thank you for looking into it and I am not sure how hard it is to add the random survival model. I am happy to chat with you to see how it goes. In summary, the change for the input data needs to be X (n by p), y (both time and status) and the output is probability vector (could be survival risk, 1 year survival probability, 2 year survival prob, etc.) 😊

If you are interested in extending deep forest to the field of survival analysis, could you contact me by e-mail (Address), so that we can discuss it further before opening a draft PR on this feature? ;-)

xuyxu commented 3 years ago

Closed via #14.