dreamquark-ai / tabnet

PyTorch implementation of TabNet paper : https://arxiv.org/pdf/1908.07442.pdf
https://dreamquark-ai.github.io/tabnet/
MIT License
2.55k stars 470 forks

Semi-Supervised Framework for Regression #471

Closed zjgbz closed 1 year ago

zjgbz commented 1 year ago

Feature request

What is the expected behavior? I would like to incorporate information from unlabeled data into a regression task via semi-supervised learning. For example, I have one dataset with a continuous outcome and another dataset without any outcome; the two come from different "domains". I want to use the information from the unlabeled dataset when training TabNet on the labeled one.

What is motivation or use case for adding/changing the behavior? In computational biology there are many regression tasks (continuous outcomes, e.g., physiological indicators such as blood cell counts or estimated glomerular filtration rate) and a tremendous amount of data without outcomes. I would like to add this feature so that this unlabeled information can be used for regression.

How should this be implemented in your opinion? My first idea is to modify the TabNet classification function by directly removing the last activation function of the whole architecture. But I am not sure, so may I have your suggestions?

Are you willing to work on this yourself? Yes, but I need some suggestions, such as whether this is reasonable or worth trying from a technical perspective.

Optimox commented 1 year ago

The TabNetPretrainer can be used equally well for a classification or a regression task.

I don't understand what is preventing you from using it on unlabeled data and then training a regressor on the labeled data?

zjgbz commented 1 year ago

> The TabNetPretrainer can be used equally well for a classification or a regression task.
>
> I don't understand what is preventing you from using it on unlabeled data and then training a regressor on the labeled data?

Hi @Optimox, I might have misunderstood something. Here is the semi-supervised learning framework I learned from your documentation:

# TabNetPretrainer
import torch
from pytorch_tabnet.pretraining import TabNetPretrainer
from pytorch_tabnet.tab_model import TabNetClassifier

unsupervised_model = TabNetPretrainer(
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=2e-2),
    mask_type='entmax'  # or "sparsemax"
)

unsupervised_model.fit(
    X_train=X_train,
    eval_set=[X_valid],
    pretraining_ratio=0.8,
)

clf = TabNetClassifier(
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=2e-2),
    scheduler_params={"step_size":10,  # learning-rate scheduler settings
                      "gamma":0.9},
    scheduler_fn=torch.optim.lr_scheduler.StepLR,
    mask_type='sparsemax'  # overwritten when loading from a pretrained model
)

clf.fit(
    X_train=X_train, y_train=y_train,
    eval_set=[(X_train, y_train), (X_valid, y_valid)],
    eval_name=['train', 'valid'],
    eval_metric=['auc'],
    from_unsupervised=unsupervised_model
)

So, do you mean that this framework can be directly applied to single- or multi-task regression as below?

# TabNetPretrainer
import torch
from pytorch_tabnet.pretraining import TabNetPretrainer
from pytorch_tabnet.tab_model import TabNetRegressor

unsupervised_model = TabNetPretrainer(
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=2e-2),
    mask_type='entmax'  # or "sparsemax"
)

unsupervised_model.fit(
    X_train=X_train,
    eval_set=[X_valid],
    pretraining_ratio=0.8,
)

clf = TabNetRegressor(
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=2e-2),
    scheduler_params={"step_size":10,  # learning-rate scheduler settings
                      "gamma":0.9},
    scheduler_fn=torch.optim.lr_scheduler.StepLR,
    mask_type='sparsemax'  # overwritten when loading from a pretrained model
)

clf.fit(
    X_train=X_train, y_train=y_train,  # y_train must be 2D: (n_samples, n_tasks)
    eval_set=[(X_train, y_train), (X_valid, y_valid)],
    eval_name=['train', 'valid'],
    eval_metric=['rmse'],  # or a customized metric
    loss_fn=torch.nn.functional.mse_loss,  # a callable torch loss, not a string
    from_unsupervised=unsupervised_model
)
Is this correct? Thank you very much.

Best regards,
zjgbz

Optimox commented 1 year ago

Yes, that is correct.