dreamquark-ai / tabnet

PyTorch implementation of TabNet paper : https://arxiv.org/pdf/1908.07442.pdf
https://dreamquark-ai.github.io/tabnet/
MIT License

How do the hyperparameters affect TabNet size? #481

Closed M-R-T-U-D closed 1 year ago

M-R-T-U-D commented 1 year ago

Hello there,

I am trying to conduct an experiment to research how dividing TabNet affects the quality and size of its latent dimension. More specifically, I have 2 clients running, each with 1 TabNetPretrainer instance. Both clients use the same hyperparameters but differ in the section of the dataset they handle: in the case of 2 clients, I divide the dataset into two uniform subsets, each assigned to one client. The division happens vertically, i.e. the dataset is split column-wise, so each client has the same samples but a different subset of features.

My goal is to see how 2 smaller TabNetPretrainers differ in the latent representations they learn compared to one TabNetPretrainer trained on the whole dataset. I must keep everything identical/consistent in both experiments to make sure only the split of the data affects the performance of the smaller TabNetPretrainers. I modified the TabNetPretrainer code to disable shuffling in the dataloaders; other than that, everything is kept the same.
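For concreteness, here is a minimal sketch of how such a setup could look, assuming a plain numpy feature matrix. The variable names (`X`, `X_a`, `X_b`, `shared_params`) and all hyperparameter values are illustrative, and the dataloader-shuffling modification mentioned above is not shown since it requires editing the library code.

```python
# Illustrative sketch only: vertical (column-wise) split of one dataset between two
# clients, each training its own TabNetPretrainer with identical hyperparameters.
import numpy as np
import torch
from pytorch_tabnet.pretraining import TabNetPretrainer

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20)).astype(np.float32)  # stand-in for the full dataset

# Same samples for both clients, disjoint feature subsets (vertical split).
X_a, X_b = X[:, :10], X[:, 10:]

# Identical hyperparameters for both clients (placeholder values).
shared_params = dict(
    n_d=50, n_a=50, n_steps=3,
    n_independent=2, n_shared=2,
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=2e-2),
    seed=0,
)

pretrainer_a = TabNetPretrainer(**shared_params)
pretrainer_b = TabNetPretrainer(**shared_params)

pretrainer_a.fit(X_train=X_a, pretraining_ratio=0.8, max_epochs=5, batch_size=1024)
pretrainer_b.fit(X_train=X_b, pretraining_ratio=0.8, max_epochs=5, batch_size=1024)
```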

I noticed that halving every hyperparameter does not result in half of the total number of parameters (size) of the TabNet model. I also know that n_steps, n_shared, n_independent, n_shared_decoder, n_indep_decoder, n_d and n_a determine the size of the whole model.

My question is: to keep things consistent, I need to match the combined total size of the smaller TabNetPretrainers to that of the single whole TabNetPretrainer model. How can I achieve this by modifying the hyperparameters?

Also note that I need to keep the aggregated latent dimension constant. So if the single TabNetPretrainer uses n_d=100, then each of the two local TabNetPretrainers should use n_d=50.
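One way to check the size matching empirically is to build each model and count the parameters of the underlying torch module. This is only a sketch, assuming the network is created during `fit()` and exposed as `model.network`, as in recent pytorch-tabnet versions; the hyperparameter values are placeholders.

```python
# Compare total parameter counts of a "full" pretrainer vs. one "half" pretrainer.
import numpy as np
from pytorch_tabnet.pretraining import TabNetPretrainer

def count_params(model):
    # Total trainable parameters of the underlying torch module (built during fit()).
    return sum(p.numel() for p in model.network.parameters() if p.requires_grad)

rng = np.random.default_rng(0)
X_full = rng.normal(size=(2_000, 20)).astype(np.float32)
X_half = X_full[:, :10]  # one client's feature subset

full = TabNetPretrainer(n_d=100, n_a=100, n_steps=3, n_independent=2, n_shared=2)
half = TabNetPretrainer(n_d=50, n_a=50, n_steps=3, n_independent=2, n_shared=2)

# A single short epoch is enough to build the network so parameters can be counted.
full.fit(X_train=X_full, pretraining_ratio=0.8, max_epochs=1, batch_size=512)
half.fit(X_train=X_half, pretraining_ratio=0.8, max_epochs=1, batch_size=512)

print("full model:", count_params(full), "parameters")
print("half model:", count_params(half), "parameters")  # generally not half of the above
```

In practice, sweeping over candidate halved configurations and picking the one whose combined count comes closest to the full model's count is probably the most direct way to match total sizes.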

I have conducted many experiments so far and, strangely enough, the aggregated latent representation of the two TabNetPretrainers performs better than the latent vector of the single TabNetPretrainer trained on the whole dataset. I suspect this is due to the difference in model size and to the learning capacities of the models being different in the two settings. Is there a way to keep the learning capacity the same in both experiments? If so, how can that be achieved? If not, is it even possible to make the learning capacities consistent between the two settings?
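As an illustration of the comparison being described, here is a hedged sketch of how the latent representations could be extracted and aggregated. It relies on the internal `network.embedder` / `network.encoder` attributes of the pretraining network (which, in recent pytorch-tabnet versions, return a list of per-step outputs whose sum is the usual TabNet representation); `X`, `X_a`, `X_b`, `pretrainer_a` and `pretrainer_b` are carried over from the earlier sketch, and `pretrainer_full` is an illustrative single model trained on all features.

```python
# Extract latent representations from the two client pretrainers, concatenate them,
# and compare against the latent of a single pretrainer trained on all features.
# Reuses X, X_a, X_b, pretrainer_a, pretrainer_b from the earlier sketch.
import numpy as np
import torch
from pytorch_tabnet.pretraining import TabNetPretrainer

def latent(pretrainer, features):
    net = pretrainer.network
    net.eval()
    with torch.no_grad():
        x = torch.from_numpy(features).to(next(net.parameters()).device)
        steps_out, _ = net.encoder(net.embedder(x))  # list of (n_samples, n_d) tensors
        return torch.stack(steps_out, dim=0).sum(dim=0).cpu().numpy()

# Single pretrainer on all features with n_d=100, for comparison (illustrative values).
pretrainer_full = TabNetPretrainer(n_d=100, n_a=100, n_steps=3, n_independent=2, n_shared=2, seed=0)
pretrainer_full.fit(X_train=X, pretraining_ratio=0.8, max_epochs=5, batch_size=1024)

z_a = latent(pretrainer_a, X_a)               # latent of client A, n_d = 50
z_b = latent(pretrainer_b, X_b)               # latent of client B, n_d = 50
z_split = np.concatenate([z_a, z_b], axis=1)  # aggregated latent, dimension 100
z_full = latent(pretrainer_full, X)           # single-model latent, dimension 100
# z_split and z_full can now be fed to the same downstream model for the comparison.
```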

Optimox commented 1 year ago

Well, this looks like a complicated question without a definitive answer.

Your experiment will depend on multiple things:

The first two points make things experiment-specific, so it will be hard to derive a generic behavior; the last one is entirely up to you, and trying different settings and reporting the results would probably be the best thing to do.

As for the number of parameters, it does not depend linearly on each hyperparameter, so it is not obvious how to update them; moreover, the number of parameters also depends on the number of input features.
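To see where those parameters actually sit, one option is to break the count down by submodule. This is a diagnostic sketch only, assuming the pretraining network exposes `embedder`, `encoder` and `decoder` submodules as in recent pytorch-tabnet versions, and reusing `full` and `half` from the earlier counting sketch.

```python
# Per-submodule parameter breakdown of a fitted TabNetPretrainer.
def param_breakdown(pretrainer):
    net = pretrainer.network  # built during fit()
    for name in ("embedder", "encoder", "decoder"):
        module = getattr(net, name, None)  # guard against attribute changes across versions
        if module is not None:
            print(f"{name}: {sum(p.numel() for p in module.parameters())} parameters")

param_breakdown(full)
param_breakdown(half)
```

Because the encoder and decoder depend on the number of input features as well as on n_d, n_a, n_steps, n_shared and n_independent, the total does not scale linearly with any single hyperparameter.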

About the learning capacities, I am not sure this is a scientifically sound concept; the entire point of your experiment is to assess the learning capacities of the different settings.

I'm afraid I can't help you much more on this one... But I would be happy to hear about your results.

M-R-T-U-D commented 1 year ago

Hi Optimox, your response helped me put into perspective what I should do for the experiments, so thanks a lot. I only have a couple of questions left:

Optimox commented 1 year ago
M-R-T-U-D commented 1 year ago

Alright, thanks for the clarifications, I will definitely take those points into consideration!