dreamquark-ai / tabnet

PyTorch implementation of TabNet paper : https://arxiv.org/pdf/1908.07442.pdf
https://dreamquark-ai.github.io/tabnet/
MIT License

How do the hyperparameters affect TabNet size? #481

Closed M-R-T-U-D closed 1 year ago

M-R-T-U-D commented 1 year ago

Hello there,

I am trying to conduct an experiment to research how dividing TabNet affects the quality and size of its latent dimension. More specifically, I have 2 clients running, each with 1 TabNetPretrainer instance. Both clients use the same hyperparameters but differ in the section of the dataset they handle: in the case of 2 clients, I divide the dataset into two uniform subsets, each assigned to one client. The division happens vertically, i.e. the dataset is split column-wise, so each client has the same samples but a different subset of features.

My goal is to see how 2 smaller TabNetPretrainers differ in the latent representations they learn compared to one TabNetPretrainer trained on the whole dataset. I must keep everything identical/consistent in both experiments to make sure only the split of the data affects the performance of the smaller TabNetPretrainers. I modified the TabNetPretrainer code to disable shuffling in the dataloaders; other than that, everything is kept the same.
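For concreteness, here is a minimal sketch of how such a setup could look, assuming a plain numpy feature matrix. The variable names (`X`, `X_a`, `X_b`, `shared_params`) and all hyperparameter values are illustrative, and the dataloader-shuffling modification mentioned above is not shown since it requires editing the library code.

```python
# Illustrative sketch only: vertical (column-wise) split of one dataset between two
# clients, each training its own TabNetPretrainer with identical hyperparameters.
import numpy as np
import torch
from pytorch_tabnet.pretraining import TabNetPretrainer

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20)).astype(np.float32)  # stand-in for the full dataset

# Same samples for both clients, disjoint feature subsets (vertical split).
X_a, X_b = X[:, :10], X[:, 10:]

# Identical hyperparameters for both clients (placeholder values).
shared_params = dict(
    n_d=50, n_a=50, n_steps=3,
    n_independent=2, n_shared=2,
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=2e-2),
    seed=0,
)

pretrainer_a = TabNetPretrainer(**shared_params)
pretrainer_b = TabNetPretrainer(**shared_params)

pretrainer_a.fit(X_train=X_a, pretraining_ratio=0.8, max_epochs=5, batch_size=1024)
pretrainer_b.fit(X_train=X_b, pretraining_ratio=0.8, max_epochs=5, batch_size=1024)
```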

I noticed that halving every hyperparameter does not result in half of the total number of parameters (size) of the TabNet model. I also know that n_steps, n_shared, n_independent, n_shared_decoder, n_indep_decoder, n_d and n_a determine the size of the whole model.

My question is: to keep things consistent, I need to match the combined total size of the smaller TabNetPretrainers to that of the single whole TabNetPretrainer model. How can I achieve this by modifying the hyperparameters?

Also note that I need to keep the aggregated latent dimension constant. So if the single TabNetPretrainer uses n_d=100, then each of the two local TabNetPretrainers should use n_d=50.
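One way to check the size matching empirically is to build each model and count the parameters of the underlying torch module. This is only a sketch, assuming the network is created during `fit()` and exposed as `model.network`, as in recent pytorch-tabnet versions; the hyperparameter values are placeholders.

```python
# Compare total parameter counts of a "full" pretrainer vs. one "half" pretrainer.
import numpy as np
from pytorch_tabnet.pretraining import TabNetPretrainer

def count_params(model):
    # Total trainable parameters of the underlying torch module (built during fit()).
    return sum(p.numel() for p in model.network.parameters() if p.requires_grad)

rng = np.random.default_rng(0)
X_full = rng.normal(size=(2_000, 20)).astype(np.float32)
X_half = X_full[:, :10]  # one client's feature subset

full = TabNetPretrainer(n_d=100, n_a=100, n_steps=3, n_independent=2, n_shared=2)
half = TabNetPretrainer(n_d=50, n_a=50, n_steps=3, n_independent=2, n_shared=2)

# A single short epoch is enough to build the network so parameters can be counted.
full.fit(X_train=X_full, pretraining_ratio=0.8, max_epochs=1, batch_size=512)
half.fit(X_train=X_half, pretraining_ratio=0.8, max_epochs=1, batch_size=512)

print("full model:", count_params(full), "parameters")
print("half model:", count_params(half), "parameters")  # generally not half of the above
```

In practice, sweeping over candidate halved configurations and picking the one whose combined count comes closest to the full model's count is probably the most direct way to match total sizes.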

I have conducted many experiments so far and, strangely enough, the aggregated latent representation of the two TabNetPretrainers performs better than the latent vector of the single TabNetPretrainer trained on the whole dataset. I suspect this is due to the difference in model size and to the learning capacities of the models being different in the two settings. Is there a way to keep the learning capacity the same in both experiments? If so, how can that be achieved? If not, is it even possible to make the learning capacities consistent between the two settings?
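As an illustration of the comparison being described, here is a hedged sketch of how the latent representations could be extracted and aggregated. It relies on the internal `network.embedder` / `network.encoder` attributes of the pretraining network (which, in recent pytorch-tabnet versions, return a list of per-step outputs whose sum is the usual TabNet representation); `X`, `X_a`, `X_b`, `pretrainer_a` and `pretrainer_b` are carried over from the earlier sketch, and `pretrainer_full` is an illustrative single model trained on all features.

```python
# Extract latent representations from the two client pretrainers, concatenate them,
# and compare against the latent of a single pretrainer trained on all features.
# Reuses X, X_a, X_b, pretrainer_a, pretrainer_b from the earlier sketch.
import numpy as np
import torch
from pytorch_tabnet.pretraining import TabNetPretrainer

def latent(pretrainer, features):
    net = pretrainer.network
    net.eval()
    with torch.no_grad():
        x = torch.from_numpy(features).to(next(net.parameters()).device)
        steps_out, _ = net.encoder(net.embedder(x))  # list of (n_samples, n_d) tensors
        return torch.stack(steps_out, dim=0).sum(dim=0).cpu().numpy()

# Single pretrainer on all features with n_d=100, for comparison (illustrative values).
pretrainer_full = TabNetPretrainer(n_d=100, n_a=100, n_steps=3, n_independent=2, n_shared=2, seed=0)
pretrainer_full.fit(X_train=X, pretraining_ratio=0.8, max_epochs=5, batch_size=1024)

z_a = latent(pretrainer_a, X_a)               # latent of client A, n_d = 50
z_b = latent(pretrainer_b, X_b)               # latent of client B, n_d = 50
z_split = np.concatenate([z_a, z_b], axis=1)  # aggregated latent, dimension 100
z_full = latent(pretrainer_full, X)           # single-model latent, dimension 100
# z_split and z_full can now be fed to the same downstream model for the comparison.
```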

Optimox commented 1 year ago

Well, this looks like a complicated question without a definitive answer.

Your experiment will depend on multiple things:

The first two points make things experiment-specific, so it will be hard to derive a generic behavior; the last one is entirely up to you, and trying different settings and reporting the results would probably be the best thing to do.

As for the number of parameters, it does not depend linearly on each hyperparameter, so it is not obvious how to update them; moreover, the number of parameters also depends on the number of input features.
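To see where those parameters actually sit, one option is to break the count down by submodule. This is a diagnostic sketch only, assuming the pretraining network exposes `embedder`, `encoder` and `decoder` submodules as in recent pytorch-tabnet versions, and reusing `full` and `half` from the earlier counting sketch.

```python
# Per-submodule parameter breakdown of a fitted TabNetPretrainer.
def param_breakdown(pretrainer):
    net = pretrainer.network  # built during fit()
    for name in ("embedder", "encoder", "decoder"):
        module = getattr(net, name, None)  # guard against attribute changes across versions
        if module is not None:
            print(f"{name}: {sum(p.numel() for p in module.parameters())} parameters")

param_breakdown(full)
param_breakdown(half)
```

Because the encoder and decoder depend on the number of input features as well as on n_d, n_a, n_steps, n_shared and n_independent, the total does not scale linearly with any single hyperparameter.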

About the learning capacities, I am not sure this is a scientifically sound concept; the entire point of your experiment is to assess the learning capacities of the different settings.

I'm afraid I can't help you much more on this one... But I would be happy to hear about your results.

M-R-T-U-D commented 1 year ago

Hi Optimox, your response helped me put into perspective what I should do for the experiments, so thanks a lot. I only have a couple of questions left:

Optimox commented 1 year ago
M-R-T-U-D commented 1 year ago

Alright, thanks for the clarifications, I will definitely take those points into consideration!