dreamquark-ai / tabnet

PyTorch implementation of TabNet paper : https://arxiv.org/pdf/1908.07442.pdf
https://dreamquark-ai.github.io/tabnet/
MIT License

Questions about TabNetDecoder and TabNetEncoder. #447

Closed: M-R-T-U-D closed this issue 12 months ago

M-R-T-U-D commented 1 year ago

Hi, @Optimox. I was reading through the code in the tab_network.py file to understand the structure of TabNet and the different parts involved. I don't see how some sections of the code in TabNetDecoder and TabNetEncoder relate to the TabNet paper. I have a couple of questions:

1) In the paper, it is shown that the decoder creates a feature transformer and an FC layer at each decision step. However, in the code it is implemented like this:

        for step in range(n_steps):
            transformer = FeatTransformer(
                n_d,
                n_d,
                shared_feat_transform,
                n_glu_independent=self.n_independent,
                virtual_batch_size=self.virtual_batch_size,
                momentum=momentum,
            )
            self.feat_transformers.append(transformer)

        self.reconstruction_layer = Linear(n_d, self.input_dim, bias=False) 

I expected to see both the feature transformers and the reconstruction (FC) layers created inside the n_steps loop, with the outputs of the decision steps aggregated by summation afterwards. In this implementation, the outputs of the feature transformers are summed first and then passed through a single FC layer. Does this do the same thing as what the paper suggests? If so, can you explain how?
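
For reference, here is a minimal sketch of the structure I expected from my reading of the paper (the names PaperStyleDecoder and reconstruction_layers are hypothetical, not taken from the library):

    import torch
    from torch.nn import Linear, ModuleList

    class PaperStyleDecoder(torch.nn.Module):
        """Hypothetical sketch: one reconstruction FC per decision step, summed afterwards."""

        def __init__(self, input_dim, n_d, n_steps, feat_transformers):
            super().__init__()
            # one FeatTransformer per step, built elsewhere and passed in
            self.feat_transformers = feat_transformers
            # one reconstruction FC per step, as I read the paper
            self.reconstruction_layers = ModuleList(
                [Linear(n_d, input_dim, bias=False) for _ in range(n_steps)]
            )

        def forward(self, steps_output):
            res = 0
            for x, transformer, fc in zip(
                steps_output, self.feat_transformers, self.reconstruction_layers
            ):
                # transform and reconstruct each step, then aggregate by summation
                res = res + fc(transformer(x))
            return res

That is only how I interpreted the paper, of course.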

2) In the encoder init function I saw the following:

        if self.n_shared > 0:
            shared_feat_transform = torch.nn.ModuleList()
            for i in range(self.n_shared):
                if i == 0:
                    shared_feat_transform.append(
                        Linear(self.input_dim, 2 * (n_d + n_a), bias=False)
                    )
                else:
                    shared_feat_transform.append(
                        Linear(n_d + n_a, 2 * (n_d + n_a), bias=False)
                    )

And this is utilized in GLU_Block like this:

fc = shared_layers[0] if shared_layers else None

Can you explain why FC layers are created instead of shared feature transformers for the shared_feat_transform variable? Also, in the GLU_Block class only the first FC is indexed while multiple FC layers are created using n_shared; is there a reason behind that?

Those are all the points that are still unclear to me. I would appreciate it if you could clarify them. Thanks in advance!

Optimox commented 1 year ago

Hello, thanks for reviewing the code. As this is an unofficial implementation, there might be either some mistakes or some deliberate differences from the original paper. Also, you can see on arXiv that the first version of the paper was published in August 2019 and the last one is from December 2020, while the first commit of this repo dates from October 2019, so it has been difficult to keep track of the changes across the different versions.

To answer your questions:

  1. I think you are right: the official paper states that there is one FC for each reconstruction step followed by a summation, while here there is a summation followed by a single FC. This is indeed an unwanted difference between the paper and the pytorch-tabnet library. I don't know whether it would make any significant change to the pretraining part (since the only goal here is to get a good representation of the inputs inside the encoder, not the decoder), but it's probably worth changing. Thanks!
  2. A GLU block is made of a certain number of GLU layers applied one after another. Each feature transformer has a certain number of GLU layers shared across all steps (n_shared) plus its own step-specific GLU layers, so for each FeatTransformer there is one shared GLU block and one specific GLU block. Since a GLU layer only holds parameters through its FC layer, I define those FC layers once here and they are shared across steps: https://github.com/dreamquark-ai/tabnet/blob/bcae5f43b89fb2c53a0fe8be7c218a7b91afac96/pytorch_tabnet/tab_network.py#L115 (a small sketch of this sharing is shown below). About "Also, in the GLU_Block class only the first FC is indexed while multiple FC layers are created using n_shared, is there a reason behind that?": I don't think only the first FC is used, as you can see in this for loop https://github.com/dreamquark-ai/tabnet/blob/bcae5f43b89fb2c53a0fe8be7c218a7b91afac96/pytorch_tabnet/tab_network.py#L771 ; the only reason the first layer is not inside the for loop is that it has different input dimensions.
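
Here is a small self-contained sketch of that sharing (simplified, with a hypothetical TinyGLULayer class; not the library's exact code), just to show that the same Linear modules end up inside every step's feature transformer:

    import torch
    from torch.nn import Linear, ModuleList

    n_d, n_a, n_shared, n_steps, input_dim = 8, 8, 2, 3, 16

    # FC layers defined once; they hold all the parameters of the shared GLU layers
    shared_fc = ModuleList(
        [Linear(input_dim, 2 * (n_d + n_a), bias=False)]
        + [Linear(n_d + n_a, 2 * (n_d + n_a), bias=False) for _ in range(n_shared - 1)]
    )

    class TinyGLULayer(torch.nn.Module):
        # a GLU layer only holds parameters through its fc layer
        def __init__(self, fc):
            super().__init__()
            self.fc = fc

        def forward(self, x):
            return torch.nn.functional.glu(self.fc(x), dim=-1)

    # every step wraps the *same* fc modules, so their weights are shared across steps
    steps = [ModuleList([TinyGLULayer(fc) for fc in shared_fc]) for _ in range(n_steps)]
    assert steps[0][0].fc is steps[1][0].fc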

After reviewing the code, I also noticed that I'm creating shared layers for the decoder part. I'm not sure what the paper says about that; what's your opinion?

M-R-T-U-D commented 1 year ago

Hi Optimox,

I apologize for the second question; I must have overlooked that loop during my review 😅.

About the decoder: I have read the most recent version of the paper, and there are no specific details about whether the feature transformers should work differently in the decoder compared to the encoder. For that matter, I assume the decoder should also contain shared layers similar to the encoder. It is quite difficult to say whether retaining the encoder's feature transformer structure in the decoder is beneficial without thorough ablation experiments on the decoder. The paper mentions the following:

"While increasing the depth, parameter sharing between feature transformer blocks across decision steps is an efficient way to decrease model size without degradation in performance. We indeed observe the benefit of partial parameter sharing, compared to fully decision step-dependent blocks or fully shared blocks."

However, this is general to the TabNet model as a whole and not specific to the decoder alone.

Also, I want to point out one refactoring possibility for the decoder part:

        if self.n_shared > 0:
            shared_feat_transform = torch.nn.ModuleList()
            for i in range(self.n_shared):
                if i == 0:
                    shared_feat_transform.append(Linear(n_d, 2 * n_d, bias=False))
                else:
                    shared_feat_transform.append(Linear(n_d, 2 * n_d, bias=False)) 

Can be refactored to:

        if self.n_shared > 0:
            shared_feat_transform = torch.nn.ModuleList()
            for _ in range(self.n_shared):
                shared_feat_transform.append(Linear(n_d, 2 * n_d, bias=False))

If I am not mistaken.
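
Or, if you prefer, the same thing written as a comprehension (just a suggestion, keeping the same names as the existing code):

    if self.n_shared > 0:
        shared_feat_transform = torch.nn.ModuleList(
            [Linear(n_d, 2 * n_d, bias=False) for _ in range(self.n_shared)]
        )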

Other than that, I want to thank you for your response and the clarifications! It helped me a lot with understanding the TabNet structure.

Optimox commented 1 year ago

Thanks, this looks like a reasonable refactoring :)