[Closed] M-R-T-U-D closed this issue 12 months ago
Hello, thanks for reviewing the code. As this is an unofficial implementation, there might be either some mistakes or some deliberate differences from the original paper. As you can see on arXiv, the first version of the paper was published in August 2019 and the latest in December 2020, while the first commit of this repo was made in October 2019, so it has been difficult to keep track of the changes across versions.
To answer your questions:
After reviewing the code, I also noticed that I'm creating shared layers for the decoder part. I'm not sure what the paper says about that; what's your opinion?
Hi Optimox,
I apologize for the second question; I must have overlooked the loop part during the review 😅.
About the decoder: I have read the most recent version of the paper, and no specific details are given about whether the feature transformers should work differently in the decoder compared to the encoder. So I assume the decoder should also contain shared layers, similar to the encoder. It is difficult to say whether retaining the encoder's feature transformer structure in the decoder is beneficial without thorough ablation experiments on the decoder. The paper mentions the following:
"While increasing the depth, parameter sharing between feature transformer blocks across decision steps is an efficient way to decrease model size without degradation in performance. We indeed observe the benefit of partial parameter sharing, compared to fully decision step-dependent blocks or fully shared blocks."
However, this is general to the TabNet model as a whole and not specific to only the decoder.
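To make the quoted idea concrete, here is a minimal sketch of partial parameter sharing (hypothetical names, not the repo's actual classes): the shared layers are the same module objects reused across all decision steps, while the step-dependent layers are freshly created per step.

```python
import torch.nn as nn

n_steps, n_shared, n_independent, n_d = 3, 2, 2, 8

# Shared stack: built once, reused by every decision step (same module objects)
shared = nn.ModuleList(
    [nn.Linear(n_d, 2 * n_d, bias=False) for _ in range(n_shared)]
)

steps = []
for _ in range(n_steps):
    # Step-dependent stack: fresh parameters for each decision step
    independent = nn.ModuleList(
        [nn.Linear(n_d, 2 * n_d, bias=False) for _ in range(n_independent)]
    )
    steps.append((shared, independent))

# The shared layers are literally the same objects in every step,
# so the parameter count grows only with the step-dependent part.
assert steps[0][0] is steps[1][0]
assert steps[0][1] is not steps[1][1]
```

This is only meant to illustrate why parameter sharing keeps the model small as depth grows; the actual layer wiring in the repo differs.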
Also, I want to point out one refactoring possibility for the decoder part:
```python
if self.n_shared > 0:
    shared_feat_transform = torch.nn.ModuleList()
    for i in range(self.n_shared):
        if i == 0:
            shared_feat_transform.append(Linear(n_d, 2 * n_d, bias=False))
        else:
            shared_feat_transform.append(Linear(n_d, 2 * n_d, bias=False))
```
Can be refactored to:
```python
if self.n_shared > 0:
    shared_feat_transform = torch.nn.ModuleList()
    for _ in range(self.n_shared):
        shared_feat_transform.append(Linear(n_d, 2 * n_d, bias=False))
```
If I am not mistaken.
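As an aside on the `2 * n_d` output width in the snippets above: each FC feeds a GLU, which splits the features in half and gates one half with the sigmoid of the other, bringing the width back down to `n_d`. A standalone sketch of that mechanism (not the repo's code):

```python
import torch
import torch.nn.functional as F
from torch.nn import Linear

torch.manual_seed(0)
batch, n_d = 4, 8

fc = Linear(n_d, 2 * n_d, bias=False)  # doubles the width, as in the snippets above
x = torch.randn(batch, n_d)
h = fc(x)

# GLU: first half of the features, gated by the sigmoid of the second half
manual = h[:, :n_d] * torch.sigmoid(h[:, n_d:])
assert torch.allclose(manual, F.glu(h, dim=-1))
print(manual.shape)  # torch.Size([4, 8]) -> back to n_d
```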
Other than that, I want to thank you for your response and the clarifications! It helped me a lot with understanding the TabNet structure.
Thanks, this looks like a reasonable refactoring :)
Hi @Optimox, I was reading through the code in the `tab_network.py` file to understand the structure of TabNet and the different parts involved. I don't see how some sections of the code in `TabNetDecoder` and `TabNetEncoder` relate to the TabNet paper. I have a couple of questions:

1) In the paper, it is shown that the decoder creates a feature transformer and an FC layer. However, in the code it is implemented like this:

I expected to see feature transformers and reconstruction layers being created in the `n_steps` loop and later aggregated by summing the output of each decision step. In this code, the outputs of the feature transformers are summed instead and then passed through a single FC layer. Does this do the same as what the paper suggests? If so, can you explain how?

2) In the encoder's `__init__` function I saw the following:

And this is utilized in `GLU_Block` like this:

Can you explain why FC layers are created instead of shared feature transformers for the `shared_feat_transform` variable? Also, in the `GLU_Block` class only the first FC is indexed while multiple FC layers are created using `n_shared`; is there a reason behind that?

These are all the questions about things that are still unclear to me. I would appreciate it if you could clarify them. Thanks in advance!
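On question 1, one possible reading: if the final FC is linear, bias-free, and shared across steps, then applying it once to the sum of the step outputs equals summing per-step FC outputs, by linearity. A small check of that identity (hypothetical shapes; I am not claiming this is exactly what the authors intended):

```python
import torch
from torch.nn import Linear

torch.manual_seed(0)
n_steps, batch, n_d, output_dim = 3, 4, 8, 16

# Stand-ins for the per-step feature transformer outputs
step_outputs = [torch.randn(batch, n_d) for _ in range(n_steps)]

fc = Linear(n_d, output_dim, bias=False)  # single shared, bias-free FC

paper_style = sum(fc(x) for x in step_outputs)  # FC per step, then sum
code_style = fc(sum(step_outputs))              # sum first, then one FC

assert torch.allclose(paper_style, code_style, atol=1e-5)
```

Note that this equivalence breaks if each step has its own FC weights or a bias term, so it only explains the code under that specific assumption.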