Closed: ensonario closed this issue 2 years ago
You can always access the masks from attention (with the explain method) and try to cluster them with UMAP or t-SNE, but I'm not sure that's what you want?
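A rough sketch of the clustering idea above: take the per-step attention masks and embed them in 2D with t-SNE (UMAP would work the same way). The shapes and mask values below are simulated with random data; in practice the masks would come from pytorch-tabnet's `clf.explain(X)`, which returns an aggregate explanation matrix plus a dict of per-step masks.

```python
import numpy as np
from sklearn.manifold import TSNE

# Simulate per-step attention masks with hypothetical shapes.
# With pytorch-tabnet you would instead do:
#   explain_matrix, masks = clf.explain(X_test)
rng = np.random.default_rng(0)
n_samples, n_features, n_steps = 60, 14, 3
masks = {step: rng.random((n_samples, n_features)) for step in range(n_steps)}

# Concatenate the per-step masks into one vector per sample,
# then embed in 2D for visualisation.
mask_vectors = np.concatenate([masks[s] for s in range(n_steps)], axis=1)
embedding = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(mask_vectors)
print(embedding.shape)  # (60, 2)
```

Each row of `embedding` can then be scatter-plotted, coloured by label or by prediction, to see whether the attention patterns form meaningful clusters.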
Thanks @Optimox for your prompt response. Yes, I can access the masks, but I'm not sure it makes sense to analyse them directly. Is there some kind of internal embedding in TabNet I can use for the visualisation? In the case of a VAE, for instance, we have bottleneck features, which can be treated as a high-level representation of the raw data, and visualising these features helps to understand the data. So I'm wondering whether TabNet can be used in the same way? I hope it makes sense :)
you would need to access the results before the final mapping: https://github.com/dreamquark-ai/tabnet/blob/4fa545da50796f0d16f49d0cb476d5a30c2a27c1/pytorch_tabnet/tab_network.py#L480
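One generic way to grab the activations right before a final mapping layer is a PyTorch forward hook. The toy network below is a stand-in; the exact attribute path to TabNet's final Linear layer (something like `clf.network.tabnet.final_mapping`) is an assumption here and should be checked against the installed version at the link above.

```python
import torch
import torch.nn as nn

# Storage for the captured activations.
captured = {}

def save_input_hook(module, inputs, output):
    # inputs is a tuple; inputs[0] is the tensor fed INTO this layer,
    # i.e. the representation just before the final mapping.
    captured["pre_final"] = inputs[0].detach()

# Toy stand-in: an "encoder" followed by a final mapping layer.
net = nn.Sequential(nn.Linear(10, 8), nn.ReLU(), nn.Linear(8, 2))
final_mapping = net[2]  # with TabNet, hook the final Linear layer instead
handle = final_mapping.register_forward_hook(save_input_hook)

x = torch.randn(5, 10)
_ = net(x)
handle.remove()

print(captured["pre_final"].shape)  # torch.Size([5, 8])
```

The captured tensor can then be fed to UMAP or t-SNE exactly like a VAE bottleneck.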
But to be honest, I think a VAE would be a better fit if visualisation is your goal. That said, it might still be interesting to visualise the separation power of the attention alone (through the masks): attention can be seen as the model's way of reasoning, so this would reveal clusters of samples that the model handles similarly when making a prediction.
@ensonario do you have any plots to share ?
Hi @Optimox, I got a bit distracted, but I've returned to this task. I haven't done the visualisation yet, but I'll share anything interesting that comes out of it.
Hi @Optimox, looking at these unsupervised mask visualisations, it feels like the masks are not just capturing the most predictive parameters at each step; the parameters inside a mask seem connected. It's as if each mask represents a high-level feature: for instance, the (occupation, race, sex) parameters in Mask1 suggest that the mask captures complex dependencies between these parameters and could be called an "occupation / race dependency".
Or Mask2 (race, country_of_origin) seems quite representative as well.
Mask3 with education-sex dependency is quite interesting too.
Is this assumption correct, or is it just a coincidence?
And another question: which key parameters should I pay attention to for unsupervised training? cat_emb_dim, n_steps? What is the logic behind the pretraining_ratio parameter? What is the motivation for values like 0.3, 0.5 or 0.7 (as in the unsupervised example)?
Hello @ensonario,
Yes, each attention layer tends to 'specialize' in looking at specific features while totally ignoring the others. However, as you can see, each mask for the same attention layer is still entry-based: even though the layer is specialized, it will still put different amounts of attention on each feature depending on the specific example.
Note that the final attention of the model is the sum of all masks. Also note that you may (or may not) want different attention layers to 'specialize' in different features. The gamma parameter exists exactly for this reason: gamma = 1.0 means you are not adding any constraint for the attention layers to be disjoint, while gamma > 1 (I would not advise going above 2) means you are asking the attention layers to be independent.
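The "final attention is the sum of all masks" point above can be sketched in a few lines. The mask values and shapes below are simulated; in practice the per-step masks come from `clf.explain(X)`, and the per-sample normalization is just one reasonable way to read the aggregate as a feature-importance distribution, not the library's exact implementation.

```python
import numpy as np

# Simulated per-step attention masks (hypothetical shapes).
rng = np.random.default_rng(42)
n_samples, n_features, n_steps = 8, 5, 3
masks = {s: rng.random((n_samples, n_features)) for s in range(n_steps)}

# Global attention = sum of the per-step masks.
aggregate = sum(masks[s] for s in range(n_steps))

# Normalize each row so it sums to 1, making the aggregate readable
# as a per-sample feature-importance distribution.
aggregate /= aggregate.sum(axis=1, keepdims=True)

print(aggregate.shape)  # (8, 5)
```

Rows of `aggregate` can then be compared across samples to see which features the model attends to overall.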
Key parameters are n_a, n_d and n_steps.
The pretraining_ratio parameter makes the pretraining exercise easier or harder: the bigger it is, the harder the reconstruction task. If the ratio is 0.1, only 10% of the input features are masked during training, which inevitably makes things easier for the model since it only needs to guess 10% of the features. If you set the ratio to 0.9, it will be much harder for the model to guess 90% of the features based only on the remaining 10%. The 'good' value to choose depends on how independent your variables are. In any case, you should aim for a reconstruction loss below 1; as long as your unsupervised loss is above 1, you can consider your unsupervised model useless.
Hope this helps!
Feature request
What is the expected behavior?
After reading the paper and trying the library, it feels like the model might be a good approach for metric learning and subsequent data visualisation using UMAP or t-SNE. It also feels like the masks used in TabNet can be treated as disentangled features.
What is motivation or use case for adding/changing the behavior?
Better understanding of complex data, and representing data as high-level (ideally disentangled) features. I'm trying to understand whether it can be a drop-in alternative to beta-VAE or InfoGAN (which are designed mostly for images)?
How should this be implemented in your opinion?
I suspect this is already possible; the question is how to extract the necessary weights from the model.
Are you willing to work on this yourself? yes