athewsey opened this issue 3 years ago
Hello @athewsey,
I think there is room for improvement for the RandomObfuscator, but I'm not sure I follow you on all the points. It's a very interesting topic but also very hard. Here are some thoughts of mine:

- What zero represents is not a real problem.

Hi @athewsey,
I guess most of the questions I would ask are almost the same as @Optimox's. Indeed, I think the problem exists independently of the embeddings, and I don't see how we could ever replace the default numerical mask with something intrinsically more meaningful than 0; allowing custom values does seem like an option, though.
@Optimox, what do you mean by your last point? Is it switching the values of the entire column for the batch?
Thanks both for your insights! Very useful as I try to wrap my head around it all too.
To @Optimox's first point, I think that's my bad: I used "embedding-aware attention" above to refer quite narrowly to a #217-like implementation (rather than the range of perhaps different ways you could think about doing that)... and also "embedding" to refer quite broadly to the general translation from the training dataset X to the initial batch-norm inputs. I'd maybe characterize the #217 method further as:

- Reducing the output dimension of AttentiveTransformer (the attention masks) from post_embed_dim to the raw input n_features, so the model can only (conversely, only needs to) learn to attend to features... regardless of how many dimensions the feature is internally represented by - because these columns will share the same mask/attention weight.
...So although adding "is missing" flag columns for scalars would still double FeatureTransformer input dimensionality, it need not complicate the AttentiveTransformer's task at all (assuming constant n_a, n_d) - so the impact on task complexity need not be as bad as, say, it would be to double up your input columns for a plain XGBoost model.
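To make that concrete, here is a rough sketch (my own illustration, not code from #217): the attention mask has one weight per raw feature, and a precomputed index map (hypothetical name `feature_index_map`) expands it to post-embedding width so all embedded columns of a feature share the same weight:

```python
import torch

def expand_feature_mask(mask, feature_index_map):
    """Expand a (batch, n_features) attention mask to (batch, post_embed_dim),
    repeating each feature's weight across all of its embedding columns."""
    return mask[:, feature_index_map]

# Toy example: feature 1 is categorical and embedded into 3 columns.
feature_index_map = torch.tensor([0, 1, 1, 1, 2])
mask = torch.tensor([[0.2, 0.5, 0.3]])
print(expand_feature_mask(mask, feature_index_map))
# tensor([[0.2000, 0.5000, 0.5000, 0.5000, 0.3000]])
```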
I do hear & agree with the point about zero being intrinsically special as a "no contribution" value at many points in the network (especially e.g. summing up the output contributions, and at attention-weighted FeatureTransformer inputs)... and I'd maybe think of the input side of this as a limitation closely related to what I'm trying to alleviate with the masking?
I wonder if e.g. is_present indicator fields would work measurably better than is_missing in practice? Or even if +1/-1 indicator fields would perform better than 1/0, so FeatureTransformers see an obvious difference between a zero-valued feature that is present, absent, or not currently attended-to.
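As a toy illustration of those indicator fields (my own sketch, not existing library behaviour): build a +1/-1 presence flag per scalar column and zero-fill the missing values, so the flag carries the "missing" signal instead of the value itself:

```python
import torch

def add_presence_flags(x):
    """x: (batch, n_features) float tensor that may contain NaNs.
    Returns (batch, 2 * n_features): zero-filled values followed by +1/-1 flags."""
    present = torch.isfinite(x)
    flags = torch.where(present, torch.ones_like(x), -torch.ones_like(x))
    filled = torch.where(present, x, torch.zeros_like(x))
    return torch.cat([filled, flags], dim=1)

x = torch.tensor([[1.0, float("nan")], [0.0, 2.0]])
print(add_presence_flags(x))
# tensor([[ 1.,  0.,  1., -1.],
#         [ 0.,  2.,  1.,  1.]])
```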
The idea of swap-based noise rather than masking is also an interesting possibility - I wonder if there's a way it could be implemented that still works nicely & naturally on input datasets with missing values? I'm particularly interested in pre-training as a potential treatment for missing values, since somehow every dataset always seems to be at least a little bit garbage 😂
On the nan piece, I would add that AFAIK it doesn't necessarily need to be a blocker for backprop-based training: nan * 0 = nan, so you can't just multiply by a binary mask, but you can still do normalizing operations performantly in-graph, e.g. using functions like torch.isfinite() and tensor indexing. In my old draft code, the for loops come from iterating over features (because they might have different masking configurations) rather than the logic for handling an individual feature (feature[~torch.isfinite(x_feat)] = feat_nonfinite_mask). It could probably be vectorized with some more time/brains.

...But encapsulating this in the PyTorch module itself would hopefully be more easily usable with nan-containing inputs (e.g. gappy Pandas dataframes), and not a noticeable performance hit over e.g. having to do it in a DataLoader anyway? Just a bit less optimal than if you wanted to e.g. pre-process the dataset once and then run many training jobs with it.
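For what it's worth, a possible vectorized form of that per-feature loop, assuming a single fill value per feature is enough (names are illustrative, not from the library):

```python
import torch

def fill_nonfinite(x, fill_values):
    """x: (batch, n_features); fill_values: (n_features,) replacement per column.
    Returns the filled tensor plus a 0/1 mask of where values were non-finite."""
    finite = torch.isfinite(x)
    filled = torch.where(finite, x, fill_values.expand_as(x))
    return filled, (~finite).float()

x = torch.tensor([[1.0, float("nan")], [float("inf"), 2.0]])
filled, missing_mask = fill_nonfinite(x, torch.zeros(2))
```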
Of course I guess the above assumes the Obfuscator comes before the "embedding" layer and has no backprop-trainable parameters - I'd have to take a closer look at the new pretraining loss function stuff to understand that a bit more and follow your comments on that & the impact on the decoder!
@eduardocarvp I was just thinking of randomly swapping some columns for each row with another random row (that would probably mean we need to lower the percentage of columns that you swap in order to be able to reconstruct the original input).
@athewsey I think I need to have a closer look at all the links you are referring to, I might have missed something.
By re-reading the conversation, I think what you are looking for would be to create a new sort of continuous/categorical embeddings. Current embeddings take ints as inputs; those ints refer to the index of the embedding matrix which will be used to pass through the graph. You could change the embedding module to allow non-finite values, which would go into a specific row of the matrix (this is for categorical features). For continuous features, I wonder if there is a way to do the same thing: if you have a non-finite value, then pass it through a one-dimensional trainable embedding (allowing the model to learn how to represent non-finite values); if the value is finite, then simply pass it through the network.
Wouldn't something like this solve all the problems that you are pointing out? (I don't know if it's feasible but I think it should be - but I'm wondering why this does not exist already ^^)
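If it helps, here is a very rough sketch of the continuous-feature half of that idea (a hypothetical module, not existing code): a trainable per-feature scalar stands in for non-finite inputs, while finite values pass through unchanged:

```python
import torch
import torch.nn as nn

class ContinuousWithMissing(nn.Module):
    """Replace non-finite entries of each continuous column with a learnable
    per-feature "missing" value; finite entries pass through untouched."""

    def __init__(self, n_features):
        super().__init__()
        self.missing_value = nn.Parameter(torch.zeros(n_features))

    def forward(self, x):
        finite = torch.isfinite(x)
        return torch.where(finite, x, self.missing_value.expand_as(x))

x = torch.tensor([[1.0, float("nan")], [3.0, 2.0]])
out = ContinuousWithMissing(n_features=2)(x)  # NaN -> learnable value for its column
```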
@eduardocarvp something like this (from https://www.kaggle.com/davidedwards1/tabularmarch21-dae-starter):

```python
import numpy as np
import torch

class SwapNoiseMasker(object):
    def __init__(self, probas):
        # per-column probability of swapping each value
        self.probas = torch.from_numpy(np.array(probas))

    def apply(self, X):
        # draw a Bernoulli "should swap" indicator for every cell
        should_swap = torch.bernoulli(self.probas.to(X.device) * torch.ones(X.shape).to(X.device))
        # where indicated, take the value from a randomly permuted row
        corrupted_X = torch.where(should_swap == 1, X[torch.randperm(X.shape[0])], X)
        mask = (corrupted_X != X).float()
        return corrupted_X, mask
```
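For reference, a minimal usage sketch for the class above (per-column swap probabilities chosen arbitrarily):

```python
X = torch.rand(8, 3)
masker = SwapNoiseMasker(probas=[0.15, 0.15, 0.15])
corrupted_X, mask = masker.apply(X)  # mask marks the cells that actually changed
```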
Feature request
The current RandomObfuscator implementation (in line with the original paper, if I understand correctly) masks values by setting them to 0. But 0 is a very significant number in a lot of contexts to be using as a mask! I would liken it to choosing the token THE as your [MASK] for an English text model pre-training task.

I believe this pattern may be materially limiting accuracy/performance on datasets containing a large number of fields/instances where 0 (or proximity to 0) already has important significance - unless these datasets are pre-processed in some way to mitigate the impact (e.g. shifting binary encodings from 0/1 to 1/2, etc).
What is the expected behavior?
I suggest two primary options:
Embedding-aware attention should be a pre-requisite for (2) because otherwise the introduction of extra mask flag columns would add lots of extra parameters / double input dimensionality... Whereas if it's done in a model-aware way results could be much better.
What is motivation or use case for adding/changing the behavior?
I've lately been playing with pre-training on the Forest Cover Type benchmark dataset (which includes a lot of already-one-hot-encoded fields I haven't yet bothered to "fix" to proper TabNet categorical fields) and even after experimenting with a range of parameters am finding the model loves to converge to unsupervised losses of ~7.130 (should really be <1.0, per the README, as 1.0 is equivalent to just always predicting average value for the feature).
As previously noted on a different issue, I did some experiments with the same dataset on top of my PR #217 last year before pre-training was available, and found that in the supervised case I got better performance adding a flag column than simply selecting a different mask value (old draft code is here).
...So from my background playing with this dataset, I'm super-suspicious that the poor pre-training losses I'm currently observing are being skewed by the model's inability to tell when binary fields are =0 vs masked... and I have seen some good performance from the flag-column treatment in past testing.
How should this be implemented in your opinion?
Change the RandomObfuscator to use a non-finite value like nan as the mask value, and allow non-finite values in (both pre-training and fine-tuning) dataset inputs, so consistent treatment can be applied to masked vs missing values and models can be successfully pre-trained or fine-tuned with arbitrary gaps in X.
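One possible shape for this (purely a sketch of the suggestion, not a worked implementation) would be an obfuscator that writes nan instead of 0, leaving a downstream embedding/normalization step to turn non-finite values into something learnable:

```python
import torch

class NanObfuscator(torch.nn.Module):
    """Sketch: mask a random subset of cells by writing NaN instead of 0,
    so masked/missing cells stay distinguishable from genuine zeros."""

    def __init__(self, pretraining_ratio):
        super().__init__()
        self.pretraining_ratio = pretraining_ratio

    def forward(self, x):
        obfuscated_vars = torch.bernoulli(self.pretraining_ratio * torch.ones_like(x))
        masked_x = torch.where(obfuscated_vars.bool(), torch.full_like(x, float("nan")), x)
        return masked_x, obfuscated_vars
```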
Are you willing to work on this yourself?
yes