athewsey opened this issue 3 years ago
Hello @athewsey,
I think there is room for improvement for the RandomObfuscator, but I'm not sure I follow you on all the points. It's a very interesting topic but also very hard. Here are some thoughts of mine:

- What zero represents is not a real problem.

Hi @athewsey,
I guess most of the questions I would ask are almost the same as @Optimox's. Indeed, I think the problem exists independently of the embeddings, and I don't see how we could ever replace the default numerical mask with something intrinsically more meaningful than 0; allowing custom values does seem like an option, though.
@Optimox, what do you mean by your last point? Is it switching the values of the entire column for the batch?
Thanks both for your insights! Very useful as I try to wrap my head around it all too.
To @Optimox's first point, I think that's my bad: I used "embedding-aware attention" above to refer quite narrowly to a #217-like implementation (rather than the range of perhaps different ways you could think about doing that)... and also "embedding" to refer quite broadly to the general translation from the training dataset X to the initial batch-norm inputs. I'd maybe characterize the #217 method further as:

- Reducing the output dimension of AttentiveTransformer (the attention masks) from post_embed_dim to the raw input n_features, so the model can only (conversely, only needs to) learn to attend to features... regardless of how many dimensions the feature is internally represented by - because these columns will share the same mask/attention weight.
...So although adding "is missing" flag columns for scalars would still double FeatureTransformer input dimensionality, it need not complicate the AttentiveTransformer's task at all (assuming constant n_a, n_d) - so the impact on task complexity need not be as bad as, say, it would be to double up your input columns for a plain XGBoost model.
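To make that concrete, here is a rough sketch (my own illustration, not code from #217): the attention mask has one weight per raw feature, and a precomputed index map (hypothetical name `feature_index_map`) expands it to post-embedding width so all embedded columns of a feature share the same weight:

```python
import torch

def expand_feature_mask(mask, feature_index_map):
    """Expand a (batch, n_features) attention mask to (batch, post_embed_dim),
    repeating each feature's weight across all of its embedding columns."""
    return mask[:, feature_index_map]

# Toy example: feature 1 is categorical and embedded into 3 columns.
feature_index_map = torch.tensor([0, 1, 1, 1, 2])
mask = torch.tensor([[0.2, 0.5, 0.3]])
print(expand_feature_mask(mask, feature_index_map))
# tensor([[0.2000, 0.5000, 0.5000, 0.5000, 0.3000]])
```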
I do hear & agree with the point about zero being intrinsically special as a "no contribution" value at many points in the network (especially e.g. summing up the output contributions, and at attention-weighted FeatureTransformer inputs)... and I'd maybe think of the input side of this as a limitation closely related to what I'm trying to alleviate with the masking?
I wonder if e.g. is_present indicator fields would work measurably better than is_missing in practice? Or even if +1/-1 indicator fields would perform better than 1/0, so FeatureTransformers see an obvious difference between a zero-valued feature that is present, absent, or not currently attended-to.
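As a toy illustration of those indicator fields (my own sketch, not existing library behaviour): build a +1/-1 presence flag per scalar column and zero-fill the missing values, so the flag carries the "missing" signal instead of the value itself:

```python
import torch

def add_presence_flags(x):
    """x: (batch, n_features) float tensor that may contain NaNs.
    Returns (batch, 2 * n_features): zero-filled values followed by +1/-1 flags."""
    present = torch.isfinite(x)
    flags = torch.where(present, torch.ones_like(x), -torch.ones_like(x))
    filled = torch.where(present, x, torch.zeros_like(x))
    return torch.cat([filled, flags], dim=1)

x = torch.tensor([[1.0, float("nan")], [0.0, 2.0]])
print(add_presence_flags(x))
# tensor([[ 1.,  0.,  1., -1.],
#         [ 0.,  2.,  1.,  1.]])
```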
The idea of swap-based noise rather than masking is also an interesting possibility - I wonder if there's a way it could be implemented that still works nicely & naturally on input datasets with missing values? I'm particularly interested in pre-training as a potential treatment for missing values, since somehow every dataset always seems to be at least a little bit garbage 😂
On the nan piece, I would add that AFAIK it doesn't necessarily need to be a blocker for backprop-based training: nan * 0 = nan, so you can't just multiply by a binary mask, but you can still do normalizing operations performantly in-graph, e.g. using functions like torch.isfinite() and tensor indexing. In my old draft code, the for loops come from iterating over features (because they might have different masking configurations) rather than the logic for handling an individual feature (feature[~torch.isfinite(x_feat)] = feat_nonfinite_mask). It could probably be vectorized with some more time/brains.

...But encapsulating this in the PyTorch module itself would hopefully be more easily usable with nan-containing inputs (e.g. gappy Pandas dataframes), and not a noticeable performance hit over e.g. having to do it in a DataLoader anyway? Just a bit less optimal than if you wanted to e.g. pre-process the dataset once and then run many training jobs with it.
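For what it's worth, a possible vectorized form of that per-feature loop, assuming a single fill value per feature is enough (names are illustrative, not from the library):

```python
import torch

def fill_nonfinite(x, fill_values):
    """x: (batch, n_features); fill_values: (n_features,) replacement per column.
    Returns the filled tensor plus a 0/1 mask of where values were non-finite."""
    finite = torch.isfinite(x)
    filled = torch.where(finite, x, fill_values.expand_as(x))
    return filled, (~finite).float()

x = torch.tensor([[1.0, float("nan")], [float("inf"), 2.0]])
filled, missing_mask = fill_nonfinite(x, torch.zeros(2))
```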
Of course I guess the above assumes the Obfuscator comes before the "embedding" layer and has no backprop-trainable parameters - I'd have to take a closer look at the new pretraining loss function stuff to understand that a bit more and follow your comments on that & the impact on the decoder!
@eduardocarvp I was just thinking of randomly swapping some columns for each row with another random row (that would probably mean we need to lower the percentage of columns that you swap in order to be able to reconstruct the original input).
@athewsey I think I need to have a closer look at all the links you are referring to, I might have missed something.
By re-reading the conversation, I think what you are looking for would be to create a new sort of continuous/categorical embeddings. Current embeddings take ints as inputs; those ints refer to the index of the embedding matrix which will be used to pass through the graph. You could change the embedding module to allow non-finite values, which would go into a specific row of the matrix (this is for categorical features). For continuous features, I wonder if there is a way to do the same thing: if you have a non-finite value, then pass it through a one-dimensional trainable embedding (allowing the model to learn how to represent non-finite values); if the value is finite, then simply pass it through the network.
Wouldn't something like this solve all the problems that you are pointing out? (I don't know if it's feasible but I think it should be - but I'm wondering why this does not exist already ^^)
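If it helps, here is a very rough sketch of the continuous-feature half of that idea (a hypothetical module, not existing code): a trainable per-feature scalar stands in for non-finite inputs, while finite values pass through unchanged:

```python
import torch
import torch.nn as nn

class ContinuousWithMissing(nn.Module):
    """Replace non-finite entries of each continuous column with a learnable
    per-feature "missing" value; finite entries pass through untouched."""

    def __init__(self, n_features):
        super().__init__()
        self.missing_value = nn.Parameter(torch.zeros(n_features))

    def forward(self, x):
        finite = torch.isfinite(x)
        return torch.where(finite, x, self.missing_value.expand_as(x))

x = torch.tensor([[1.0, float("nan")], [3.0, 2.0]])
out = ContinuousWithMissing(n_features=2)(x)  # NaN -> learnable value for its column
```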
@eduardocarvp something like this (from https://www.kaggle.com/davidedwards1/tabularmarch21-dae-starter):

```python
import numpy as np
import torch

class SwapNoiseMasker(object):
    def __init__(self, probas):
        # per-column probability of swapping each value
        self.probas = torch.from_numpy(np.array(probas))

    def apply(self, X):
        # draw a Bernoulli "should swap" indicator for every cell
        should_swap = torch.bernoulli(self.probas.to(X.device) * torch.ones(X.shape).to(X.device))
        # where indicated, take the value from a randomly permuted row
        corrupted_X = torch.where(should_swap == 1, X[torch.randperm(X.shape[0])], X)
        mask = (corrupted_X != X).float()
        return corrupted_X, mask
```
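For reference, a minimal usage sketch for the class above (per-column swap probabilities chosen arbitrarily):

```python
X = torch.rand(8, 3)
masker = SwapNoiseMasker(probas=[0.15, 0.15, 0.15])
corrupted_X, mask = masker.apply(X)  # mask marks the cells that actually changed
```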
Feature request
The current RandomObfuscator implementation (in line with the original paper, if I understand correctly) masks values by setting them to 0. But 0 is a very significant number in a lot of contexts to be using as a mask! I would liken it to choosing the token THE as your [MASK] for an English text model pre-training task.

I believe this pattern may be materially limiting accuracy/performance on datasets containing a large number of fields/instances where 0 (or proximity to 0) already has important significance - unless these datasets are pre-processed in some way to mitigate the impact (e.g. shifting binary encodings from 0/1 to 1/2, etc).
What is the expected behavior?
I suggest two primary options:
Embedding-aware attention should be a pre-requisite for (2) because otherwise the introduction of extra mask flag columns would add lots of extra parameters / double input dimensionality... Whereas if it's done in a model-aware way results could be much better.
What is motivation or use case for adding/changing the behavior?
I've lately been playing with pre-training on the Forest Cover Type benchmark dataset (which includes a lot of already-one-hot-encoded fields I haven't yet bothered to "fix" to proper TabNet categorical fields) and even after experimenting with a range of parameters am finding the model loves to converge to unsupervised losses of ~7.130 (should really be <1.0, per the README, as 1.0 is equivalent to just always predicting average value for the feature).
As previously noted on a different issue, I did some experiments with the same dataset on top of my PR #217 last year before pre-training was available, and found that in the supervised case I got better performance adding a flag column than simply selecting a different mask value (old draft code is here).
...So from my background playing with this dataset, I'm super-suspicious that the poor pre-training losses I'm currently observing are being skewed by the model's inability to tell when binary fields are =0 vs masked... and I have seen some good performance from the flag-column treatment in past testing.
How should this be implemented in your opinion?
Change the RandomObfuscator to use a non-finite value like nan as the mask value, and allow non-finite values in (both pre-training and fine-tuning) dataset inputs, so consistent treatment can be applied to masked vs missing values and models can be successfully pre-trained or fine-tuned with arbitrary gaps in X.
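One possible shape for this (purely a sketch of the suggestion, not a worked implementation) would be an obfuscator that writes nan instead of 0, leaving a downstream embedding/normalization step to turn non-finite values into something learnable:

```python
import torch

class NanObfuscator(torch.nn.Module):
    """Sketch: mask a random subset of cells by writing NaN instead of 0,
    so masked/missing cells stay distinguishable from genuine zeros."""

    def __init__(self, pretraining_ratio):
        super().__init__()
        self.pretraining_ratio = pretraining_ratio

    def forward(self, x):
        obfuscated_vars = torch.bernoulli(self.pretraining_ratio * torch.ones_like(x))
        masked_x = torch.where(obfuscated_vars.bool(), torch.full_like(x, float("nan")), x)
        return masked_x, obfuscated_vars
```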
Are you willing to work on this yourself?
yes