As you pointed out, SAN cannot deal with categorical variables directly; they first need to be transformed into numeric ones. One-hot encoding is certainly one way of doing so.
The easiest solution in the case of binary variables may be to include only the `male` or only the `female` variable (since one of them is completely redundant: after standardisation, you would have `male = -female`). If you insist on keeping both in the data, taking the average should be OK.
However, note that some other feature importance scores do not behave this way: for example, random forests (see the scikit-learn implementation) split the signal among the redundant variables, and in that case, summing up is the right choice.
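For illustration, a minimal sketch of this splitting behaviour, assuming scikit-learn and made-up data (the feature names and numbers are only for the example):

```python
# Redundant one-hot columns "male" and "female" share the signal in a random
# forest, so summing their importances recovers the importance of "gender".
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
male = rng.integers(0, 2, size=1000)      # binary gender indicator
female = 1 - male                         # redundant complement
noise = rng.normal(size=1000)             # irrelevant numeric feature
y = male                                  # target fully determined by gender

X = np.column_stack([male, female, noise])
imp = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y).feature_importances_

print(dict(zip(["male", "female", "noise"], imp.round(3))))
print("gender importance (sum of the redundant columns):", (imp[0] + imp[1]).round(3))
```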
Another minor comment here -- "The SAN will output 2 feature importance respectively for these 2 dimensions. Can we just simply get their average?"
To me it's not clear why, e.g., `max([val1, val2, ..., valn])` wouldn't also offer some information on the relevance -- i.e., whether at least some part of a feature (a value subspace) is consistently highly ranked, even if not all values are high on average; this has not been explored yet AFAIK (the average could be small if all the other values are irrelevant).
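For illustration, a minimal sketch of the difference between the two aggregations, with made-up importance values:

```python
# For a group of per-dimension importances, the mean can be diluted by many
# irrelevant dimensions, while the max still flags the one relevant dimension.
import numpy as np

group_importances = np.array([0.02, 0.01, 0.85, 0.03])  # one dominant dimension

print("mean:", group_importances.mean())  # small, hides the relevant dimension
print("max:", group_importances.max())    # keeps it
print("sum:", group_importances.sum())
```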
Thank you very much for both of your comments! I think it is practical, as @Petkomat points out, to keep only one of the two binary variables, since after standardisation `male = -female`. However, SAN may still not be able to handle a categorical variable with more than 2 categories this way.
As for @SkBlaz's reply: if a feature is high-dimensional (dim >= 2), i.e. `x = [val1, val2, ..., valn]`, a possible way is to take the maximum as its representation, although this may not have been explored yet.
This is inspiring, as it points out a possible way to handle high-dimensional features, not only categorical variables. Sometimes I need to combine raw features (real-world features with a clear meaning: weight, height, age, ...) with hidden variables (e.g. an LSTM hidden state, where individual dimensions have no clear meaning), and the latter always come with high dimensions. This may help when analysing the data together with the hidden state of a neural network.
Thanks a lot again for both of your fast replies, so responsive and inspiring. Happy New Year!
@waystogetthere Having a "high-dimensional feature", e.g., a hidden state or embedding of dimensionality `n`, actually means having `n` features. Do not "collapse" it into a single value by taking `max(vector)` or `mean(vector)`: you will lose most of the information. So your workflow is something like this:

1. Expand every high-dimensional feature into its `n` dimensions, so that each original feature corresponds to a group of columns.
2. Compute an importance value for every single column (e.g., with SAN).
3. Aggregate the per-column importances of each group into a single value.
The answers above refer only to the last step (Step 3), i.e., once you have the importance values `[imp1, imp2, ..., impN]` for every single feature in a group, the question is how to aggregate them into a single value.
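For concreteness, a minimal sketch of that aggregation step, with hypothetical group names and made-up importance values:

```python
# One importance value per expanded column, plus a mapping from each original
# feature (group) to the indices of its columns; the group importance is the
# aggregate (here: the sum) over those indices.
import numpy as np

importances = np.array([0.05, 0.03, 0.10, 0.02, 0.02, 0.03])
groups = {
    "age": [0],               # plain scalar feature
    "gender_onehot": [1, 2],  # two redundant one-hot columns
    "lstm_state": [3, 4, 5],  # three hidden-state dimensions
}

group_importance = {name: importances[idx].sum() for name, idx in groups.items()}
print(group_importance)
```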
If the features are not redundant (unlike `male` and `female`, which are), I think `sum` is the only right choice; otherwise, you would underestimate the importance of the group. For example, in the case of hidden states, all dimensions should be approximately equally important, and taking only `max` (which then approximately equals `mean`) will effectively ignore all but one contribution (`sum = N x mean`).
If they are redundant, we have to consider these two cases:

A. The signal is split among the features in the group (in this case, `sum` should be used).
B. The signal is copied (in this case, `avg` or `max` should be more informative; `sum` would overestimate the importance).
I am not sure which of the above holds, and additional experiments/analysis would be needed to determine that. Luckily, there is a simple experiment that could determine this. Construct a dataset with:

- a target `y`,
- a feature `x1` such that `P(y = x1) = 0.75`,
- a feature `x2` which is the exact copy of `x1`.

Compute feature importance values for this dataset and for the dataset where `x2` is excluded. If the importance value of `x1` changes considerably, we have case A. Otherwise (the importance value of `x1` does not change much), we are in case B.
Thanks a lot for your detailed reply; it is very clear!
To conclude, an 'n-dimensional' feature (hidden state, embedding, one-hot) must be treated as 'n' features, and their overall importance is the sum of the per-dimension importances: $\mathrm{IMP} = \sum_i \mathrm{imp}_i$.
Also, may I know why 'female' and 'male' are redundant? You said that 'after standardisation, you would have male = -female'. As far as I know, their one-hot encodings are:
male: [0, 1], female: [1, 0]
I don't see why 'standardisation' (I think it refers to $\frac{x - \text{mean}}{\text{std}}$) gives the result 'male = -female'; the claim of redundancy is not that straightforward to me. Could you elaborate in a bit more detail? Thanks a lot!
They are redundant in the sense that knowing one gives you the same information as knowing both.
Before standardization: $m + f = 1$ (where $m$ stands for male and $f$ for female). This means that you can compute one from another (e.g., $m = 1 - f$).
Since $m$ and $f$ are 0-1 variables, the mean of each is its probability of being one. Suppose $p = P(\text{the attribute} = \text{male})$. Then $\mathbb{E}[m] = p$. Similarly, $\mathbb{E}[f] = 1 - p$, since $1 - p = P(\text{the attribute} = \text{female})$.
Both $m$ and $f$ have the same standard deviation $\sigma$ (since the transformation $X\mapsto 1 - X$ does not change the variance).
If $m + f = 1$, then $m + f - 1 = 0$. Write $1 = p + (1 - p)$ and rearrange the terms: $(m - p) + (f - (1 - p)) = 0$. Divide this by $\sigma$ and you obtain $m' + f' = 0$, where $m' = (m - \mathbb{E}[m])/\sigma$ and $f' = \dots$ are the standardized values of $m$ and $f$.
Therefore, you can compute $m'$ from $f'$ (namely, $m' = -f'$), so they are still redundant.
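A quick numeric check of this derivation (the 0/1 values below are arbitrary):

```python
# After standardisation, the two one-hot columns are exact negatives of each other.
import numpy as np

m = np.array([1, 0, 0, 1, 1, 0, 1], dtype=float)  # arbitrary "male" column
f = 1 - m                                          # corresponding "female" column

m_std = (m - m.mean()) / m.std()
f_std = (f - f.mean()) / f.std()

print(np.allclose(m_std, -f_std))  # True
```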
Hello, may I know how SAN can deal with categorical variables? For example, given the feature 'gender', we can simply do the one-hot encoding:
male: [0, 1], female: [1, 0]
SAN will then output 2 feature importance values, one for each of these 2 dimensions. Can we just simply take their average? Thank you very much!
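For reference, a minimal one-hot encoding sketch of such a gender feature (assuming pandas):

```python
# One-hot encode a binary categorical column; SAN would then score the two
# resulting columns separately.
import pandas as pd

df = pd.DataFrame({"gender": ["male", "female", "female", "male"]})
print(pd.get_dummies(df["gender"]))  # columns: female, male
```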