SkBlaz / san

Attention-based feature ranking for propositional data.
GNU General Public License v3.0

Hello, I am wondering how SAN deals with categorical variables. #6

Closed waystogetthere closed 1 year ago

waystogetthere commented 1 year ago

Hello, may I know how SAN can deal with categorical variables? For example, given the feature 'gender', we can simply do one-hot encoding:
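For illustration, a minimal sketch of such an encoding (using pandas; the data and column names are made up):

```python
import pandas as pd

# Hypothetical data frame with a categorical 'gender' column.
df = pd.DataFrame({"gender": ["male", "female", "female", "male"]})

# One-hot encode: 'gender' becomes two 0/1 columns, gender_female and gender_male.
encoded = pd.get_dummies(df, columns=["gender"], dtype=int)
print(encoded)
#    gender_female  gender_male
# 0              0            1
# 1              1            0
# 2              1            0
# 3              0            1
```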

SAN will then output two feature importance scores, one for each of these two dimensions. Can we simply take their average? Thank you very much!

Petkomat commented 1 year ago

As you pointed out, SAN cannot deal with categorical variables directly; they first need to be transformed into numeric ones. 1-hot encoding is certainly a way of doing so.

The easiest solution in the case of binary variables may be to include only the male or only the female variable (since one of them is completely redundant: after standardisation, you would have male = -female). If you insist on keeping both in the data, the average should be OK.

However, note that some other feature importance scores do not behave this way: for example, random forests (e.g., the scikit-learn implementation) split the signal among the redundant variables, and in that case summing up is the right choice.
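A minimal sketch of that behaviour (assuming scikit-learn's `RandomForestClassifier` and synthetic data): with two perfectly redundant copies of the same signal, the impurity-based importances are split between them, so only their sum reflects the underlying variable's importance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
signal = rng.normal(size=1000)          # the informative variable
noise = rng.normal(size=1000)           # an irrelevant variable
y = (signal > 0).astype(int)            # target depends only on the signal

# Two redundant encodings of the same signal (like male / female after standardisation).
X = np.column_stack([signal, -signal, noise])

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
imp = forest.feature_importances_
print(imp)                 # the signal's importance is typically split between columns 0 and 1
print(imp[0] + imp[1])     # summing the redundant columns recovers the group importance
```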

SkBlaz commented 1 year ago

Another mini comment here -- "The SAN will output 2 feature importance respectively for these 2 dimensions. Can we just simply get their average?"

To me it's not clear why, e.g., max([val1, val2, ..., valn]) wouldn't also offer some information on the relevance -- i.e., when at least some part of a feature (a value subspace) is consistently highly ranked, even if not all values are high on average. AFAIK this has not been explored yet (the average could be small if all the other values are irrelevant).
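To illustrate with made-up numbers: if only one value subspace of a feature carries signal, the per-value importances might look like this, and the average hides what the max still shows.

```python
# Hypothetical per-value importances for a categorical feature with 5 values;
# only one value subspace is actually relevant.
importances = [0.02, 0.01, 0.45, 0.03, 0.02]

print(max(importances))                      # 0.45 -> the relevant subspace stays visible
print(sum(importances) / len(importances))   # 0.106 -> averaged away by the irrelevant values
```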

waystogetthere commented 1 year ago

Thank you very much for both of your comments! As @Petkomat points out, for a binary variable it is practical to keep only one of the two columns, since they are opposites (male = -female). However, SAN perhaps cannot handle a categorical variable with more than 2 categories that way.

And regarding @SkBlaz's reply: if a feature is high-dimensional (dim >= 2), x = [val1, val2, ..., valn], a possible way is to take the maximum as its representative importance, although this has not been explored yet.
This is inspiring, as it points out a possible way to handle high-dimensional features, not only categorical variables. Sometimes I need to combine raw features (real-world features with a clear meaning: weight, height, age, ...) with hidden variables (e.g., an LSTM hidden state, with no clear meaning for each dimension); the latter always come in high dimensions. This may help when analysing such data together with the hidden states of a neural network.

Thanks a lot again for both of your fast replies, so responsive and inspiring. Happy New Year!

Petkomat commented 1 year ago

@waystogetthere Having a "high-dimensional feature", e.g., a hidden state or embedding (of dimensionality n) actually means having n features. Do not "collapse" it into a single value by taking max(vector) or mean(vector): you will lose most of the information. So your workflow is something like this:

  1. Preprocess the data:
     1.1 Convert categorical features into feature groups
     1.2 Join all the representations into a single dataset (hidden states, 1-hot encodings, original numeric features ...)
     1.3 [Other stuff]
  2. Compute feature importance for every feature (dimension)
  3. Join feature importance values of every group (either hidden representation or 1-hot encoding of a categorical feature) into a single value (see the sketch after this list)
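A rough sketch of this pipeline (purely illustrative: a random forest stands in for SAN as the importance scorer, and the data, column names, and group mapping are all made up):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier  # stand-in scorer; SAN would be used here

# 1. Preprocess: one-hot encode categoricals and join everything into one table.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": [25, 47, 33, 61],
    "gender": ["male", "female", "female", "male"],
})
hidden = pd.DataFrame(rng.random((4, 3)), columns=["h0", "h1", "h2"])  # e.g. an LSTM state
X = pd.concat([pd.get_dummies(df, columns=["gender"], dtype=int), hidden], axis=1)
y = np.array([0, 1, 1, 0])

# 2. Compute an importance value for every single column (dimension).
model = RandomForestClassifier(random_state=0).fit(X, y)
per_column = dict(zip(X.columns, model.feature_importances_))

# 3. Aggregate per-column importances back into feature groups.
groups = {"age": ["age"], "gender": ["gender_female", "gender_male"], "hidden": ["h0", "h1", "h2"]}
group_importance = {g: sum(per_column[c] for c in cols) for g, cols in groups.items()}
print(group_importance)
```

How the per-group aggregation in step 3 should be done (sum vs. max/avg) is exactly the question discussed below.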

The answers above refer only to the last step (Step 3), i.e., once you have the importance values [imp1, imp2, ..., impN] for every single feature in a group, the question is how to aggregate them into a single value.

If the features are not redundant (as male and female are), I think sum is the only right choice; otherwise, you would underestimate the importance of the group. For example, in the case of hidden states, all of them should be approximately equally important, and taking only the max (which approximately equals the mean) would effectively ignore all but one contribution (sum = N x mean).

If they are redundant, we have to consider these two cases:

A. The signal is split among the features in the group (in this case, sum should be used)
B. The signal is copied to each feature in the group (in this case, avg or max should be more informative; sum would overestimate the importance)

I am not sure which of the above holds; additional experiments/analysis would be needed to determine that. Luckily, there is a simple experiment that could do so:

Compute feature importance values for a dataset containing x1 together with a redundant feature x2, and for the same dataset with x2 excluded. If the importance value of x1 changes considerably, we have case A. Otherwise (the importance value of x1 does not change much), we are in case B.
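A minimal sketch of that experiment (again with a random forest as a stand-in scorer and synthetic data; x1/x2 are just the hypothetical redundant pair described above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def importances(X, y):
    """Any per-feature importance method would do here; a forest is just a stand-in."""
    return RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y).feature_importances_

rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
x2 = -x1                                   # x2 is redundant with x1
noise = rng.normal(size=1000)
y = (x1 > 0).astype(int)

with_x2 = importances(np.column_stack([x1, x2, noise]), y)
without_x2 = importances(np.column_stack([x1, noise]), y)

# Case A (signal split among redundant features): the importance of x1 changes a lot.
# Case B (signal copied): the importance of x1 stays roughly the same.
print("x1 importance with x2:   ", with_x2[0])
print("x1 importance without x2:", without_x2[0])
```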

waystogetthere commented 1 year ago

Thanks a lot for your detailed reply, it is very clear!

To conclude, an 'n-dimensional' feature (hidden state, embedding, one-hot) must be treated as 'n' features. Their overall importance is the sum of each dimension's importance: $IMP = \sum_i imp_i$.

Also, may I know why 'female' and 'male' are redundant? You said that 'after standardisation, you would have male = -female'. As far as I know, their one-hot encodings are:

male: [0, 1] female: [1, 0]

I don't see why 'standardisation' (I think it refers to $\frac{x - \text{mean}}{\text{std}}$) leads to the result 'male = -female'. The claim of redundancy is not that straightforward to me. Could you elaborate in a bit more detail? Thanks a lot!

Petkomat commented 1 year ago

They are redundant in the sense that knowing one gives you the same information as knowing both.

Before standardization: $m + f = 1$ (where $m$ stands for male and $f$ for female). This means that you can compute one from another (e.g., $m = 1 - f$).

The mean of $m$ and $f$ (since they are 0-1 variables) is their probability of being one. Suppose $p = P(\text{the attribute} = \text{male})$. Then $\mathbb{E}[m] = p$. Similarly, $\mathbb{E}[f] = 1- p$ since $1 - p = P(\text{the attribute} = \text{female})$.

Both $m$ and $f$ have the same standard deviation $\sigma$ (since the transformation $X\mapsto 1 - X$ does not change the variance).

If $m + f = 1$, then $m + f - 1 = 0$. Write $1 = p + (1 - p)$ and rearrange the terms: $(m - p) + (f - (1 - p)) = 0$. Divide this by $\sigma$ and you obtain $m' + f' = 0$, where $m' = (m - \mathbb{E}[m])/\sigma$ and $f' = (f - \mathbb{E}[f])/\sigma$ are the standardized values of $m$ and $f$.

Therefore, you can compute $m'$ from $f'$ (namely, $m' = -f'$), so they are still redundant.
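A quick numerical check of this, with made-up 0/1 indicator values:

```python
import numpy as np

m = np.array([1, 0, 0, 1, 1], dtype=float)   # male indicator
f = 1 - m                                     # female indicator, so m + f = 1

m_std = (m - m.mean()) / m.std()
f_std = (f - f.mean()) / f.std()

print(np.allclose(m_std, -f_std))  # True: after standardisation, male = -female
```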