SeldonIO / alibi

Algorithms for explaining machine learning models
https://docs.seldon.io/projects/alibi/en/stable/

AnchorText - row-wise sampling for unknown/similarity #434

Open RobertSamoilescu opened 3 years ago

RobertSamoilescu commented 3 years ago

Currently, the unknown and similarity perturbation strategies use a column-wise sampling procedure to choose the words that are replaced by the UNK token or by similar words, respectively. For a column i (which corresponds to a word/token), the procedure consists of the following steps:

Because the perturbation is performed column-wise, some sentences may not be perturbed at all. For example, with 3 sentences, the second one can be left unchanged:

word11 word12 word13 ---> UNK    word12 UNK
word21 word22 word23 ---> word21 word22 word23
word31 word32 word33 ---> UNK    UNK    word33
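
To make the effect concrete, here is a minimal NumPy sketch of one way column-wise sampling could work. It is an assumption for illustration, not the alibi code: the function name `column_wise_mask`, the `sample_proba` parameter, and the binomial draw per column are all hypothetical. It only shows that when rows are picked independently for each column, some rows can end up with no masked word.

```python
import numpy as np

def column_wise_mask(n_sentences, n_words, sample_proba=0.5, seed=0):
    """Hypothetical column-wise sampler: True marks a word to perturb."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((n_sentences, n_words), dtype=bool)
    for i in range(n_words):
        # For column i, draw how many sentences get this word perturbed ...
        n_changed = rng.binomial(n_sentences, sample_proba)
        # ... then pick that many distinct rows at random.
        rows = rng.choice(n_sentences, size=n_changed, replace=False)
        mask[rows, i] = True
    return mask

mask = column_wise_mask(n_sentences=1000, n_words=3)
# Rows where no column was selected, i.e. sentences left unperturbed.
n_unperturbed = int((~mask.any(axis=1)).sum())
print(f"{n_unperturbed} of {mask.shape[0]} sentences were not perturbed")
```

With 3 words and a probability of 0.5 per word, roughly one sentence in eight would be expected to stay untouched under these assumptions.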

AnchorText with language models implements a row-wise sampling procedure, which ensures that at least one word in every sentence is perturbed (masked). With column-wise sampling, additional work may be needed to ensure that each sentence has at least one perturbed word.
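
For comparison, a row-wise variant might look like the sketch below. Again, this is a hypothetical illustration and not the language-model sampler in alibi: `row_wise_mask`, `sample_proba`, and the `max(1, ...)` floor are assumptions chosen to show how sampling per sentence makes the "at least one word" guarantee easy to enforce.

```python
import numpy as np

def row_wise_mask(n_sentences, n_words, sample_proba=0.5, seed=0):
    """Hypothetical row-wise sampler: every row gets at least one True."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((n_sentences, n_words), dtype=bool)
    for r in range(n_sentences):
        # Draw the number of words to perturb in this sentence, at least 1.
        n_changed = max(1, rng.binomial(n_words, sample_proba))
        # Pick that many distinct word positions in the sentence.
        cols = rng.choice(n_words, size=n_changed, replace=False)
        mask[r, cols] = True
    return mask

words = np.array([
    ["word11", "word12", "word13"],
    ["word21", "word22", "word23"],
    ["word31", "word32", "word33"],
])
row_mask = row_wise_mask(*words.shape)
# Replace the selected positions with the UNK token.
perturbed = np.where(row_mask, "UNK", words)
print(perturbed)
```

The trade-off is that flooring the count at 1 shifts the per-word perturbation probability slightly upward compared with independent draws, which may or may not matter for the anchor precision estimates.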

Is row-wise sampling for unknown/similarity necessary or not? Would it be a useful way to standardize across methods, or is it not really that important?

jklaise commented 3 years ago

@arnaudvl mentioned that there was a good reason for doing this in the original implementation. It would be good to revisit.