AnchorText - row-wise sampling for unknown/similarity

Currently, the unknown and similarity perturbation strategies use a column-wise sampling procedure to choose the words to be replaced by the UNK token and by similar words, respectively. For a column i (which corresponds to a word/token), the procedure consists of the following steps:

1. The number of sentences in which the word in column i will be perturbed, n_changed, is drawn from a binomial distribution.
2. n_changed indices are chosen uniformly at random without replacement from [0, 1, ..., num_samples - 1], where num_samples is the number of requested samples. These are the indices of the sentences in which the word in column i will be perturbed.
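
The two steps above can be sketched in plain Python as follows. This is only an illustration of the procedure as described here, not the code in alibi: the function names, the `p_sample` parameter (the success probability of the binomial draw) and the boolean-mask representation are assumptions made for the example.

```python
import random


def column_wise_mask(num_samples, num_words, p_sample=0.5, seed=0):
    """Return mask[row][col] == True where word `col` of sample `row` is perturbed.

    Hypothetical helper: `p_sample` and the mask layout are assumptions.
    """
    rng = random.Random(seed)
    mask = [[False] * num_words for _ in range(num_samples)]
    for col in range(num_words):
        # Step 1: Binomial(num_samples, p_sample) draw, as a sum of Bernoulli trials.
        n_changed = sum(rng.random() < p_sample for _ in range(num_samples))
        # Step 2: n_changed row indices, uniformly at random without replacement.
        for row in rng.sample(range(num_samples), n_changed):
            mask[row][col] = True
    return mask


def unperturbed_rows(mask):
    """Indices of samples in which no word was selected for perturbation."""
    return [r for r, row in enumerate(mask) if not any(row)]
```

Calling `unperturbed_rows(column_wise_mask(num_samples=100, num_words=3))` lists the samples that the column-wise procedure left untouched, if any.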

Because the perturbation is performed column-wise, it can happen that some sentences will not be perturbed at all. For example, if we have 3 sentences:

word11 word12 word13 ---> UNK    word12 UNK
word21 word22 word23 ---> word21 word22 word23
word31 word32 word33 ---> UNK    UNK    word33
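
For a rough sense of how often this happens, assume (the issue does not state the parameters) that n_changed for each column is drawn from Binomial(num_samples, p) and that columns are sampled independently. Drawing a binomial count and then a uniform subset of that size is equivalent to selecting each sentence independently with probability p, so for a sentence with m words:

$$
P(\text{sentence left unperturbed}) = (1 - p)^{m}
$$

Under these assumptions, with p = 0.5 and m = 3 as in the example above, each sentence is left unperturbed with probability 0.125.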

AnchorText with language models implements a row-wise sampling procedure, which ensures that at least one word is perturbed (masked) in every sentence. Note that column-wise sampling would need additional work to guarantee that each sentence has at least one perturbed word.
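
One possible shape for row-wise sampling is sketched below. It is not the language-model sampler from alibi: the function name, the `p_sample` parameter and the choice to clamp the per-sentence count to a minimum of one are assumptions, used only to show how the "at least one word" guarantee falls out naturally when sampling per sentence.

```python
import random


def row_wise_mask(num_samples, num_words, p_sample=0.5, seed=0):
    """Return mask[row][col] == True where word `col` of sample `row` is perturbed.

    Hypothetical helper: every row gets at least one perturbed word.
    Requires num_words >= 1.
    """
    rng = random.Random(seed)
    mask = []
    for _ in range(num_samples):
        # Binomial(num_words, p_sample) draw, as a sum of Bernoulli trials.
        n_changed = sum(rng.random() < p_sample for _ in range(num_words))
        # Assumed rule: clamp so that at least one word is always perturbed.
        n_changed = max(1, n_changed)
        row = [False] * num_words
        # Word positions chosen uniformly at random without replacement.
        for col in rng.sample(range(num_words), n_changed):
            row[col] = True
        mask.append(row)
    return mask
```

Note that clamping shifts the distribution of perturbed words slightly compared with an unclamped binomial draw, which may matter if the two sampling schemes are meant to be statistically comparable.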

Is row-wise sampling for unknown/similarity necessary? Would it be a worthwhile convenience for standardizing across methods, or is it not really that important?

SeldonIO / alibi

AnchorText - row-wise sampling for unknown/similarity #434