Currently, the `unknown` and `similarity` perturbation strategies implement a column-wise sampling procedure for the words to be replaced by the `UNK` token and by similar words, respectively. For a column `i` (which corresponds to a word/token), the procedure consists of the following steps:
1. The number of words to be perturbed (`n_changed`) is drawn from a binomial distribution.
2. `n_changed` indices are chosen uniformly at random without replacement from `[0, 1, ..., num_samples - 1]`, where `num_samples` is the number of requested samples. These are the indices of the sentences in which the word in column `i` will be perturbed.
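The two steps above can be sketched as follows. This is only an illustration, not the library's code: the function name `column_wise_mask`, the probability `p`, and the choice of `num_samples` as the number of binomial trials are assumptions made for the example.

```python
import numpy as np

def column_wise_mask(num_samples, num_words, p, seed=0):
    """Hypothetical helper: boolean mask, True where a word is perturbed."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((num_samples, num_words), dtype=bool)
    for i in range(num_words):
        # Step 1: how many sentences get the word in column i perturbed.
        # Assumption: binomial with num_samples trials and probability p.
        n_changed = rng.binomial(num_samples, p)
        # Step 2: pick that many sentence indices without replacement.
        rows = rng.choice(num_samples, size=n_changed, replace=False)
        mask[rows, i] = True
    return mask

mask = column_wise_mask(num_samples=3, num_words=4, p=0.5)
# Indices of sentences in which no word was selected for perturbation.
untouched = np.flatnonzero(~mask.any(axis=1))
```

Since each column is sampled independently, nothing in this sketch prevents a row of `mask` from being all `False`.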
Because the perturbation is performed column-wise, it can happen that some sentences will not be perturbed at all. For example, if we have 3 sentences:
AnchorText with language models implements a row-wise sampling procedure, ensuring that at least one word is perturbed (masked) in each sentence. Note that column-wise sampling may need additional work to ensure that each sentence has at least one word perturbed.
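For comparison, a minimal sketch of what row-wise sampling with the "at least one word" guarantee could look like. The helper `row_wise_mask`, the parameter `p`, and the `max(1, ...)` clamp are assumptions for illustration; the language-model sampler may enforce the guarantee differently.

```python
import numpy as np

def row_wise_mask(num_samples, num_words, p, seed=0):
    """Hypothetical helper: sample perturbed positions one sentence at a time."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((num_samples, num_words), dtype=bool)
    for r in range(num_samples):
        # Assumption: clamp the binomial draw so every sentence
        # has at least one perturbed word.
        n_changed = max(1, rng.binomial(num_words, p))
        cols = rng.choice(num_words, size=n_changed, replace=False)
        mask[r, cols] = True
    return mask
```

With this layout the guarantee is local to each row, so no post-processing pass over the sample matrix is needed.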
Is row-wise sampling for `unknown`/`similarity` necessary or not? Is it mainly a convenience for standardizing across methods, or not really that important?