Synonym replacement should be the core of my framework, as IMO this is the most interesting and the easiest approach. We want to address 3 different layers:
Synonym Database
Replacement Method
Synonym Selection
For each of these layers I am choosing 3-4 methods that proved effective in previous work, as listed in Table 2 of [1].
Substitutable words are nouns, verbs, adjectives, or adverbs that are not part of a named entity. Each word is replaced with a certain probability.
Only adverbs/adjectives, sometimes nouns, more rarely verbs
No time words, prepositions, or mimetic words
No stop words. n random words are replaced (SR) or synonyms are inserted at random positions (RI)
The last point simply refers to 2 of the techniques that define EDA. All 4 EDA techniques should eventually be implemented and used in different behaviours, but for this issue only the ones based on synonyms (SR and RI) should be implemented.
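As a rough illustration of the two synonym-based EDA operations mentioned above (SR and RI), here is a minimal sketch. The `SYNONYMS` dictionary, the `STOP_WORDS` set and all function names are hypothetical placeholders and not part of the framework yet; a real synonym database would take the place of the toy dictionary.

```python
import random

# Hypothetical toy synonym database; a real database would replace this.
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "cheerful"],
    "dog": ["hound"],
}
# Hypothetical stop word list, kept tiny for the example.
STOP_WORDS = {"the", "a", "is", "and", "over"}


def candidates(tokens):
    """Indices of tokens that are not stop words and have at least one synonym."""
    return [
        i for i, tok in enumerate(tokens)
        if tok.lower() not in STOP_WORDS and tok.lower() in SYNONYMS
    ]


def synonym_replacement(tokens, n, rng):
    """SR: replace up to n randomly chosen eligible words with a synonym."""
    out = list(tokens)
    idx = candidates(out)
    for i in rng.sample(idx, min(n, len(idx))):
        out[i] = rng.choice(SYNONYMS[out[i].lower()])
    return out


def random_insertion(tokens, n, rng):
    """RI: n times, insert a synonym of a random eligible word at a random position."""
    out = list(tokens)
    idx = candidates(tokens)
    if not idx:
        return out  # nothing eligible, return the sentence unchanged
    for _ in range(n):
        word = tokens[rng.choice(idx)].lower()
        # randint is inclusive, so position len(out) appends at the end
        out.insert(rng.randint(0, len(out)), rng.choice(SYNONYMS[word]))
    return out
```

Passing an explicit `random.Random` instance keeps augmentations reproducible, e.g. `synonym_replacement("the quick dog is happy".split(), 2, random.Random(0))`. Here the synonym is picked uniformly; other selection strategies could be plugged in at the `rng.choice` call.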
Synonym Selection
Remaining probability shared among synonyms based on language model score
Uniform random
Chi-square statistics method (TBD)
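A possible sketch of the first two selection strategies follows. It assumes that "remaining probability" means the mass left over after reserving a probability `p_keep` for keeping the original word; that reading, the parameter `p_keep`, and the idea of passing language model scores in as a plain dictionary are my assumptions, and the function names are hypothetical. The chi-square method is left out because it is still TBD.

```python
import random


def select_uniform(synonyms, rng):
    """Uniform random: every synonym is equally likely."""
    return rng.choice(synonyms)


def selection_probs(word, scored_synonyms, p_keep):
    """Keep `word` with probability p_keep and split the remaining
    1 - p_keep among the synonyms in proportion to their scores.

    `scored_synonyms` maps synonym -> non-negative score (assumed to come
    from some language model; how the scores are computed is out of scope here).
    """
    total = sum(scored_synonyms.values())
    probs = {word: p_keep}
    for syn, score in scored_synonyms.items():
        if total > 0:
            share = score / total
        else:
            share = 1.0 / len(scored_synonyms)  # all scores zero: fall back to uniform
        probs[syn] = (1.0 - p_keep) * share
    return probs


def select_weighted(word, scored_synonyms, p_keep, rng):
    """Draw either the original word or one synonym according to selection_probs."""
    probs = selection_probs(word, scored_synonyms, p_keep)
    words = list(probs)
    return rng.choices(words, weights=[probs[w] for w in words], k=1)[0]
```

For example, with scores `{"fast": 3.0, "speedy": 1.0}` and `p_keep = 0.2`, the remaining 0.8 is split 3:1, so "fast" gets 0.6 and "speedy" gets 0.2.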
[1] Markus Bayer, Marc-André Kaufhold, Christian Reuter (2021) A Survey on Data Augmentation for Text Classification