djaszak/NLPAug - Githubissues

The structure of this framework is based on a survey [1] by Markus Bayer et al., which compares many different data augmentation methods for NLP. The survey also categorizes these methods, and that categorization forms the core structure of the framework. There are four levels of augmentation methods in the data space, documented below.

Character Level
    Noise -> Introduces errors into data
    Rule based -> Insertion of spelling mistakes, data alterations, entity names and abbreviations
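
As a rough illustration of the character-level noise described above, here is a minimal sketch in Python. The function name `char_noise`, its parameters, and the choice of three operations (delete, substitute, swap) are assumptions made for this example; they are not taken from the survey or from this repository.

```python
import random
import string
from typing import Optional


def char_noise(text: str, p: float = 0.1, seed: Optional[int] = None) -> str:
    """Corrupt each letter with probability p (hypothetical helper).

    A selected letter is deleted, replaced by a random lowercase
    letter, or swapped with the character that follows it.
    """
    rng = random.Random(seed)
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if c.isalpha() and rng.random() < p:
            op = rng.choice(("delete", "substitute", "swap"))
            if op == "delete":
                i += 1
                continue
            if op == "substitute":
                out.append(rng.choice(string.ascii_lowercase))
                i += 1
                continue
            if op == "swap" and i + 1 < len(chars):
                # Emit the next character first, then the current one.
                out.extend([chars[i + 1], c])
                i += 2
                continue
        # Unselected characters (and a swap at the last position) pass through.
        out.append(c)
        i += 1
    return "".join(out)


if __name__ == "__main__":
    print(char_noise("data augmentation for text", p=0.2, seed=42))
```

Passing a `seed` keeps the augmentation reproducible, which helps when comparing methods against each other.
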
Word Level
    Noise
        Unigram noising -> Replacing words by different random words
        Blank noising -> Replacing words by "_"
        Syntactic noise -> Shortening, alteration of adjectives
        Semantic noise -> Lexical substitution of synonyms (See next point)
        Random swap (EDA)
        Random deletion (EDA)
        Noise instead of zero-padding
        TF-IDF -> Replace uninformative words with other uninformative words
    Synonyms
        Pages 9-11 of [1] provide a table of different replacement methods and synonym selections. Choose the ones that deliver positive results and implement 3-4 of them
    Embeddings
        Pages 12-14 of [1] provide a similar table
        Personally, I still have trouble fully grasping what exactly this approach does, so this will be interesting once I go into the details
    Language Models
        Generate similar words with embeddings, highly contextualized
        Page 15 method table
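
Three of the word-level noise methods listed above (random swap, random deletion, blank noising) are simple enough to sketch directly. The following is a minimal sketch, assuming the input is already tokenized into a list of words; the function names and default probabilities are illustrative choices, not the framework's actual API.

```python
import random
from typing import List, Optional, Sequence


def random_swap(tokens: Sequence[str], n_swaps: int = 1,
                seed: Optional[int] = None) -> List[str]:
    """Swap two randomly chosen positions, n_swaps times."""
    rng = random.Random(seed)
    result = list(tokens)
    if len(result) < 2:
        return result
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(result)), 2)
        result[i], result[j] = result[j], result[i]
    return result


def random_deletion(tokens: Sequence[str], p: float = 0.1,
                    seed: Optional[int] = None) -> List[str]:
    """Drop each token with probability p, keeping at least one token."""
    rng = random.Random(seed)
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    # If everything was deleted, fall back to a single random token.
    return kept if kept else [rng.choice(list(tokens))]


def blank_noising(tokens: Sequence[str], p: float = 0.1, blank: str = "_",
                  seed: Optional[int] = None) -> List[str]:
    """Replace each token with the blank symbol with probability p."""
    rng = random.Random(seed)
    return [blank if rng.random() < p else t for t in tokens]


if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog".split()
    print(random_swap(words, n_swaps=2, seed=0))
    print(random_deletion(words, p=0.3, seed=0))
    print(blank_noising(words, p=0.3, seed=0))
```

All three functions return a new list and leave the input untouched, so several augmentations can be applied to the same sentence independently.
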
Phrase Level
    Structure
        POS-Tagging
            Cropping -> Shorten sentences by putting the focus on subject and object
            Rotation -> Move flexible fragments
        Semantic Text Exchange method
    Interpolation
        Substructure substitution -> Substitute substructures if same tagged label; 4 replacement rules that can be used in any combination
Document Level
    Translation
        Round-trip translation (RTT) -> Translate word, phrase, sentence or document into another language, then translate back -> Augmented data
    Generative
        Generate completely artificial new data
        Pages 21-22 of [1] give some possibilities
        This is the most complicated and newest approach, so further information will be written down in a separate issue
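
The round-trip translation step listed under Document Level can be sketched independently of any particular translation backend. The example below is only a sketch under assumptions: `round_trip_translate`, `toy_translate`, and the tiny word tables are hypothetical stand-ins invented for illustration. A real setup would pass in a function that wraps an actual machine translation model or service.

```python
from typing import Callable

# Assumed signature for a translation backend:
# (text, source_language, target_language) -> translated text
Translator = Callable[[str, str, str], str]


def round_trip_translate(text: str, translate: Translator,
                         source: str = "en", pivot: str = "de") -> str:
    """Translate text into a pivot language and back again."""
    forward = translate(text, source, pivot)
    return translate(forward, pivot, source)


# Toy word-by-word tables, purely illustrative. The back-translation maps
# "schnelle" to "fast" rather than "quick", so the round trip yields a
# paraphrase of the input.
_TOY_TABLES = {
    ("en", "de"): {"the": "der", "quick": "schnelle", "fox": "fuchs"},
    ("de", "en"): {"der": "the", "schnelle": "fast", "fuchs": "fox"},
}


def toy_translate(text: str, source: str, target: str) -> str:
    """Hypothetical stand-in for a real translation model."""
    table = _TOY_TABLES[(source, target)]
    return " ".join(table.get(word, word) for word in text.split())


if __name__ == "__main__":
    print(round_trip_translate("the quick fox", toy_translate))
    # prints "the fast fox"
```

Keeping the translator as a parameter means the same function works on words, phrases, sentences or whole documents, and the backend can be swapped without touching the augmentation code.
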

References

[1] Markus Bayer, Marc-André Kaufhold, Christian Reuter (2021). A Survey on Data Augmentation for Text Classification.
