GlobalMaksimum / sadedegel

A General Purpose NLP library for Turkish
http://sadedegel.ai
MIT License

Character Repetition Correction #268

Open onatyap opened 3 years ago

onatyap commented 3 years ago

The Hotel Review Dataset contains reviews in which characters are repeated to emphasize a word, as in: "Süppeeeer", "Berbaaat", "Muhteşeemmmm"

Since the tokenizer cannot tokenize these words correctly, I wrote a regular expression that detects repetitions and corrects them by removing the duplicated characters. This preprocessing step increased the 3-fold cross-validation score on the Hotel Review Dataset by 2%.

The outputs for the examples above are as follows: "Süper", "Berbat", "Muhteşem"
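
For reference, here is a minimal sketch of what such a regex-based step might look like. The exact expression used in the experiment is not shown in this thread, so the pattern and the function name `collapse_repeats` are assumptions; this version simply collapses every run of an identical character to a single character.

```python
import re

# Hypothetical helper: the name and pattern are illustrative assumptions,
# not the expression actually used for the Hotel Review experiment.
_REPEAT = re.compile(r"(.)\1+")  # a character followed by one or more copies of itself


def collapse_repeats(text: str) -> str:
    """Collapse each run of an identical character to a single character."""
    return _REPEAT.sub(r"\1", text)


print(collapse_repeats("Süppeeeer"))     # Süper
print(collapse_repeats("Berbaaat"))      # Berbat
print(collapse_repeats("Muhteşeemmmm"))  # Muhteşem
```

A rule this simple also rewrites legitimate double letters, for example "saat" becomes "sat", so it would likely need a guard such as a vocabulary lookup before being used for general normalization.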

Considering that repetitions are common, this method can be useful as a preprocessing step in sadedegel. What is your opinion on this?

husnusensoy commented 3 years ago

Note that sadedegel is a library, so we should be picky about adding "features". The problem you solved is part of a more general problem called normalization. Here is the roadmap for adding such a feature:

  1. Generate your test data (sentences). Define your experiment well and make sure to include counterexamples: saat, menfaat, and faal are all valid words, so no correction should be applied to them.
  2. Apply your normalization technique and report:
    • Performance
    • Accuracy (false positives and false negatives, of course)
  3. Prove that your technique improves several tasks when enabled.
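
Steps 1 and 2 could be sketched roughly as follows. This is only an illustration under stated assumptions: `collapse_repeats` is a hypothetical stand-in for the proposed regex (its actual form is not given in this thread), and the test cases are just the words mentioned in this issue, not a real test set.

```python
import re


def collapse_repeats(text: str) -> str:
    # Hypothetical stand-in for the proposed normalizer (an assumption).
    return re.sub(r"(.)\1+", r"\1", text)


# (raw input, expected output) pairs; valid words must come back unchanged.
CASES = [
    ("Süppeeeer", "Süper"),
    ("Berbaaat", "Berbat"),
    ("Muhteşeemmmm", "Muhteşem"),
    ("saat", "saat"),        # counterexample: already a valid word
    ("menfaat", "menfaat"),  # counterexample
    ("faal", "faal"),        # counterexample
]


def evaluate(normalize, cases):
    """Count correct outputs, false positives and false negatives."""
    report = {"correct": 0, "false_positives": 0, "false_negatives": 0}
    for raw, expected in cases:
        if normalize(raw) == expected:
            report["correct"] += 1
        elif raw == expected:
            # A valid word was altered by the normalizer.
            report["false_positives"] += 1
        else:
            # An elongated word was not restored to its expected form.
            report["false_negatives"] += 1
    return report


report = evaluate(collapse_repeats, CASES)
print(report)
```

On these six words the naive rule fixes the three elongated forms but alters all three counterexamples, which is the kind of false-positive figure the accuracy report should make visible.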

askarbozcan commented 3 years ago

See #190

ertugrul-dmr commented 3 years ago

I've tested the repetition correction for you @onatyap as you asked. Here are the results:

| Prebuilt Model | Original Result | Preprocessed Result |
| --- | --- | --- |
| Tweet Sentiment Classification | 3-Fold F-1: 0.8640, 5-Fold F-1: 0.8669 | 3-Fold F-1: 0.8587, 5-Fold F-1: 0.8640 |
| Movie Review Sentiment Classification | F-1: 0.8258 | F-1: 0.8242 |
| Telco Tweet Sentiment Classification | F-1: 0.6871, Accuracy: 0.6925 | F-1: 0.696, Accuracy: 0.691 |
| Turkish Customer Reviews Classification | F-1: 0.851 | F-1: 0.852 |