Open: onatyap opened this issue 3 years ago
Note that sadedegel is a library, so we should be picky about adding "features". The problem you solved is part of a more general problem called normalization. Here is the roadmap for adding such a feature:
See #190
I've tested the repetition correction for you @onatyap as you asked. Here are the results:
| Prebuilt Model | Original Result | Preprocessed Result |
|---|---|---|
| Tweet Sentiment Classification | 3-Fold F-1: 0.8640, 5-Fold F-1: 0.8669 | 3-Fold F-1: 0.8587, 5-Fold F-1: 0.8640 |
| Movie Review Sentiment Classification | F-1: 0.8258 | F-1: 0.8242 |
| Telco Tweet Sentiment Classification | F-1: 0.6871, Accuracy: 0.6925 | F-1: 0.696, Accuracy: 0.691 |
| Turkish Customer Reviews Classification | F-1: 0.851 | F-1: 0.852 |
The Hotel Review Dataset contains reviews in which characters are repeated to emphasize a word, as in:
- "Süppeeeer"
- "Berbaaat"
- "Muhteşeemmmm"
Since the tokenizer cannot tokenize these words correctly, I wrote a regular expression that detects repeated characters and collapses the duplicates. This preprocessing step increased the 3-fold cross-validation score on the Hotel Review Dataset by 2%.
The outputs for the examples above are as follows:
- "Süper"
- "Berbat"
- "Muhteşem"
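
For reference, here is a minimal sketch of how such a correction could look. The exact regular expression is not shown in this issue, so the pattern and the `collapse_repetitions` helper below are assumptions for illustration, not the code that produced the scores. It simply collapses every run of the same letter into a single occurrence, which reproduces the three outputs above.

```python
import re

# Assumption: a run of two or more identical letters is always an emphasis
# repetition. [^\W\d_] matches any Unicode letter (no digits, no underscore),
# so numbers such as "100" are left untouched.
_REPEATED_LETTER = re.compile(r"([^\W\d_])\1+")


def collapse_repetitions(text: str) -> str:
    """Collapse each run of the same letter into one occurrence (hypothetical helper)."""
    return _REPEATED_LETTER.sub(r"\1", text)


for word in ["Süppeeeer", "Berbaaat", "Muhteşeemmmm"]:
    print(word, "->", collapse_repetitions(word))
```

One limitation of this naive rule: it also shortens words that legitimately contain double letters, for example "saat" becomes "sat" and "dikkat" becomes "dikat". A more careful version would probably need a vocabulary check before collapsing.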
Considering that repetitions are common, this method can be useful as a preprocessing step in sadedegel. What is your opinion on this?