Need semantically correct handling of Unicode apostrophes

KshitijKarthick commented 8 years ago

Code section to be looked into

hccorpus_preprocessor.py : _clean_word, _tokenize_sentences
Determine whether the commented-out code in _tokenize_sentences should be included, and whether punctuation handling is needed at the sentence level or whether word-level handling suffices
Cases that need to be satisfied:
i'm -> i
test's -> test
tests' -> tests
hello, -> hello // Resolved by #1
world? -> world // Resolved by #1

prarthana-s commented 8 years ago

Further fixes which are possible:
haven't -> have not
doesn't -> does not
we've -> we have
it's -> it is

Ambiguity: John's house -> 'John is house' instead of 'John house'
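One way to get these expansions without the "John's" problem is a lookup table instead of a general "'s -> is" rule. The sketch below is only an illustration for Python 3: `CONTRACTIONS` and `expand` are hypothetical names, not part of the preprocessor, and the table holds only the four expansions listed above.

```python
# Hypothetical lookup table; only the expansions listed above are included.
CONTRACTIONS = {
    "haven't": "have not",
    "doesn't": "does not",
    "we've": "we have",
    "it's": "it is",
}


def expand(word):
    """Return the expansion for a known contraction, else the word unchanged."""
    # Treat the Unicode apostrophe U+2019 like the ASCII apostrophe.
    key = word.lower().replace("\u2019", "'")
    return CONTRACTIONS.get(key, word)
```

A possessive such as "John's" is not in the table, so it is returned unchanged and can be handled by the apostrophe-stripping step instead.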

prarthana-s commented 8 years ago

Running into encoding problems: the apostrophe arrives encoded as \xe2\x80\x99 (the UTF-8 bytes for ’, U+2019), so the regex for replacing punctuation doesn't apply here. Found this function to fix it:

http://stackoverflow.com/questions/27996448/python-encoding-decoding-problems
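For reference, a minimal Python 3 sketch of what is going on (the thread itself is Python 2, and the sample word is made up): the three bytes decode to a single character, U+2019, which is not the ASCII apostrophe, so a pattern that looks for ' never matches it.

```python
raw = b"don\xe2\x80\x99t"        # bytes as they would be read from a UTF-8 file
text = raw.decode("utf-8")       # str containing U+2019 instead of three bytes

# The decoded apostrophe is a different character from the ASCII one.
has_ascii_apostrophe = "'" in text

# One possible normalization: map U+2019 to the ASCII apostrophe.
normalized = text.replace("\u2019", "'")
```

Whether to normalize to ASCII or to match both characters in the regex is a design choice; the sketch only shows the first option.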

prarthana-s commented 8 years ago

pattern=ur"(\p{P}+[a-zA-Z]*)", repl=' '

Resolves:
i'm -> i
test's -> test

Causes: (Picture is here -> is here (the word after the bracket is removed too)
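The behaviour can be reproduced with a small sketch. The standard `re` module does not support `\p{P}`, so this Python 3 version emulates it with a character class built from `unicodedata`; the names `PUNCT_CLASS`, `PATTERN` and `clean` are hypothetical, and the whitespace collapsing at the end is an assumption about how the result is tokenized afterwards.

```python
import re
import sys
import unicodedata

# Emulate \p{P}: every code point whose Unicode category starts with "P".
PUNCT = "".join(
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("P")
)
PUNCT_CLASS = "[" + re.escape(PUNCT) + "]"

# Same shape as (\p{P}+[a-zA-Z]*): punctuation plus any ASCII letters after it.
PATTERN = re.compile("(" + PUNCT_CLASS + "+[a-zA-Z]*)")


def clean(text):
    """Replace each match with a space, then collapse runs of whitespace."""
    return " ".join(PATTERN.sub(" ", text).split())
```

With this, `clean("i'm")` gives `"i"`, but `clean("(Picture is here")` gives `"is here"`, because the letters following any punctuation mark are consumed, not only those following an apostrophe.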

prarthana-s commented 8 years ago

When I apply the _clean_data function, the apostrophe gets encoded again, giving strings such as \xe2\x80\x99t.

KshitijKarthick commented 8 years ago

@prarthana-s You don't need to worry about that when _clean_word is used in hccorpus_preprocessor.py in the preprocessor module. When the corpus preprocessor object is instantiated, the data from the file is read with the required encoding and a Unicode string is passed to the _clean_word function.

KshitijKarthick commented 8 years ago

Under commit 9b918f0bc1463702e986e4b89af676de77202d09, _clean_word uses the character class [a-zA-Z] for blacklisting, which is specific to English.

This could lead to problems when a corpus in another language is used; we should look for another approach to Unicode preprocessing of words.
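To illustrate the concern, here is a hedged Python 3 sketch. It assumes the class is used to consume the letters after an apostrophe, as in the earlier pattern; the actual code in that commit may differ, and the French sample word is made up.

```python
import re

# Suffix limited to English letters versus any Unicode word character.
ASCII_ONLY = re.compile(r"['\u2019][a-zA-Z]*")
ANY_SCRIPT = re.compile(r"['\u2019]\w*")   # \w is Unicode-aware for str patterns

word = "l'\u00e9t\u00e9"   # "l'été", written with escapes to avoid encoding issues

ascii_result = ASCII_ONLY.sub("", word)    # only the apostrophe is removed
unicode_result = ANY_SCRIPT.sub("", word)  # apostrophe and the letters after it
```

The ASCII class stops at the first non-English letter and leaves `"lété"`, while the Unicode-aware version leaves `"l"`. Which result is right for a given language is a separate question; the point is only that the two patterns diverge once the text is not English.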

KshitijKarthick commented 8 years ago

@prarthana-s In commit 1eaa75d574390a8b9000358161a386ffb279f886 the regular expression is updated to "((\p{P}+)|(\p{S}))" to remove punctuation and unnecessary symbols. Can you evaluate how this differs from "((\p{P}+)|(\p{S}+))"? You mentioned some problems and uses with '+' in the above regular expressions. Can you note them here?

prarthana-s commented 8 years ago

Punctuation can appear in runs, as in "Hello...", so we need the + to remove all occurrences of the period in one match. Currency symbols usually occur only once, hence no need for +.

The problem was with the regex (\p{P}+[a-zA-Z]*). Given something like "(Please pass that to me.", it returned only " pass that to me", because the left bracket and the letters following it were both removed.
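The difference between the two patterns can be seen by counting replacements. This is a Python 3 sketch under assumptions: `\p{P}` and `\p{S}` are emulated with classes built from `unicodedata` (the standard `re` module does not support them), the replacement string is a single space, and `char_class`, `SINGLE_S` and `PLUS_S` are hypothetical names.

```python
import re
import sys
import unicodedata


def char_class(prefix):
    """Build a regex character class for a Unicode general category prefix."""
    chars = "".join(
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)).startswith(prefix)
    )
    return "[" + re.escape(chars) + "]"


P, S = char_class("P"), char_class("S")

SINGLE_S = re.compile("((" + P + "+)|(" + S + "))")   # ((\p{P}+)|(\p{S}))
PLUS_S = re.compile("((" + P + "+)|(" + S + "+))")    # ((\p{P}+)|(\p{S}+))

# subn returns (new_string, number_of_replacements).
hello = SINGLE_S.subn(" ", "Hello...")     # one match covers all three periods
money_single = SINGLE_S.subn(" ", "$$5")   # each "$" is replaced separately
money_plus = PLUS_S.subn(" ", "$$5")       # the "$$" run is replaced once
```

The two patterns only differ on runs of symbols: without + each symbol yields its own space, with + the whole run yields one. If whitespace is collapsed afterwards, the resulting tokens are the same either way.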

KshitijKarthick commented 8 years ago

Correct Handling of Punctuation & Apostrophes

These are the scenarios that are resolved semantically correctly in the preprocessor modules. If any other case needs to be covered, kindly reopen the issue.

Scenario: they're => they [ ASCII Apostrophe ]

HcCorpusPreprocessor: Commit 93df12d50a795c83a87479b72173eecdcbb114f6
LeipzigPreprocessor: Commit 767cd14648e56641ecac2ea79316d3df89a750da
EmilleCorpusPreprocessor: Commit 3b2bdc2c94babe6d759ed138d6af0f7cf9a71c6c

Scenario: they’re => they [ Unicode Apostrophe ]

HcCorpusPreprocessor: Commit c4cea60b3d5f83e7b03b1ea61125b1316d03bad9
LeipzigPreprocessor: Commit 2bda14d247525ba713030a944212eabe89932ad1
EmilleCorpusPreprocessor: Commit 866181d51dc9c2274d11b466a1575652d61c24b9

Scenario: 'hello' => hello , 'सफलता' => सफलता

HcCorpusPreprocessor: Commit 93df12d50a795c83a87479b72173eecdcbb114f6
LeipzigPreprocessor: Commit 767cd14648e56641ecac2ea79316d3df89a750da
EmilleCorpusPreprocessor: Commit 3b2bdc2c94babe6d759ed138d6af0f7cf9a71c6c

Scenario: ice-cream => [ ice, cream ]

HcCorpusPreprocessor: Commit 14e1e9cccd6464206cff1d56a51df6712f589ba5
LeipzigPreprocessor: Commit 14e1e9cccd6464206cff1d56a51df6712f589ba5
EmilleCorpusPreprocessor: Commit 14e1e9cccd6464206cff1d56a51df6712f589ba5
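The four scenarios above can be summarized in one small sketch. This is not the code from the listed commits; `strip_edges` and `tokenize_word` are hypothetical Python 3 helpers that rely on Unicode categories rather than [a-zA-Z], so the same logic applies to Devanagari as to English.

```python
import unicodedata

APOSTROPHES = "'\u2019"   # ASCII apostrophe and U+2019


def strip_edges(word):
    """Remove punctuation (Unicode category P*) from both ends of a word."""
    chars = list(word)
    while chars and unicodedata.category(chars[0]).startswith("P"):
        chars.pop(0)
    while chars and unicodedata.category(chars[-1]).startswith("P"):
        chars.pop()
    return "".join(chars)


def tokenize_word(word):
    """Hypothetical helper covering the four scenarios listed above."""
    tokens = []
    for part in word.split("-"):          # ice-cream => ice, cream
        core = strip_edges(part)          # 'hello' => hello
        # Keep only the text before the first interior apostrophe.
        for mark in APOSTROPHES:
            core = core.split(mark, 1)[0]
        if core:
            tokens.append(core)
    return tokens
```

For example, `tokenize_word("they're")` and `tokenize_word("they\u2019re")` both give `["they"]`, and `tokenize_word("ice-cream")` gives `["ice", "cream"]`.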

KshitijKarthick / tvecs