KshitijKarthick / tvecs

Establish Semantic Relatedness across Languages. Documentation: http://kshitijkarthick.github.io/tvecs
https://tvecs.kshitijkarthick.me
MIT License

Need semantically correct handling of Unicode apostrophes #3

Closed KshitijKarthick closed 8 years ago

KshitijKarthick commented 8 years ago
Code section to be looked into
prarthana-s commented 8 years ago

Further fixes which are possible:
haven't -> have not
doesn't -> does not
we've -> we have
it's -> it is

Ambiguity: a possessive 's would be expanded the same way, so John's house -> 'John is house' instead of 'John house'
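
The expansions listed above could be sketched as a small lookup table. This is only an illustration in Python 3, not the project's code: the names `CONTRACTIONS` and `expand_contractions` are hypothetical, and the table holds just the four cases from this comment. Possessives such as John's are deliberately left untouched, because a blanket 's -> is rule is what produces 'John is house'.

```python
import re

# Hypothetical lookup table; only the four cases mentioned in this thread.
CONTRACTIONS = {
    "haven't": "have not",
    "doesn't": "does not",
    "we've": "we have",
    "it's": "it is",
}

# Match any known contraction as a whole word, ignoring case.
_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)


def expand_contractions(text):
    """Expand known contractions; anything not in the table is left as-is."""
    # Treat the Unicode apostrophe (U+2019) like the ASCII one first.
    text = text.replace("\u2019", "'")
    # Note: the replacement is always lower case, even for "It's".
    return _CONTRACTION_RE.sub(
        lambda m: CONTRACTIONS[m.group(0).lower()], text
    )
```

Because only listed forms are rewritten, "John's house" passes through unchanged and can be handled by a separate rule.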

prarthana-s commented 8 years ago

Running into encoding problems. The Unicode apostrophe ’ (U+2019) shows up as the UTF-8 bytes \xe2\x80\x99, so the regex for replacing punctuation doesn't match it. Found this function to fix it:

http://stackoverflow.com/questions/27996448/python-encoding-decoding-problems
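
To illustrate the encoding problem (this is not the function from the linked answer), here is a minimal Python 3 sketch. It assumes the corpus is UTF-8; `normalize_apostrophes` is a hypothetical helper name. Once the bytes are decoded, the apostrophe is a single code point that an ordinary replacement or regex can match.

```python
# Bytes as they would come out of a UTF-8 encoded file.
raw = b"don\xe2\x80\x99t"

# Decoding turns the three bytes \xe2\x80\x99 into one code point, U+2019.
text = raw.decode("utf-8")


def normalize_apostrophes(s):
    """Map the Unicode apostrophe (U+2019) to the ASCII apostrophe."""
    return s.replace("\u2019", "'")
```

After decoding, `text` has five characters rather than seven bytes, and `normalize_apostrophes(text)` gives the plain ASCII form that an apostrophe-aware pattern can handle.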

prarthana-s commented 8 years ago

pattern=ur"(\p{P}+[a-zA-Z]*)", repl=' '

Resolves:
i'm -> i
test's -> test

Causes: "(Picture is here" -> "is here" (the word after the bracket is removed along with the bracket)

prarthana-s commented 8 years ago

When I apply the _clean_data function, the apostrophes get encoded again in the Unicode strings, e.g. \xe2\x80\x99t.

KshitijKarthick commented 8 years ago

@prarthana-s You don't need to worry about that when using _clean_word in hccorpus_preprocessor.py in the preprocessor module. When the corpus preprocessor object is instantiated, the data from the file is read with the required encoding, and a Unicode string is passed to the _clean_word function.

KshitijKarthick commented 8 years ago

As of commit 9b918f0bc1463702e986e4b89af676de77202d09, _clean_word uses the character class [a-zA-Z] for blacklisting, which is specific to English.

This could lead to problems when a corpus in another language is used. We should look for another way to handle Unicode preprocessing of words.

KshitijKarthick commented 8 years ago

@prarthana-s In commit 1eaa75d574390a8b9000358161a386ffb279f886 the regular expression is updated to "((\p{P}+)|(\p{S}))" to remove punctuation and unnecessary symbols. Can you evaluate how this differs from "((\p{P}+)|(\p{S}+))"? You mentioned some problems and uses of the '+' in the regular expressions above. Can you put them down here?

prarthana-s commented 8 years ago

Punctuation can be repeated in a line, as in "Hello...", so we need to remove all occurrences of the period. Currency symbols usually occur only once, hence no need for +.

The problem was with the regex (\p{P}+[a-zA-Z]*). With something like "(Please pass that to me.", it gave back only " pass that to me", as the left bracket and the letters following it were both removed.
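
The difference between the two patterns can be reproduced with a short Python 3 sketch. The \p{P} and \p{S} classes are not supported by the standard `re` module (they need the third-party `regex` package), so this sketch emulates them by building character classes from `unicodedata`. That emulation, and the helper names `char_class` and `clean`, are assumptions for illustration and not the project's code.

```python
import re
import sys
import unicodedata


def char_class(prefix):
    """Regex character class of all code points whose Unicode general
    category starts with `prefix` ("P" = punctuation, "S" = symbol)."""
    chars = (
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)).startswith(prefix)
    )
    return "[" + "".join(re.escape(c) for c in chars) + "]"


P = char_class("P")  # stand-in for \p{P}
S = char_class("S")  # stand-in for \p{S}

# Earlier pattern: punctuation run plus any ASCII letters that follow it.
OLD = re.compile("(" + P + "+[a-zA-Z]*)")
# Later pattern: punctuation runs or single symbols only.
NEW = re.compile("((" + P + "+)|(" + S + "))")


def clean(pattern, text):
    """Replace matches with a space, then collapse whitespace."""
    return " ".join(pattern.sub(" ", text).split())
```

With these definitions, `clean(OLD, "(Please pass that to me.")` drops the word "Please" together with the bracket, while `clean(NEW, ...)` keeps it. `clean(NEW, "Hello...")` removes the whole run of periods in one match, which is what the + on \p{P} is for.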

KshitijKarthick commented 8 years ago

Correct Handling of Punctuations & Apostrophes

These are the scenarios that the preprocessor modules now resolve in a semantically correct way. If any other case needs to be covered, kindly reopen the issue.

Scenario: they're => they [ ASCII apostrophe ]
Scenario: they’re => they [ Unicode apostrophe ]
Scenario: 'hello' => hello , 'सफलता' => सफलता
Scenario: ice-cream => [ice, cream]
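
One way to get the behaviour in the scenarios above is sketched below in Python 3. It is an independent illustration and not the repository's _clean_word: the function name `clean_token` and the choice of Unicode general categories (via `unicodedata`) instead of a regex are assumptions. It avoids any language-specific class like [a-zA-Z], so the Hindi example works the same way as the English one.

```python
import unicodedata

# ASCII apostrophe and the Unicode apostrophe (U+2019).
APOSTROPHES = {"'", "\u2019"}


def is_punct_or_symbol(ch):
    """True for any character in a Unicode P* or S* general category."""
    return unicodedata.category(ch)[0] in ("P", "S")


def clean_token(token):
    """Return the list of clean words contained in one raw token."""
    # 1. Trim punctuation and symbols from both ends ('hello' -> hello).
    start, end = 0, len(token)
    while start < end and is_punct_or_symbol(token[start]):
        start += 1
    while end > start and is_punct_or_symbol(token[end - 1]):
        end -= 1
    core = token[start:end]

    # 2. Cut at the first inner apostrophe (they're -> they).
    for i, ch in enumerate(core):
        if ch in APOSTROPHES:
            core = core[:i]
            break

    # 3. Split on any remaining punctuation (ice-cream -> ice, cream).
    spaced = "".join(" " if is_punct_or_symbol(ch) else ch for ch in core)
    return spaced.split()
```

Each call returns a list, so a hyphenated token yields two words while the other scenarios yield a single-element list.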