fix for issue #16 penntreebank tokenizer is not implemented correctly

angeloskath commented 11 years ago

Since it still has some errors, I will try to find a large corpus and make a test to base the development on.

The above errors (the ones I commented in the diff) are due to sed and pcre differences. That code would change every occurence of ?,!,[,... to &.

yooper commented 11 years ago

I read through the sed implementation and the nltk implementation and they differ. For example,:

in the sed version, $this->addPatternAndReplacement("([,;:@#$%&])", " $1 ") would be $this->addPatternAndReplacement("([,;:@#$%&])", " & ")
in the nltk version, $this->addPatternAndReplacement("([,;:@#$%&])", " $1 ") would stay the same.

I do not think we have to strictly adhere to the sed version. I would lean more towards the nltk version, since more people use that library.

On Wed, Sep 25, 2013 at 6:15 AM, Angelos Katharopoulos < notifications@github.com> wrote:

Since it still has some errors, I will try to find a large corpus and make a test to base the development on.

The above errors (the ones I commented in the diff) are due to sed and pcre differences. That code would change every occurence of ?,!,[,... to &.

— Reply to this email directly or view it on GitHubhttps://github.com/angeloskath/php-nlp-tools/pull/17#issuecomment-25075265 .

angeloskath commented 11 years ago

It is not about which version we should adhere to. This does not adhere to any version. I mentioned in the previous comment that the error is due to sed and pcre differences. What this means is that using & in the replacement text in sed references the whole match. In pcre the equivalent is $0 and the equivalent in python (used by nltk) is \g<0>.

The effect of those two lines is that they replace any occurrence of !,?,[,],.... to & (as I have mentioned previously).

angeloskath commented 11 years ago

I created this gist which you can use to test the implementation. There seem to be more errors than just the one I pointed out above. The translation of " to `` and '' has some problems.

Although unit tests should not test the implementation as a black box and it would be fairly easy to write tests that check for correctness of each regex if that is not possible you could use this data set to write a test that checks for correctness. It is obvious that the count of tokens of a couple of sentences is not enough of a check.

In addition to the above, I have also spotted another problem. When an empty string is passed to the tokenizer it falsely reports that the regex failed and it throws an exception. The check should change from !$tmp to $tmp===null.

I have noticed one difference among the guide implementation in sed and nltk's . The original implementation (and ours) splits the thousands separator while nltk doesn't. e.g.: "$300,000" -> "$ 300 , 000" while in nltk it becomes "$ 300,000"

yooper commented 11 years ago

Thanks for putting this together! I will use the gist for purposes of testing.

On Wed, Sep 25, 2013 at 10:36 AM, Angelos Katharopoulos < notifications@github.com> wrote:

I created this gisthttps://gist.github.com/angeloskath/76a40f2b2565c273385ewhich you can use to test the implementation. There seem to be more errors than just the one I pointed out above. The translation of " to `` and '' has some problems.

Although unit tests should not test the implementation as a black box and it would be fairly easy to write tests that check for correctness of each regex if that is not possible you could use this data set to write a test that checks for correctness. It is obvious that the count of tokens of a couple of sentences is not enough of a check.

In addition to the above, I have also spotted another problem. When an empty string is passed to the tokenizer it falsely reports that the regex failed and it throws an exception. The check should change from _!$tmpto $tmp===null_.

I have noticed one difference among the guide implementation in sed and nltk's . The original implementation (and ours) splits the thousands separator while nltk doesn't. e.g.: "$300,000" -> "$ 300 , 000" while in nltk it becomes "$ 300,000"

— Reply to this email directly or view it on GitHubhttps://github.com/angeloskath/php-nlp-tools/pull/17#issuecomment-25091032 .

yooper commented 11 years ago

I did some analysis on the data you provided. How do you want to penntreebank tokenizer to behave? Do we want it behave like nltk or the sed script? Or neither? I do not want to spend too much more time on this issue, unless a specific bug is found. Lets not get lost in the weeds. There are other tasks to work on like documentation and features.

On Wed, Sep 25, 2013 at 11:26 AM, Dan Cardin dcardin2007@gmail.com wrote:

Thanks for putting this together! I will use the gist for purposes of testing.

On Wed, Sep 25, 2013 at 10:36 AM, Angelos Katharopoulos < notifications@github.com> wrote:

I created this gisthttps://gist.github.com/angeloskath/76a40f2b2565c273385ewhich you can use to test the implementation. There seem to be more errors than just the one I pointed out above. The translation of " to `` and '' has some problems.

Although unit tests should not test the implementation as a black box and it would be fairly easy to write tests that check for correctness of each regex if that is not possible you could use this data set to write a test that checks for correctness. It is obvious that the count of tokens of a couple of sentences is not enough of a check.

In addition to the above, I have also spotted another problem. When an empty string is passed to the tokenizer it falsely reports that the regex failed and it throws an exception. The check should change from _!$tmpto $tmp===null_.

I have noticed one difference among the guide implementation in sed and nltk's . The original implementation (and ours) splits the thousands separator while nltk doesn't. e.g.: "$300,000" -> "$ 300 , 000" while in nltk it becomes "$ 300,000"

— Reply to this email directly or view it on GitHubhttps://github.com/angeloskath/php-nlp-tools/pull/17#issuecomment-25091032 .

yooper commented 11 years ago

My updated PR is in

angeloskath commented 11 years ago

The tokenizer works correctly now (exactly the same output with the original sed implementation). There is no reason to mark the test as incomplete. I would change the test to the code in this gist. Notice also the change in the class name (there was a "Tree" missing).

I understand your wanting to work on other features or not wanting to debug something over and over. That doesn't mean that we have to do moderately good work on anything. If you don't feel like working on something, don't. I will get to it when I have time. Work on something you like and find interesting.

P.S.: This test ensures that the tokenizer works correctly. The one you submitted was not correct in many ways.

yooper commented 11 years ago

My pull request is ready to be merged.

angeloskath commented 11 years ago

This PR is merged yet I rebased it to merge it to both the master as a hotfix and the develop branch and github doesn't accept it as merged.

angeloskath / php-nlp-tools

fix for issue #16 penntreebank tokenizer is not implemented correctly #17