Rita.untokenize incorrectly handling apostrophe's at end of word

dhowe / RiTaV1

RiTa: the generative language toolkit

http://rednoise.org/rita

GNU General Public License v3.0

354 stars 78 forks source link

Rita.untokenize incorrectly handling apostrophe's at end of word #477

Open dhowe opened 7 years ago

dhowe commented 7 years ago

please see my fix (in RiTaJS b1e70577a8fc5c86) and apply the same fix to the failing test in RiTa-java

dhowe commented 7 years ago

Here is problematic case in our untokenize tests:

a. The student said 'learning is fun' b. We should consider the students' learning

In each case, when we get to the single-quote, we don't know whether it is the start of a new quotation or an apostrophe at the end of the previous word

@cqx931 any thoughts?

cqx931 commented 7 years ago

When I test RiTa.tokenize(), student’s stays as one token while 'students'' becomes two. But if student’s is regarded as one token, shouldn't 'students'' also stays as one? If this is ok, then it would be an issue for tokenize(), rather than untokenize().

If we do think 'students'' should be broke into two tokens, then are words ending in 's' the only case here? (I can't come up with more cases... ) If it is the only case we could check whether the token before is ending in 's' to differentiate case a and b.

dhowe commented 7 years ago

This is a good question. I think punctuation that is internal to a word (e.g., student's) should be a single token, while punctuation at end of word (e.g., students' or students" or students, or students:) should be two. One reason is that we don't want the 's' character in "student's" to be a separate token... though we also need to consider the SPLIT_CONTRACTIONS flag.

In any case @cqx931, can you check/fix the known-issues test for tokenize() in RiTaJS? then lets discuss

cqx931 commented 7 years ago

First fixed the issue in tokenize(): https://github.com/dhowe/RiTaJS/pull/88/commits/cfa2e90eb856a3d946b29e0f771f7a925d538b07

cqx931 commented 7 years ago

Please check: https://github.com/dhowe/RiTaJS/pull/89 Fixed the rest issues with untokenize() and added other curly quotes to tokenize().

dhowe commented 7 years ago

Excellent.

We still have this failing case:

      var expected = "The student said 'learning is fun'";
      var input = ["The", "student", "said", "'", "learning", "is", "fun", "'"];
      var output = RiTa.untokenize(input);
      deepEqual(output, expected);

But lets leave that alone for now.

Next, please 1) update the RiTa java tests to match those in JS, then 2) update RiTa.tokenize()/untokenize() functions in java to match the behavior of RiTaJS... you may want to simply rewrite those functions as ports from JS, rather than trying to adapt what's there now

cqx931 commented 7 years ago

Updated Java and fixed the new failing case.

Please check https://github.com/dhowe/RiTa/pull/480 and https://github.com/dhowe/RiTaJS/pull/90

cqx931 commented 7 years ago

Can this be closed? @dhowe

dhowe commented 7 years ago

Do we have this case in KnownISsues for both platforms?

      var expected = "The student said 'learning is fun'";
      var input = ["The", "student", "said", "'", "learning", "is", "fun", "'"];
      var output = RiTa.untokenize(input);
      deepEqual(output, expected);

cqx931 commented 7 years ago

This test case is already solved for both platforms (in RiTaTest).

dhowe commented 7 years ago

So we have both of these arrays as inputs to untokenize() tests (which do pass):

["We", "should", "consider", "the", "students", "'", "learning" ]
["The", "student", "said", "'", "learning", "is", "fun", "'"]

Pls explain how the algorithm groups the first single-quote in 1 with the previous word (students'), but does the opposite in 2, and groups it with the subsequent word ('learning)?

dhowe commented 6 years ago

@cqx931 ?

cqx931 commented 6 years ago

What do you mean by grouping here? Isn't both in case 1 and case 2, that the single quote is considered as a single token?

dhowe commented 6 years ago

Yes, but we are talking about RiTa.untokenize here, so the output is a single string.

dhowe commented 4 years ago

@kennyviperhk Please verify that both of the tests below pass (in both RiTa and RiTaJS) for RiTa.untokenize:

    expected = "We should consider the students' learning";
    input = ["We", "should", "consider", "the", "students", "'", "learning" ]
    var output = RiTa.untokenize(input);
    equal(output, expected);

    expected = "The student said 'learning is fun'";
    input = ["The", "student", "said", "'", "learning", "is", "fun", "'"]
    var output = RiTa.untokenize(input);
    equal(output, expected);