Open dhowe opened 7 years ago
Here is problematic case in our untokenize tests:
a. The student said 'learning is fun' b. We should consider the students' learning
In each case, when we get to the single-quote, we don't know whether it is the start of a new quotation or an apostrophe at the end of the previous word
@cqx931 any thoughts?
When I test RiTa.tokenize(), student’s
stays as one token while 'students'' becomes two.
But if student’s
is regarded as one token, shouldn't 'students'' also stays as one? If this is ok, then it would be an issue for tokenize(), rather than untokenize().
If we do think 'students'' should be broke into two tokens, then are words ending in 's' the only case here? (I can't come up with more cases... ) If it is the only case we could check whether the token before is ending in 's' to differentiate case a and b.
This is a good question. I think punctuation that is internal to a word (e.g., student's) should be a single token, while punctuation at end of word (e.g., students' or students" or students, or students:) should be two. One reason is that we don't want the 's' character in "student's" to be a separate token... though we also need to consider the SPLIT_CONTRACTIONS flag.
In any case @cqx931, can you check/fix the known-issues test for tokenize() in RiTaJS? then lets discuss
First fixed the issue in tokenize(): https://github.com/dhowe/RiTaJS/pull/88/commits/cfa2e90eb856a3d946b29e0f771f7a925d538b07
Please check: https://github.com/dhowe/RiTaJS/pull/89 Fixed the rest issues with untokenize() and added other curly quotes to tokenize().
Excellent.
We still have this failing case:
var expected = "The student said 'learning is fun'";
var input = ["The", "student", "said", "'", "learning", "is", "fun", "'"];
var output = RiTa.untokenize(input);
deepEqual(output, expected);
But lets leave that alone for now.
Next, please 1) update the RiTa java tests to match those in JS, then 2) update RiTa.tokenize()/untokenize() functions in java to match the behavior of RiTaJS... you may want to simply rewrite those functions as ports from JS, rather than trying to adapt what's there now
Updated Java and fixed the new failing case.
Please check https://github.com/dhowe/RiTa/pull/480 and https://github.com/dhowe/RiTaJS/pull/90
Can this be closed? @dhowe
Do we have this case in KnownISsues for both platforms?
var expected = "The student said 'learning is fun'";
var input = ["The", "student", "said", "'", "learning", "is", "fun", "'"];
var output = RiTa.untokenize(input);
deepEqual(output, expected);
This test case is already solved for both platforms (in RiTaTest).
So we have both of these arrays as inputs to untokenize() tests (which do pass):
Pls explain how the algorithm groups the first single-quote in 1 with the previous word (students'), but does the opposite in 2, and groups it with the subsequent word ('learning)?
@cqx931 ?
What do you mean by grouping here? Isn't both in case 1 and case 2, that the single quote is considered as a single token?
Yes, but we are talking about RiTa.untokenize here, so the output is a single string.
@kennyviperhk Please verify that both of the tests below pass (in both RiTa and RiTaJS) for RiTa.untokenize:
expected = "We should consider the students' learning";
input = ["We", "should", "consider", "the", "students", "'", "learning" ]
var output = RiTa.untokenize(input);
equal(output, expected);
expected = "The student said 'learning is fun'";
input = ["The", "student", "said", "'", "learning", "is", "fun", "'"]
var output = RiTa.untokenize(input);
equal(output, expected);
please see my fix (in RiTaJS b1e70577a8fc5c86) and apply the same fix to the failing test in RiTa-java