computerline1z / okapi

Automatically exported from code.google.com/p/okapi

SRXSegmenter does not handle parts covered by previous match #426

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
See https://groups.yahoo.com/neo/groups/okapitools/conversations/topics/4478 
for details.

> The second rule really means „.“, i.e. always break.

Then, yes, I think there is a problem in the code: we don’t check the rule on 
the parts of the text included in the previous match.
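For readers unfamiliar with the underlying behavior: `java.util.regex.Matcher.find()` resumes each search at the end of the previous match, so a second match that begins inside the first is never reported. The sketch below only illustrates that behavior and is not Okapi code; the class name `FindSkipsOverlap`, the helper `countMatches`, the rule `(\.)(.)` and the sample strings are all made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Okapi.
class FindSkipsOverlap {

    // Counts matches with plain find(), which continues searching
    // from the end of the previous match.
    static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Assumed rule: a period (group 1) followed by any character (group 2).
        String rule = "(\\.)(.)";
        // In "a..b" the match ".." covers the second period, so the
        // candidate ".b" starting inside that match is never tested.
        System.out.println(countMatches(rule, "a..b"));
        // In "a.b.c" the two candidates do not overlap, so both are found.
        System.out.println(countMatches(rule, "a.b.c"));
    }
}
```

With the overlapping input only one match is reported, even though two positions satisfy the rule; that is the situation described above.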

Changing the code in SRXSegmenter.java from this:

m = rule.pattern.matcher(codedText);
while ( m.find() ) {
    int n = m.start()+m.group(1).length();
    if ( n > codedText.length() ) continue;

To this:

m = rule.pattern.matcher(codedText);
int start = 0;
while ( m.find(start) ) {
    int n = m.start()+m.group(1).length();
    start++;
    if ( n > codedText.length() ) continue;

Should resolve this.

But there is a side effect in the Aligner step tests.

Original issue reported on code.google.com by yves.sav...@gmail.com on 5 Dec 2014 at 4:07

GoogleCodeExporter commented 9 years ago
The Aligner tests showed an issue with the first solution (e.g. for a pattern
like "1.2.3. "). A better one:

int start = 0;
int prevStart = -1;
while (( start != prevStart ) && m.find(start) ) {
    int n = m.start()+m.group(1).length();
    // Set next start
    prevStart = start;
    start = n;
...

It passes all existing tests and additional ones.
I'll push this soon.
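As a rough standalone illustration of the restart loop (not the actual SRXSegmenter code, whose remaining body is elided above), the sketch below collects the break positions it would visit. The class `RestartLoopDemo`, the method `breakPositions`, the rule `(\.)(.)` and the sample text are assumptions made for the example only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Okapi.
class RestartLoopDemo {

    // Collects candidate break positions, restarting each search at the
    // previous break position (which may lie inside the previous match).
    static List<Integer> breakPositions(Pattern rule, String text) {
        List<Integer> breaks = new ArrayList<>();
        Matcher m = rule.matcher(text);
        int start = 0;
        int prevStart = -1;
        // Stop when the search position no longer advances, or no match is left.
        while (start != prevStart && m.find(start)) {
            // Break position: end of group 1 (the "before break" part).
            int n = m.start() + m.group(1).length();
            breaks.add(n);
            // Set next start
            prevStart = start;
            start = n;
        }
        return breaks;
    }

    public static void main(String[] args) {
        // Assumed rule: a period (group 1) followed by any character (group 2).
        Pattern rule = Pattern.compile("(\\.)(.)");
        System.out.println(breakPositions(rule, "a..b"));
    }
}
```

Because `n` is never smaller than `start`, the search position can only move forward or stay put, and the `start != prevStart` guard ends the loop in the latter case (for example when group 1 is empty).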

Original comment by yves.sav...@gmail.com on 7 Dec 2014 at 4:37

GoogleCodeExporter commented 9 years ago
This issue was closed by revision af5c6a381dcc.

Original comment by yves.sav...@gmail.com on 7 Dec 2014 at 4:39