computerline1z / okapi

Automatically exported from code.google.com/p/okapi

Ratel: Segmentation not possible when inline codes present #445

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

Segment the following Text in Ratel:
This is the first sentence.<x0/>This is the second sentence.

Using the following rules:
Before break: \.
After break: (<\w\d+/?>)

What is the expected output?
[This is the first sentence.][<x0/>This is the second sentence.]

What do you see instead?
[This is the first sentence.<x0/>This is the second sentence.]

So the break rule does not seem to work when the <x0/> tag is present. The 
regexp itself works (I checked it in other tools), but segmentation seems to fail.
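
For reference, the rule pair does match when the tag is left in the string as literal text. This is a minimal Python sketch, assuming a break is placed wherever the "before break" pattern ends and the "after break" pattern starts at the same offset. The `break_positions` helper is hypothetical and is not Ratel or Okapi code:

```python
import re

def break_positions(text, before, after):
    """Offsets where `before` matches ending at the offset and `after` matches starting there."""
    # Anchor `before` to the end of the searched range so it must end exactly at offset i.
    before_re = re.compile("(?:" + before + r")\Z")
    after_re = re.compile(after)
    positions = []
    for i in range(1, len(text)):
        # search(text, 0, i) treats the string as if it were only i characters long.
        if before_re.search(text, 0, i) and after_re.match(text, i):
            positions.append(i)
    return positions

text = "This is the first sentence.<x0/>This is the second sentence."
cut = break_positions(text, r"\.", r"(<\w\d+/?>)")
print(cut)
print([text[:cut[0]], text[cut[0]:]])
```

On the raw string this finds a single break between the period and the tag, which is the expected output above.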

What version of the product are you using? On what operating system?
0.26

Original issue reported on code.google.com by m...@sebastianebert.com on 23 Feb 2015 at 3:04

GoogleCodeExporter commented 9 years ago
I believe the segmentation library strips codes from the content before 
applying the rules, then reinserts them after segmentation.

Original comment by tingley on 24 Feb 2015 at 8:05

GoogleCodeExporter commented 9 years ago
I think you might be right. I tested the following example:

First sentence.<x0/>Second sentence.

Before break: \.
After break: \s

This results in:
[First sentence.<x0/>Second sentence.]

However, if I put a space either before or after the tag, segmentation works:
[First sentence.][ <x0/>Second sentence.]

So I would propose not stripping the tags and reinserting them. In my case the 
source is an IDML file and <x0/> represents a line break. With the current 
behaviour it is not possible to get adequate segmentation: I get huge 
segments (whole paragraphs) consisting of multiple sentences. Is there any 
chance of changing this?

Original comment by m...@sebastianebert.com on 24 Feb 2015 at 8:33

GoogleCodeExporter commented 9 years ago
I think this change is unlikely to be made, since I believe that something 
close to the opposite change has previously been made to arrive at the current 
behavior (see Issue 169).

The real issue is SRX itself, which doesn't actually specify a method for 
matching against an inline code.  Your regex, which matches literal text such 
as "<x0/>", won't match real codes if it were used as part of a segmentation 
step in a processing pipeline.  SRX is a broken standard, basically.
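
To illustrate the point about literal text versus real codes, here is a small Python sketch. It assumes, purely for illustration, that an inline code is held internally as a single placeholder character rather than as markup. The `PLACEHOLDER` value is hypothetical and is not taken from the Okapi source:

```python
import re

PLACEHOLDER = "\uE000"  # hypothetical stand-in for an inline code object

# What the user sees in Ratel versus what a rule might actually be run against.
display_form = "This is the first sentence.<x0/>This is the second sentence."
internal_form = display_form.replace("<x0/>", PLACEHOLDER)

after_break = re.compile(r"<\w\d+/?>")
matches_display = after_break.search(display_form) is not None
matches_internal = after_break.search(internal_form) is not None
print(matches_display, matches_internal)
```

Under that assumption the pattern matches only the display form, so a rule written against "<x0/>" has nothing to match once the code is no longer literal text.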

I noticed, however, that if I try these rules, I get the result you want:
Before break: \.
After break: 

i.e., the "after break" rule is the empty string.  This produces 
[This is the first sentence.][<x0/>This is the second sentence.]

for me.
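
The results reported in this thread are consistent with a strip, segment, reinsert model. The Python sketch below is only an approximation of that idea, not the actual library code. The `strip_codes` and `segment` helpers are hypothetical, and the choice to put a code that sits exactly on a break into the following segment is an assumption made to match the output above:

```python
import re

CODE_RE = re.compile(r"<\w\d+/?>")  # literal form of the inline codes in this issue

def strip_codes(text):
    """Remove inline codes; return the plain text and (plain_offset, code) pairs."""
    plain, codes, last, plain_len = [], [], 0, 0
    for m in CODE_RE.finditer(text):
        chunk = text[last:m.start()]
        plain.append(chunk)
        plain_len += len(chunk)
        codes.append((plain_len, m.group()))
        last = m.end()
    plain.append(text[last:])
    return "".join(plain), codes

def segment(text, before, after):
    plain, codes = strip_codes(text)
    # The rules only ever see `plain`, so a pattern for "<x0/>" has nothing to match.
    # Simplification: `before` matches are taken non-overlapping, left to right.
    rule = re.compile("(?:" + before + ")(?=" + after + ")")
    breaks = [m.end() for m in rule.finditer(plain) if 0 < m.end() < len(plain)]
    bounds = [0] + breaks + [len(plain)]
    segments = []
    for start, end in zip(bounds, bounds[1:]):
        piece = plain[start:end]
        is_last = end == len(plain)
        # Reinsert from the highest offset down so earlier offsets stay valid.
        # Assumption: a code exactly on a break goes to the following segment.
        for off, code in reversed(codes):
            if start <= off < end or (is_last and off == end):
                piece = piece[:off - start] + code + piece[off - start:]
        segments.append(piece)
    return segments

text = "This is the first sentence.<x0/>This is the second sentence."
print(segment(text, r"\.", r"(<\w\d+/?>)"))  # one segment, as originally reported
print(segment(text, r"\.", ""))               # two segments, the code leads the second
```

With the tag stripped first, the period is followed directly by "This", so neither the tag pattern nor `\s` can match after it, while an empty "after break" pattern matches at every period.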

Original comment by tingley on 27 Feb 2015 at 6:30