Open GoogleCodeExporter opened 9 years ago
I believe the segmentation library strips codes from the content before
applying the rules, then reinserts them after segmentation.
Original comment by tingley
on 24 Feb 2015 at 8:05
I think you might be right. I tested the following example:
First sentence.<x0/>Second sentence.
Before break: \.
After break: \s
This results in:
[First sentence.<x0/>Second sentence.]
However, if I put a space either before or after the tag, segmentation works:
[First sentence.][ <x0/>Second sentence.]
So I would propose not stripping and reinserting the tags. In my case the
source is an IDML file and <x0/> represents a line break. With the current
behaviour an adequate segmentation is not possible: I get huge
segments (whole paragraphs) consisting of multiple sentences. Is there any
chance of changing this?
Original comment by m...@sebastianebert.com
on 24 Feb 2015 at 8:33
I think this change is unlikely to be made, since I believe that something
close to the opposite change has previously been made to arrive at the current
behavior (see Issue 169).
The real issue is SRX itself, which doesn't actually specify a method for
matching against an inline code. Your regex, which matches the literal text
"<w0>", wouldn't match real codes if it were used as part of a segmentation
step in a processing pipeline. SRX is a broken standard, basically.
I noticed, however, that if I try these rules, I get the result you want:
Before break: \.
After break:
i.e., the "after break" rule is the empty string. This produces
[This is the first sentence.][<x0/>This is the second sentence.]
for me.
Original comment by tingley
on 27 Feb 2015 at 6:30
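The behaviour described in this thread can be reproduced with a minimal Python sketch. It is not the library's actual code: the names `CODE`, `strip_codes`, `break_positions` and `segment` are hypothetical, and the placeholder pattern and the choice to attach a code sitting exactly on a break to the following segment are assumptions made to match the results reported above.

```python
import re

# Assumed placeholder syntax for inline codes such as <x0/> (hypothetical).
CODE = re.compile(r"<[a-z]\d+/>")


def strip_codes(text):
    """Return the text without codes, plus (plain_offset, code) pairs."""
    plain, codes, last, plain_len = [], [], 0, 0
    for m in CODE.finditer(text):
        chunk = text[last:m.start()]
        plain.append(chunk)
        plain_len += len(chunk)
        codes.append((plain_len, m.group()))
        last = m.end()
    plain.append(text[last:])
    return "".join(plain), codes


def break_positions(plain, before, after):
    """Offsets where `before` ends and `after` matches what follows.

    Simplification: only non-overlapping matches of `before` are considered.
    """
    after_re = re.compile(after)
    positions = []
    for m in re.finditer(before, plain):
        i = m.end()
        if 0 < i < len(plain) and after_re.match(plain, i):
            positions.append(i)
    return positions


def segment(text, before, after):
    """Strip codes, apply the rule to the plain text, then map breaks back."""
    plain, codes = strip_codes(text)
    segments, start = [], 0
    for p in break_positions(plain, before, after):
        # Codes strictly before the break stay in the earlier segment;
        # a code exactly at the break goes to the next one (assumption).
        shift = sum(len(code) for offset, code in codes if offset < p)
        end = p + shift
        segments.append(text[start:end])
        start = end
    segments.append(text[start:])
    return segments


no_space = "First sentence.<x0/>Second sentence."
with_space = "First sentence. <x0/>Second sentence."

print(segment(no_space, r"\.", r"\s"))    # one segment: no whitespace after "."
print(segment(with_space, r"\.", r"\s"))  # two segments
print(segment(no_space, r"\.", r""))      # two segments: empty "after break"
```

Because the rules only ever see the stripped text, `\s` has nothing to match when the code is the only thing separating the two sentences, while an empty "after break" pattern matches at any position.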
Original issue reported on code.google.com by
m...@sebastianebert.com
on 23 Feb 2015 at 3:04