Open craigberry opened 9 months ago
The workaround, or fix, depending on your point of view, is to replace this line in the configuration:
containerPath: .//text
with this:
containerPath: ./text
I assume this works because it now does everything relative to the top-level <text> element immediately under <TEI>, instead of sometimes the top-level and sometimes the nested one, so there are no longer any offset mismatches.
In versions of BlackLab prior to 3.x, we simply had containerPath: text, without any punctuation indicating an XPath specification, and that is still what's in all the examples for containerPath in the documentation. But in BlackLab 3.x and later that silently produces an empty index. It took a lot of guessing to figure out what should replace it, and I apparently made an unlucky guess.
Yes, that's a good way to address this issue for now. With .//text, what happens is that it finds all <w/> tags twice, because it looks for matching word tags for each match of the containerPath separately.
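The duplication described above can be sketched outside BlackLab. This is a minimal illustration using Python's standard ElementTree, not BlackLab's actual indexing code; the sample document and the helper `words_per_container` are hypothetical and are not the attached reproducer.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-like document with a nested <text> (not the attached reproducer).
DOC = """
<TEI>
  <text>
    <group>
      <text>
        <body><w>hello</w><w>world</w></body>
      </text>
    </group>
  </text>
</TEI>
"""

def words_per_container(root, container_path, word_path=".//w"):
    """Collect word matches separately for every container match."""
    found = []
    for container in root.findall(container_path):
        found.extend(w.text for w in container.findall(word_path))
    return found

root = ET.fromstring(DOC)

# ./text matches only the top-level <text>, so each word is found once.
print(words_per_container(root, "./text"))

# .//text matches the outer and the nested <text>, so each word is found twice.
print(words_per_container(root, ".//text"))
```

With one container match the two words are each collected once; with two container matches the same `<w>` elements are collected once per container.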
Arguably we should concatenate containerPath and wordPath and find matches for that single expression, which would eliminate this problem. I'll check whether that breaks anything I can think of, and implement it if not.
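Why a single combined expression would help: an XPath expression evaluates to a node set, in which each element appears at most once, however many containers it sits inside. The sketch below emulates that in Python. ElementTree implements only a subset of XPath, so the de-duplication is done explicitly here. The function `words_combined` and the sample document are hypothetical, and this is not how BlackLab implements it.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-like document with a nested <text> (not the attached reproducer).
DOC = """
<TEI>
  <text>
    <group>
      <text>
        <body><w>hello</w><w>world</w></body>
      </text>
    </group>
  </text>
</TEI>
"""

def words_combined(root, container_path, word_path=".//w"):
    """Emulate one combined expression: every matching element is kept once."""
    seen = set()      # ids of elements already collected
    unique = []       # elements in first-seen (document) order
    for container in root.findall(container_path):
        for w in container.findall(word_path):
            if id(w) not in seen:
                seen.add(id(w))
                unique.append(w)
    return [w.text for w in unique]

root = ET.fromstring(DOC)

# Even with the nested-matching container path, each word appears once.
print(words_combined(root, ".//text"))
```

Under this assumption the result no longer depends on how many `<text>` ancestors a word has, which is the property the concatenation is meant to provide.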
I also tried your example with containerPath: text and that does the same as containerPath: ./text as far as I can see; no error messages and the resulting index contains two words. So I'm not sure how to reproduce the problem you describe as introduced in v3.x. Do you have a similar example where it fails?
Thanks for the reply. I can't reproduce the problem now with containerPath: text. I must have had something else wrong with my configuration and changed that along with containerPath to get past that. Sorry for the false alarm.
With a current checkout of the dev branch and Java 11:
and attempting to index this tiny TEI-like document boiled down from a much larger real-world example:
with this simple input format configuration file:
I observe the following crash:
If the document doesn't have nested <text> elements, the crash goes away. It also goes away if I don't use the saxon processor or if I have a document with only one <w> element rather than two.
I don't know anything about how BlackLab uses offsets, but I infer that some of the time the offset gets calculated relative to the nearest ancestor <text> element and sometimes relative to the great-grandparent, and the two don't match.
I'm attaching a zip file containing the reproducer files quoted above.
bloffsetbug.zip