INL / BlackLab

Linguistic search for large annotated text corpora, based on Apache Lucene
http://inl.github.io/BlackLab/
Apache License 2.0

containerPath and nested elements (WAS: offset problem...) #486

Open · craigberry opened this issue 9 months ago

craigberry commented 9 months ago

With a current checkout of the dev branch and Java 11:

$ git describe
v4-alpha2-34-g16ef16df
$ java -version
openjdk version "11.0.20.1" 2023-08-24
OpenJDK Runtime Environment Temurin-11.0.20.1+1 (build 11.0.20.1+1)
OpenJDK 64-Bit Server VM Temurin-11.0.20.1+1 (build 11.0.20.1+1, mixed mode)

and attempting to index this tiny TEI-like document boiled down from a much larger real-world example:

$ cat in/group_text_offset_bug.xml
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
 <text>
 <group>
   <text>
      <p>
         <w>test1</w>
         <w>tes2t</w>
      </p>
   </text>
  </group>
 </text>
</TEI>

with this simple input format configuration file:

$ cat offsetbug.blf.yaml
displayName: OffsetBug
description: Offset bug with grouped texts

# comment out this line to avoid IllegalArgumentException
processor: saxon

namespaces:
  '': http://www.tei-c.org/ns/1.0    # The default namespace
  ep: http://earlyprint.org/ns/1.0

documentPath: //TEI

annotatedFields:

  contents:

    containerPath: .//text

    wordPath: .//w

    annotations:

      # Text of the <w/> element contains the word form
    - name: word
      valuePath: .
      sensitivity: sensitive_insensitive

I observe the following crash:

$ BLPATH=~/repos/Blacklab/core/target/
$ CLASSPATH="$BLPATH/blacklab-4.0.0-SNAPSHOT.jar":"$BLPATH"/lib
$ java -cp "$CLASSPATH" nl.inl.blacklab.tools.IndexTool create out 'in/*.xml' offsetbug
Creating new index in out/ from in/*.xml (using format offsetbug)
08:29:23.704 [main] WARN  nl.inl.blacklab.index.DocumentFormats - Overwriting existing config format offsetbug with config-based input format 'offsetbug' (read from /Users/craig/bloffsetbug/offsetbug.blf.yaml).
08:29:23.710 [main] WARN  nl.inl.blacklab.search.BlackLabEngine - YOUR DOCUMENT IDs ARE NOT PERSISTENT! The input format offsetbug does not specify a persistent identifier (pid) field. This will work, but random ids will be assigned to your documents every time you index. So reindexing may assign totally different document ids, and any saved links to documents will break. To fix this, specify a pidField using the corpusConfig.specialFields.pidField setting of your input format configuration (.blf.yaml file).
08:29:23.826 [main] WARN  nl.inl.blacklab.search.indexmetadata.IndexMetadataAbstract - No titleField specified; using default fromInputFile. In future versions, no default will be chosen.
An error occurred during indexing!
error: java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=127,endOffset=139,lastStartOffset=149 for field 'contents%word@s' (in /Users/craig/bloffsetbug/in/group_text_offset_bug.xml)
nl.inl.blacklab.exceptions.BlackLabRuntimeException: java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=127,endOffset=139,lastStartOffset=149 for field 'contents%word@s'
    at nl.inl.blacklab.exceptions.BlackLabRuntimeException.wrap(BlackLabRuntimeException.java:16)
    at nl.inl.blacklab.indexers.config.DocIndexerBase.endDocument(DocIndexerBase.java:403)
    at nl.inl.blacklab.indexers.config.DocIndexerConfig.endDocument(DocIndexerConfig.java:722)
    at nl.inl.blacklab.indexers.config.DocIndexerXPath.indexDocument(DocIndexerXPath.java:397)
    at nl.inl.blacklab.indexers.config.DocIndexerXPath.lambda$indexParsedFile$12(DocIndexerXPath.java:525)
    at nl.inl.blacklab.indexers.config.saxon.XPathFinder.xpathForEach(XPathFinder.java:132)
    at nl.inl.blacklab.indexers.config.DocIndexerSaxon.xpathForEach(DocIndexerSaxon.java:157)
    at nl.inl.blacklab.indexers.config.DocIndexerSaxon.xpathForEach(DocIndexerSaxon.java:43)
    at nl.inl.blacklab.indexers.config.DocIndexerXPath.indexParsedFile(DocIndexerXPath.java:520)
    at nl.inl.blacklab.indexers.config.DocIndexerSaxon.index(DocIndexerSaxon.java:188)
    at nl.inl.blacklab.index.IndexerImpl$DocIndexerWrapper.impl(IndexerImpl.java:115)
    at nl.inl.blacklab.index.IndexerImpl$DocIndexerWrapper.file(IndexerImpl.java:84)
    at nl.inl.util.FileProcessor.lambda$processFile$7(FileProcessor.java:489)
    at nl.inl.util.FileProcessor.lambda$makeRunnable$9(FileProcessor.java:586)
    at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=127,endOffset=139,lastStartOffset=149 for field 'contents%word@s'
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:955)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:527)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:491)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:208)
    at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:415)
    at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1471)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1757)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1400)
    at nl.inl.blacklab.index.BLIndexWriterProxyLucene.addDocument(BLIndexWriterProxyLucene.java:24)
    at nl.inl.blacklab.search.BlackLabIndexWriter.addDocument(BlackLabIndexWriter.java:84)
    at nl.inl.blacklab.index.IndexerImpl.add(IndexerImpl.java:410)
    at nl.inl.blacklab.indexers.config.DocIndexerBase.endDocument(DocIndexerBase.java:401)
    ... 16 more
Saving index, please wait...
0 docs (0 B, 0 tokens); avg. 0.0k tok/s (0.0 MB/s); currently 0.0k tok/s (0.0 MB/s); 513 ms elapsed
Done. Elapsed time: 0 seconds
Finished!

If the document doesn't have nested <text> elements, the crash goes away. It also goes away if I don't use the saxon processor or if I have a document with only one <w> element rather than two.

I don't know anything about how BlackLab uses offsets, but I infer that sometimes the offset gets calculated relative to the nearest ancestor <text> element and sometimes relative to the great-grandparent, and the two don't match.

I'm attaching a zip file containing the reproducer files quoted above.

bloffsetbug.zip

craigberry commented 9 months ago

The workaround, or fix, depending on your point of view, is to replace this line in the configuration:

    containerPath: .//text

with this:

    containerPath: ./text

I assume this works because everything is now done relative to the top-level <text> element immediately under <TEI>, instead of sometimes relative to the top-level one and sometimes to the nested one, so there are no longer any offset mismatches.
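To make the difference concrete, here is a minimal standalone JAXP sketch, not BlackLab code; the class name and the local-name() workaround for the TEI default namespace are just for illustration (the .blf.yaml handles the namespace via its namespaces section), and it uses the JDK's built-in XPath 1.0 engine rather than Saxon. It counts how many <text> elements each container path matches in the sample document:

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class ContainerPathDemo {
    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder()
                .parse(new File("in/group_text_offset_bug.xml"));
        Element tei = doc.getDocumentElement(); // the <TEI> element, i.e. the documentPath match

        XPath xp = XPathFactory.newInstance().newXPath();
        // local-name() sidesteps the TEI default namespace here; the .blf.yaml
        // declares the namespace in its namespaces section instead.
        NodeList direct = (NodeList) xp.evaluate(
                "./*[local-name()='text']", tei, XPathConstants.NODESET);
        NodeList anyDepth = (NodeList) xp.evaluate(
                ".//*[local-name()='text']", tei, XPathConstants.NODESET);
        System.out.println("./text  matches: " + direct.getLength());   // 1 (only the outer <text>)
        System.out.println(".//text matches: " + anyDepth.getLength()); // 2 (outer and nested <text>)
    }
}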

In versions of BlackLab prior to 3.x, we simply had containerPath: text, with no punctuation indicating an XPath expression, and that is still what all the containerPath examples in the documentation show. But in BlackLab 3.x and later that silently produces an empty index. It took a lot of guessing to figure out what should replace it, and I apparently made an unlucky guess.

jan-niestadt commented 9 months ago

Yes, that's a good way to address this issue for now. With .//text, what happens is that it finds all <w/> tags twice, because it looks for matching word tags for each match of the containerPath separately.

Arguably we should concatenate containerPath and wordPath and find matches for the combined expression, which would eliminate this problem. I'll think about whether that would break anything, and implement it if not.
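A minimal standalone sketch of the two evaluation strategies against the sample document above (plain JAXP/XPath 1.0 rather than Saxon and BlackLab's actual indexing code; the class name and the local-name() workaround are just for illustration): evaluating the word path once per container match yields four word hits for the two <w> elements, while a single concatenated expression matches each <w> exactly once, because an XPath node-set contains each node at most once.

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class NestedWordsDemo {
    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder()
                .parse(new File("in/group_text_offset_bug.xml"));
        Element tei = doc.getDocumentElement();
        XPath xp = XPathFactory.newInstance().newXPath();

        // Per-container evaluation, roughly what containerPath + wordPath does:
        // the word path is re-run for every container match, so each of the two
        // <w> elements is found twice (once via the outer <text>, once via the inner).
        NodeList containers = (NodeList) xp.evaluate(
                ".//*[local-name()='text']", tei, XPathConstants.NODESET);
        int wordHits = 0;
        for (int i = 0; i < containers.getLength(); i++) {
            NodeList words = (NodeList) xp.evaluate(
                    ".//*[local-name()='w']", containers.item(i), XPathConstants.NODESET);
            wordHits += words.getLength();
        }
        System.out.println("word hits, per-container evaluation: " + wordHits); // 4

        // One concatenated expression instead: an XPath node-set contains each
        // node at most once, so every <w> is matched exactly once.
        NodeList combined = (NodeList) xp.evaluate(
                ".//*[local-name()='text']//*[local-name()='w']", tei, XPathConstants.NODESET);
        System.out.println("word hits, concatenated path:        " + combined.getLength()); // 2
    }
}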

I also tried your example with containerPath: text, and as far as I can see it does the same as containerPath: ./text: no error messages, and the resulting index contains two words. So I'm not sure how to reproduce the problem you describe as having been introduced since v3.x. Do you have a similar example where it fails?

craigberry commented 9 months ago

Thanks for the reply. I can't reproduce the problem now with containerPath: text. I must have had something else wrong with my configuration and fixed it at the same time I changed containerPath. Sorry for the false alarm.