INL / BlackLab

Linguistic search for large annotated text corpora, based on Apache Lucene
http://inl.github.io/BlackLab/
Apache License 2.0

Switch default XML parser to Saxon? #320

Open jan-niestadt opened 2 years ago

jan-niestadt commented 2 years ago

BlackLab uses the XML library VTD-XML by default for processing documents while indexing. This only supports XPath 1.0.

@eduarddrenth made it possible to use Saxon, a more feature-rich alternative (it supports XPath 3) that is potentially faster, although it uses more memory while indexing. In most cases that is unlikely to be a problem.

We should consider changing the default to Saxon, while keeping VTD-XML available for those who want it. If we decide to do this, we should be careful about breaking backwards compatibility.

One solution would be to version .blf.yaml files. For example, if the file starts with

version: 2

# What element starts a new document?
documentPath: //document

...

it automatically defaults to Saxon instead of VTD-XML. We should clearly document the change as well, of course.
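
To make the versioning idea concrete, here is a minimal Python sketch of the selection rule. It is only an illustration of the proposal, not BlackLab's actual code: the function names, the identifiers "vtd" and "saxon", and the top-level `processor:` override key are all assumptions made for this example. It scans unindented `key: value` lines with a regular expression rather than using a real YAML parser.

```python
import re

# Illustrative identifiers only; the real configuration values may differ.
LEGACY_DEFAULT = "vtd"    # assumed name for VTD-XML
NEW_DEFAULT = "saxon"     # assumed name for Saxon

# Matches unindented "key: value" lines; the value stops at a trailing comment.
_TOP_LEVEL_KEY = re.compile(r"^(\w+):[ \t]*([^#\r\n]*)", re.MULTILINE)


def top_level_scalars(config_text):
    """Collect unindented key/value pairs; indented and comment lines are skipped."""
    pairs = {}
    for match in _TOP_LEVEL_KEY.finditer(config_text):
        # Keep the first occurrence of each key.
        pairs.setdefault(match.group(1), match.group(2).strip())
    return pairs


def choose_processor(config_text):
    """Pick the XML processor for a format config (hypothetical rule)."""
    keys = top_level_scalars(config_text)

    # An explicit choice (assumed `processor:` key) always wins, so VTD-XML
    # stays available to anyone who wants it.
    explicit = keys.get("processor")
    if explicit:
        return explicit

    # A missing or non-numeric version is treated as version 1, so existing
    # files keep their current behaviour.
    version_text = keys.get("version", "1")
    version = int(version_text) if version_text.isdigit() else 1
    return NEW_DEFAULT if version >= 2 else LEGACY_DEFAULT
```

Under this rule, files without a `version` line behave as they do today, so only configurations that opt in to version 2 would see the new default.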

Some older (and, dare I say, janky) features could be deprecated if Saxon's better XPath support obviates the need for them.

jan-niestadt commented 1 year ago

Multiple values are now supported, see #393 and #394. Using processing steps on annotations or standoffAnnotations produces an error. Those can likely be done in XPath 3, so they wouldn't need a special feature anymore. We still need to test this more before we think about switching the default parser, though.