INL / BlackLab

Linguistic search for large annotated text corpora, based on Apache Lucene
http://inl.github.io/BlackLab/
Apache License 2.0

Highlighting in documents doesn't handle CDATA sections correctly #521

Open jan-niestadt opened 3 months ago

jan-niestadt commented 3 months ago

"Tags" inside a CDATA section are treated as actual (unbalanced) XML open tags, so closing tags for them are added at the end of the document.

Example:

https://portal.clarin.ivdnt.org/blacklab-server-new/opensonar/docs/WR-P-E-C-0000000129/contents?query=%5Bword%3D%22schip%22%5D&wordstart=7000

jan-niestadt commented 3 months ago

Attempted a fix in 981b8a5de, but it causes a StackOverflowError in the regex evaluation (even though the small-scale test succeeds). Possibly the document is too large to handle this way. See https://stackoverflow.com/a/7510006

jan-niestadt commented 3 months ago

It might be time to consider rewriting highlighting of document fragments using something like https://jsoup.org/

If that doesn't seem practical for whatever reason, another alternative is to loop through the document character by character, only using regexes whenever we find a < (and possibly not using them at all for comments or CDATA, which can get large, unlike tags).
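To make the second alternative concrete, here is a minimal sketch of such a scanner. It is not BlackLab's actual code: the class name `TagScanner` and the method `findTags` are hypothetical. It walks the document with `indexOf`, skips CDATA sections and comments without any regex, and collects only the real tags. It assumes no `>` appears inside attribute values, and it treats processing instructions and DOCTYPE declarations as ordinary tags.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of BlackLab: a regex-free tag scanner.
final class TagScanner {

    /** Returns the tags found outside CDATA sections and comments, in document order. */
    public static List<String> findTags(String xml) {
        List<String> tags = new ArrayList<>();
        int i = 0;
        while (i < xml.length()) {
            int lt = xml.indexOf('<', i);
            if (lt < 0) {
                break; // no more markup
            }
            if (xml.startsWith("<![CDATA[", lt)) {
                // Skip the whole CDATA section; anything inside is plain text.
                i = skipPast(xml, "]]>", lt + "<![CDATA[".length());
            } else if (xml.startsWith("<!--", lt)) {
                // Skip the whole comment.
                i = skipPast(xml, "-->", lt + "<!--".length());
            } else {
                int gt = xml.indexOf('>', lt + 1);
                if (gt < 0) {
                    break; // truncated tag at the end of the fragment
                }
                tags.add(xml.substring(lt, gt + 1));
                i = gt + 1;
            }
        }
        return tags;
    }

    /** Index just past the terminator, or the end of the string if it is missing. */
    private static int skipPast(String s, String terminator, int from) {
        int end = s.indexOf(terminator, from);
        return end < 0 ? s.length() : end + terminator.length();
    }
}
```

Because the CDATA and comment branches only use `indexOf`, their cost is linear in the section length and does not depend on recursion depth, which is the property that matters for large documents.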

jan-niestadt commented 2 months ago

@KCMertens mentioned that Saxon's parsing can be customized as well, including how it deals with unbalanced tags; maybe this could be a good solution.

KCMertens commented 2 months ago

Here is how I've done it in the past using command-line arguments: https://github.com/INL/vws-conversie/blob/master/saxon/run-xslt-tagsoup.sh#L28 That's different, of course, but the docs for doing it programmatically are here: https://saxonica.com/html/documentation9.6/sourcedocs/controlling-parsing.html

TagSoup specifically is written to be lenient.

jan-niestadt commented 2 months ago

I think TagSoup would transform e.g.

a snippet.</s> <s>It starts halfway through a sentence!</s>

to

a snippet.<s></s> <s>It starts halfway through a sentence!</s>

instead of

<s>a snippet.</s> <s>It starts halfway through a sentence!</s>

The latter is what we currently do and (arguably) what we need for our purposes, although we should probably add an ellipsis inside the new start tag, e.g. <s>…, to show that some words are probably missing there.
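For illustration, the desired behaviour could be sketched roughly as follows. This is not the current BlackLab implementation: `FragmentBalancer`, `balance` and the `marker` parameter are hypothetical names. The sketch prepends a start tag for every close tag that has no opener in the fragment, appends close tags for anything still open at the end, and optionally inserts a marker after the synthesized start tags. It assumes the fragment was cut from a well-formed document, and it does not handle CDATA sections or comments, so it would need to be combined with something that skips those.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not part of BlackLab: balances tags in a document fragment.
final class FragmentBalancer {

    /** Balances the fragment; marker (may be empty) follows any synthesized start tags. */
    static String balance(String fragment, String marker) {
        Deque<String> open = new ArrayDeque<>();   // names of tags opened inside the fragment
        StringBuilder prefix = new StringBuilder(); // start tags we have to invent
        int i = 0;
        while (true) {
            int lt = fragment.indexOf('<', i);
            if (lt < 0) break;
            int gt = fragment.indexOf('>', lt + 1);
            if (gt < 0) break;
            String tag = fragment.substring(lt, gt + 1);
            i = gt + 1;
            // Ignore declarations, processing instructions and self-closing tags.
            if (tag.startsWith("<!") || tag.startsWith("<?") || tag.endsWith("/>")) continue;
            if (tag.startsWith("</")) {
                String name = tag.substring(2, tag.length() - 1).trim();
                if (!open.isEmpty() && open.peek().equals(name)) {
                    open.pop(); // matched an opener inside the fragment
                } else {
                    // Opener lies before the fragment; later close tags belong to outer elements.
                    prefix.insert(0, "<" + name + ">");
                }
            } else {
                open.push(nameOf(tag));
            }
        }
        StringBuilder result = new StringBuilder(prefix);
        if (prefix.length() > 0) result.append(marker);
        result.append(fragment);
        while (!open.isEmpty()) result.append("</").append(open.pop()).append('>');
        return result.toString();
    }

    /** Element name of a start tag such as <s id="1">. */
    private static String nameOf(String startTag) {
        int end = 1;
        while (end < startTag.length() - 1 && !Character.isWhitespace(startTag.charAt(end))) end++;
        return startTag.substring(1, end);
    }
}
```

Calling `balance` on the snippet above with an empty marker gives the third form; passing "…" as the marker puts the ellipsis directly after the new `<s>`. Note that synthesized start tags carry no attributes, since those are not recoverable from the fragment alone.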