Open jan-niestadt opened 3 months ago
Attempted a fix in 981b8a5de, but it causes a StackOverflowError during regex evaluation (even though a small-scale test succeeds). Possibly the document is too large to handle this way. See https://stackoverflow.com/a/7510006
It might be time to consider rewriting highlighting of document fragments using something like https://jsoup.org/
If that doesn't seem practical for whatever reason, another alternative is to loop through the document character by character, only using regexes when we find a < (and possibly not using them at all for comments or CDATA, which can get large, unlike tags).
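To make the character-by-character idea concrete, here is a rough sketch, not the actual BlackLab code. The class `TagScanner` and its method `findTags` are hypothetical names. It walks the text, skips comments and CDATA sections with plain `indexOf` calls, and only runs a regex on the short substring of a single tag, so the regex never sees a large input. It assumes no `>` occurs inside attribute values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: locates tags by scanning instead of one regex over the whole document. */
final class TagScanner {
    // Only ever applied to the text of a single tag, so the regex input stays small.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z_][\\w:.-]*)[^>]*?(/?)>", Pattern.DOTALL);

    /** Returns the open, close and self-closing tags found, skipping comments and CDATA. */
    static List<String> findTags(String xml) {
        List<String> tags = new ArrayList<>();
        int i = 0;
        while (i < xml.length()) {
            if (xml.charAt(i) != '<') {
                i++; // ordinary text: no regex involved
            } else if (xml.startsWith("<!--", i)) {
                i = skipPast(xml, i + 4, "-->"); // comment: may be large, skip without regex
            } else if (xml.startsWith("<![CDATA[", i)) {
                i = skipPast(xml, i + 9, "]]>"); // CDATA: "tags" inside are plain text
            } else {
                // Assumption: no '>' inside attribute values.
                int end = xml.indexOf('>', i);
                if (end < 0) {
                    break; // truncated tag at the end of the fragment
                }
                Matcher m = TAG.matcher(xml.substring(i, end + 1));
                if (m.matches()) {
                    tags.add(m.group()); // processing instructions, DOCTYPE etc. do not match
                }
                i = end + 1;
            }
        }
        return tags;
    }

    /** Index just past the terminator, or end of input if the section is cut off. */
    private static int skipPast(String s, int from, String terminator) {
        int end = s.indexOf(terminator, from);
        return end < 0 ? s.length() : end + terminator.length();
    }
}
```

With this approach, something like `<![CDATA[ <y> ]]>` contributes no tags at all, which would also avoid treating CDATA content as unbalanced markup.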
@KCMertens mentioned that Saxon's parsing can be customized as well, including how to deal with unbalanced tags; maybe this could be a good solution.
Here is how I've done it in the past using command-line arguments: https://github.com/INL/vws-conversie/blob/master/saxon/run-xslt-tagsoup.sh#L28 That's different of course, but docs for doing it programmatically are here: https://saxonica.com/html/documentation9.6/sourcedocs/controlling-parsing.html
TagSoup specifically is written to be lenient.
I think TagSoup would transform e.g.
a snippet.</s> <s>It starts halfway through a sentence!</s>
to
a snippet.<s></s> <s>It starts halfway through a sentence!</s>
instead of
<s>a snippet.</s> <s>It starts halfway through a sentence!</s>
The latter is what we currently do and (arguably) what we need for our purposes (although we should probably add an ellipsis inside the new start tag, e.g. <s>…, to show that some words are probably missing there).
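For reference, the balancing behaviour described above could look roughly like the sketch below. This is not the current implementation. `SnippetBalancer` and `balance` are hypothetical names, comments and CDATA are ignored, and any close tag that doesn't match the innermost open tag is simply treated as having lost its start tag. Placing a single ellipsis after the last prepended start tag is also an assumption. A regex is used for brevity here, on the assumption that fragments are small.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: makes a document fragment well-formed by adding missing tags. */
final class SnippetBalancer {
    // Open, close and self-closing tags; comments and CDATA are not handled in this sketch.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z_][\\w:.-]*)[^<>]*?(/?)>");

    static String balance(String fragment) {
        Deque<String> open = new ArrayDeque<>();    // tags opened inside the fragment
        StringBuilder prefix = new StringBuilder(); // start tags cut off before the fragment
        Matcher m = TAG.matcher(fragment);
        while (m.find()) {
            boolean isClose = !m.group(1).isEmpty();
            boolean isSelfClosing = !m.group(3).isEmpty();
            String name = m.group(2);
            if (isSelfClosing) {
                continue; // already balanced
            }
            if (!isClose) {
                open.push(name);
            } else if (name.equals(open.peek())) {
                open.pop();
            } else {
                // No matching open tag in the fragment: prepend one.
                // Later unmatched close tags are further out, so they go in front.
                prefix.insert(0, "<" + name + ">");
            }
        }
        StringBuilder result = new StringBuilder();
        if (prefix.length() > 0) {
            result.append(prefix).append('\u2026'); // ellipsis: words are probably missing here
        }
        result.append(fragment);
        while (!open.isEmpty()) {
            result.append("</").append(open.pop()).append('>'); // close what was left open
        }
        return result.toString();
    }
}
```

Called on the example snippet above, this prepends `<s>` followed by an ellipsis and leaves the rest of the fragment unchanged.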
"Tags" inside a CDATA are seen as actual (unbalanced) XML open tags, and closing tags are added at the end of the document.
Example:
https://portal.clarin.ivdnt.org/blacklab-server-new/opensonar/docs/WR-P-E-C-0000000129/contents?query=%5Bword%3D%22schip%22%5D&wordstart=7000