KorAP / Krill

A Corpus Data Retrieval Index using Lucene for Look-Ups
BSD 2-Clause "Simplified" License

SpanFocusQuery() needs to respect classes outside the sub span #48

Open Akron opened 5 years ago

Akron commented 5 years ago

Follow up to https://github.com/KorAP/Krill/issues/7 :

Currently, focus queries require that classes be part of the sub span. Unfortunately, there are cases where this does not hold.

Example: The span <a>...{1:...}...{2:...}...</a> is modified using focus(2:...), but still contains class 1. If the span is then modified again using focus(1:...), a preceding match span may be <a>...{2:...}...{1:...}...</a>, so the class 1 of the second match may precede the class 1 of the first match.
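To make the ordering problem concrete, here is a minimal sketch in Java. The `Match` type and the `focus` helper are hypothetical stand-ins and not Krill's actual span classes; positions are invented for illustration. Two matches inside the same <a> span are first focused on class 2 and emitted in that order, then focused on class 1 without any re-sorting.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only; these types are not Krill's.
final class FocusOrderDemo {

    static final class Match {
        final int start;   // start of the current match span
        final int class1;  // start position of class 1
        final int class2;  // start position of class 2

        Match(int start, int class1, int class2) {
            this.start = start;
            this.class1 = class1;
            this.class2 = class2;
        }
    }

    // focus(n: ...): move the match span to class n, keep both classes.
    static Match focus(Match m, int classNr) {
        int start = (classNr == 1) ? m.class1 : m.class2;
        return new Match(start, m.class1, m.class2);
    }

    // Applies focus(2: ...) in sorted order, then focus(1: ...) on top.
    static List<Integer> nestedFocusStarts(List<Match> matches) {
        List<Match> inner = new ArrayList<>();
        for (Match m : matches) {
            inner.add(focus(m, 2));
        }
        // The inner focus query emits its matches ordered by class 2.
        inner.sort(Comparator.comparingInt((Match m) -> m.start));

        // The outer focus query only re-targets each match to class 1.
        List<Integer> starts = new ArrayList<>();
        for (Match m : inner) {
            starts.add(focus(m, 1).start);
        }
        return starts;
    }

    static List<Integer> example() {
        List<Match> matches = new ArrayList<>();
        matches.add(new Match(0, 3, 5)); // <a>...{1:...}...{2:...}...</a>
        matches.add(new Match(0, 8, 2)); // <a>...{2:...}...{1:...}...</a>
        return nestedFocusStarts(matches);
    }
}
```

In this sketch `FocusOrderDemo.example()` yields the start positions 8 and then 3, so the outer focus on class 1 produces matches that are no longer in ascending order.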

Akron commented 5 years ago

A naive approach would simply remove from the payloads all classes that lie outside the scope of the current span in the focus query. However, there are probably valid queries (e.g. reference queries) that require intact classes and multiple nested focus queries, and these would no longer work.
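A rough sketch of that naive pruning, and of why it is lossy, might look like the following. `ClassSpan`, `prune`, and `hasClass` are hypothetical names introduced for illustration only; Krill's real payload encoding is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; these types are not Krill's.
final class NaivePruningSketch {

    // One class entry in a match payload: class number plus positions.
    static final class ClassSpan {
        final int number;
        final int start;
        final int end;

        ClassSpan(int number, int start, int end) {
            this.number = number;
            this.start = start;
            this.end = end;
        }
    }

    // Naive pruning: keep only classes that lie fully inside [start, end].
    static List<ClassSpan> prune(List<ClassSpan> payload, int start, int end) {
        List<ClassSpan> kept = new ArrayList<>();
        for (ClassSpan c : payload) {
            if (c.start >= start && c.end <= end) {
                kept.add(c);
            }
        }
        return kept;
    }

    // Returns true if the payload still carries the given class number.
    static boolean hasClass(List<ClassSpan> payload, int number) {
        for (ClassSpan c : payload) {
            if (c.number == number) {
                return true;
            }
        }
        return false;
    }

    // <a>...{1:...}...{2:...}...</a> focused on class 2 at positions 5-6.
    static List<ClassSpan> example() {
        List<ClassSpan> payload = new ArrayList<>();
        payload.add(new ClassSpan(1, 3, 4));
        payload.add(new ClassSpan(2, 5, 6));
        return prune(payload, 5, 6);
    }
}
```

After pruning to the class 2 span, class 1 is gone from the payload, so an enclosing focus(1:...) or any query that refers back to class 1 would have nothing left to work with.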

Akron commented 5 years ago

A better approach may be to let the focus query keep track of the largest span by adding a payload with a fixed class number > 128 that holds the minimal start and the maximal end position. When comparing with the highest priority matches, this tracked span (if set) would be taken into account instead of the current wrapped query (see https://github.com/KorAP/Krill/issues/7).
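As a sketch of how such tracking could work, assuming a simplified payload model: the class number 129 below is an arbitrary placeholder (the proposal only says "> 128"), and `Match`, `ClassSpan`, `focus`, and `comparisonSpan` are hypothetical names, not Krill's API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; these types are not Krill's.
final class TrackedSpanSketch {

    // Assumed reserved class number; the proposal only requires > 128.
    static final int TRACK_CLASS = 129;

    static final class ClassSpan {
        final int number;
        final int start;
        final int end;

        ClassSpan(int number, int start, int end) {
            this.number = number;
            this.start = start;
            this.end = end;
        }
    }

    static final class Match {
        final int start;
        final int end;
        final List<ClassSpan> payload;

        Match(int start, int end, List<ClassSpan> payload) {
            this.start = start;
            this.end = end;
            this.payload = payload;
        }
    }

    // Looks up a class in the payload; returns null if it is absent.
    static ClassSpan find(Match m, int number) {
        for (ClassSpan c : m.payload) {
            if (c.number == number) {
                return c;
            }
        }
        return null;
    }

    // Focus on a class and record the largest span seen so far.
    static Match focus(Match m, int classNr) {
        ClassSpan target = find(m, classNr);
        if (target == null) {
            return null; // the class is missing, so there is no match
        }
        ClassSpan tracked = find(m, TRACK_CLASS);
        int minStart = (tracked == null) ? m.start : Math.min(tracked.start, m.start);
        int maxEnd = (tracked == null) ? m.end : Math.max(tracked.end, m.end);

        // Copy all regular classes and replace the tracking entry.
        List<ClassSpan> payload = new ArrayList<>();
        for (ClassSpan c : m.payload) {
            if (c.number != TRACK_CLASS) {
                payload.add(c);
            }
        }
        payload.add(new ClassSpan(TRACK_CLASS, minStart, maxEnd));
        return new Match(target.start, target.end, payload);
    }

    // Span used for comparisons: the tracked span if set, else the match span.
    static int[] comparisonSpan(Match m) {
        ClassSpan tracked = find(m, TRACK_CLASS);
        if (tracked == null) {
            return new int[] { m.start, m.end };
        }
        return new int[] { tracked.start, tracked.end };
    }

    // <a> at 0-10 with class 1 at 3-4 and class 2 at 5-6, focused twice.
    static Match example() {
        List<ClassSpan> payload = new ArrayList<>();
        payload.add(new ClassSpan(1, 3, 4));
        payload.add(new ClassSpan(2, 5, 6));
        Match a = new Match(0, 10, payload);
        return focus(focus(a, 2), 1);
    }
}
```

In this sketch the doubly focused match ends up at positions 3-4, while the tracked span still covers 0-10, the extent of the original <a> span, so a comparison step could use that instead of the narrowed span.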

Akron commented 5 years ago

An optimization to this approach would involve an attribute like "keepTrack" that tells a focus query to alter the payload if it is wrapped by another focus query. This could be set during the toQuery() optimization phase.
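One way to picture this is a single pass over the query tree before execution. The sketch below assumes a simplified tree where each query wraps at most one other query; `Query`, `FocusQuery`, and `markNestedFocus` are hypothetical and do not reflect how Krill's toQuery() is actually structured.

```java
// Illustrative sketch only; these types are not Krill's.
final class KeepTrackSketch {

    // Simplified query node that wraps at most one other query.
    static class Query {
        final Query wrapped; // null for a leaf query

        Query(Query wrapped) {
            this.wrapped = wrapped;
        }
    }

    static final class FocusQuery extends Query {
        final int classNr;
        boolean keepTrack = false; // only set when nested in another focus

        FocusQuery(int classNr, Query wrapped) {
            super(wrapped);
            this.classNr = classNr;
        }
    }

    // Walks down the tree and flags focus queries wrapped by a focus query.
    static void markNestedFocus(Query q, boolean insideFocus) {
        if (q == null) {
            return;
        }
        boolean isFocus = q instanceof FocusQuery;
        if (isFocus && insideFocus) {
            ((FocusQuery) q).keepTrack = true;
        }
        markNestedFocus(q.wrapped, insideFocus || isFocus);
    }
}
```

With focus(1: focus(2: ...)), calling `markNestedFocus(outer, false)` would flag only the inner focus query, so a standalone focus query would not pay the cost of writing the extra payload.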

margaretha commented 5 years ago

> A better approach may be to let the focus query keep track of the largest span (by adding a payload

It sounds good and quite simple!