KorAP / Koral

:pencil: Translation of query languages to serialized KoralQuery protocol
BSD 2-Clause "Simplified" License

add the ~ operators from vanilla Poliqarp #119

Open bansp opened 2 years ago

bansp commented 2 years ago

TL;DR: consider introducing the ~ family of operators that search among non-disambiguated morphological values as opposed to those disambiguated in the morphosyntactic context. The potential training ground for that is the new NKJP-SGJP dataset (being) converted from TEI to KorAP XML.


Below, I paste a longer passage from Eureco meeting materials:

“A unique feature of Poliqarp is that it may be used for searching corpora containing, in addition to disambiguated interpretations, information about all possible morphosyntactic interpretations given by the morphological analyser. For example, the query [case~acc] finds all segments with an accusative interpretation (even if this is not the interpretation selected in a given context), while [case=acc] finds segments which were disambiguated to accusative in a given context.

Moreover, Poliqarp does not make the assumption that only one interpretation must be correct for any given segment; some examples of sentences containing an ambiguous segment which cannot be uniquely disambiguated even given unlimited context and all the linguistic and encyclopaedic knowledge are cited in (Przepiórkowski et al., 2004). In such cases, the = operator has the existential meaning, i.e., [case=acc] finds segments with at least one accusative interpretation marked as correct in the context (“disambiguated”). On the other hand, the operator == is universal, i.e., [case==acc] finds segments whose all disambiguated interpretations are accusative: segments which were truly uniquely disambiguated to one (accusative) interpretation, or segments which have many interpretations correct in the context, but all of them are accusative. For completeness, the operator ~~ is added, which universally applies to all morphosyntactic interpretations, i.e., [case~~acc] finds segments whose all interpretations as given by a morphological analyser (before disambiguation) are accusative.”

Source of the quote: https://dl.acm.org/doi/pdf/10.5555/1557769.1557795 “Poliqarp: an open source corpus indexer and search engine with syntactic extensions”, by Janus and Przepiórkowski
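To make the four operators concrete, here is a minimal Python sketch of the semantics as described in the quote. It is not Poliqarp's or Koral's implementation: the `Interp` class and the `match` function are hypothetical names introduced for illustration, and the treatment of an empty set of interpretations (the universal operators return `False`) is an assumption that the quote does not address.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interp:
    """One morphosyntactic interpretation of a segment (hypothetical model)."""
    case: str
    chosen: bool  # True if marked as correct in context ("disambiguated")


def match(segment, op, value):
    """Evaluate [case OP value] for one segment (a list of Interp)."""
    disamb = [i for i in segment if i.chosen]
    if op == "=":    # at least one disambiguated interpretation matches
        return any(i.case == value for i in disamb)
    if op == "==":   # all disambiguated interpretations match
        # Assumption: an empty set does not match.
        return bool(disamb) and all(i.case == value for i in disamb)
    if op == "~":    # at least one interpretation matches, chosen or not
        return any(i.case == value for i in segment)
    if op == "~~":   # all interpretations match, chosen or not
        # Assumption: an empty set does not match.
        return bool(segment) and all(i.case == value for i in segment)
    raise ValueError(f"unknown operator: {op!r}")


# A segment that is ambiguous between nom/acc/voc and disambiguated to nom.
segment = [Interp("nom", True), Interp("acc", False), Interp("voc", False)]
print(match(segment, "=", "acc"))   # acc was not selected in context
print(match(segment, "~", "acc"))   # acc is among the analyser's proposals
```

With this model, `=` and `==` only ever look at the interpretations flagged as chosen, while `~` and `~~` look at everything the analyser proposed.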

So at this point, the morphological info has two basic parts. One is the “traditional” part (`<f name="lex">`) with the added “translit” container, renamed to “orig”:

            <f name="lex"><!-- _zdarza -->
               <!-- the comment above is the "orth", included just for testing; it can be suppressed -->
               <fs>
                  <f name="orig">zdarza</f>          <!-- the original spelling, possibly with typos -->
                  <f name="lemma">zdarzać</f>        <!-- the "base" in Poliqarp terms -->
                  <f name="pos">fin</f>
                  <f name="msd">sg:ter:imperf</f>    <!-- morphosyntactic info, may be missing -->
               </fs>            <!-- recall that "orth" is recovered from the offsets -->
            </f>

The new part is `<f name="interps">`. The feature name was simply taken over from the original and stands for “interpretations”, of course. It contains one or more alternatives encoded in `<fs type="alt">`. These are all the potential values of the given token before disambiguation.

            <f name="interps">
               <fs type="alt" n="choice">
                  <f name="lemma">doświadczenie</f>
                  <f name="pos">subst</f>
                  <f name="msd">
                     <vAlt>
                        <symbol value="sg:nom:n:ncol" n="choice"/>
                        <symbol value="sg:acc:n:ncol"/>
                        <symbol value="sg:voc:n:ncol"/>
                     </vAlt>
                  </f>
               </fs>
               <fs type="alt">
                  <f name="lemma">doświadczyć</f>
                  <f name="pos">ger</f>
                  <f name="msd">
                     <vAlt>
                        <symbol value="sg:nom:n:perf:aff"/>
                        <symbol value="sg:acc:n:perf:aff"/>
                     </vAlt>
                  </f>
               </fs>
            </f>

Notice that there are two sets of alternatives: one is at the lexical level (`<fs type="alt">`), and the other, within a single lexical hypothesis, involves a set of alternative morphosyntactic descriptions, contained inside `<vAlt>` (which is TEI-speak for “alternative values”).
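The two levels of alternatives can be flattened into one row per (lemma, pos, msd) combination. The sketch below does that with Python's standard `xml.etree.ElementTree`. It assumes the fragment is used without a TEI namespace, and it assumes that a row counts as the disambiguated one only when both the `<fs>` and the `<symbol>` carry `n="choice"`. The function name `flatten_interps` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The <f name="interps"> fragment from the example above, without namespaces.
INTERPS = """
<f name="interps">
  <fs type="alt" n="choice">
    <f name="lemma">doświadczenie</f>
    <f name="pos">subst</f>
    <f name="msd">
      <vAlt>
        <symbol value="sg:nom:n:ncol" n="choice"/>
        <symbol value="sg:acc:n:ncol"/>
        <symbol value="sg:voc:n:ncol"/>
      </vAlt>
    </f>
  </fs>
  <fs type="alt">
    <f name="lemma">doświadczyć</f>
    <f name="pos">ger</f>
    <f name="msd">
      <vAlt>
        <symbol value="sg:nom:n:perf:aff"/>
        <symbol value="sg:acc:n:perf:aff"/>
      </vAlt>
    </f>
  </fs>
</f>
"""


def flatten_interps(xml_text):
    """Return one (lemma, pos, msd, chosen) tuple per msd alternative."""
    root = ET.fromstring(xml_text)
    rows = []
    for fs in root.findall("fs[@type='alt']"):          # lexical alternatives
        lemma = fs.findtext("f[@name='lemma']")
        pos = fs.findtext("f[@name='pos']")
        lexical_choice = fs.get("n") == "choice"
        for sym in fs.findall("f[@name='msd']/vAlt/symbol"):  # msd alternatives
            chosen = lexical_choice and sym.get("n") == "choice"
            rows.append((lemma, pos, sym.get("value"), chosen))
    return rows


for row in flatten_interps(INTERPS):
    print(row)
```

For this fragment the function yields five rows, of which only the `subst` / `sg:nom:n:ncol` row is flagged as chosen.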

So, `<f name="lex">` is post-disambiguation, and in the last case at hand, it looks as follows:
```xml
            <f name="lex">
               <fs>
                  <f name="orig">doświadczenie</f>
                  <f name="lemma">doświadczenie</f>
                  <f name="pos">subst</f>
                  <f name="msd">sg:nom:n:ncol</f>
               </fs>
            </f>
```

and `<f name="interps">` is pre-disambiguation, as described in the above quote from the article by Przepiórkowski and Janus. Redundantly, in the pre-disambiguation part, I have marked the eventual choices with the extra attribute `n="choice"`, which points at the same info as what `<f name="lex">` contains.


Note:

Akron commented 1 year ago

I am thinking about how to implement this so that it is universally useful.

For the moment I would guess we have to add another Layer for interpretations, like "pv" for pos-variants and "mv" for morphosyntactic variants. Then

where exclude() is the negative match operator, matching at the span of the first operand whenever no second operand has the same span.

Is that correct?
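For reference, the `exclude()` behaviour described above can be sketched in a few lines of Python. This is only an illustration of the stated semantics, not Koral or Krill code: spans are modelled as hypothetical `(start, end)` tuples, and the function signature is an assumption.

```python
def exclude(first, second):
    """Keep each span of `first` for which no span in `second` is identical.

    Both arguments are iterables of (start, end) tuples (hypothetical model).
    """
    blocked = set(second)
    return [span for span in first if span not in blocked]


# Spans matched by the first operand and by the second operand.
first_operand = [(0, 1), (2, 3), (5, 6)]
second_operand = [(2, 3), (4, 5)]
print(exclude(first_operand, second_operand))
```

Only exact span identity blocks a match here; overlapping but different spans in the second operand have no effect.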