Open bansp opened 2 years ago
I am thinking about how to implement this so that it is universally useful.
For the moment I would guess we have to add another layer for interpretations, like "pv" for POS variants and "mv" for morphosyntactic variants. Then:
[case=acc]
-> [nkjp/m=case:acc]
[case~acc]
-> [nkjp/m=case:acc | nkjp/mv=case:acc]
[case==acc]
-> exclude([nkjp/m=case:acc],[nkjp/mv=case:.*])
[case~~acc]
-> exclude([nkjp/m=case:acc],[nkjp/mv=case:.*])|[nkjp/m=case:acc & nkjp/mv=case:acc]
where exclude() is the negative match operator, matching at the span of the first operand whenever no second operand has the same span.
Is that correct?
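To make the proposed mapping easier to check, here is a minimal Python sketch that rewrites the four Poliqarp-style queries into the forms listed above. It is only an illustration of the proposal in this comment: the function name `rewrite`, the default foundry `nkjp`, and the layers `m` and `mv` are assumptions taken from the examples, not an existing KorAP API, and `exclude()` is emitted as plain text rather than evaluated.

```python
import re

# Hypothetical helper: accepts only single-attribute queries such as [case=acc].
# Two-character operators come first so "==" is not read as "=" followed by "=".
_QUERY = re.compile(r"^\[(\w+)(==|~~|=|~)(\w+)\]$")


def rewrite(query, foundry="nkjp"):
    """Rewrite a Poliqarp-style query into the form proposed in this thread."""
    match = _QUERY.match(query.strip())
    if match is None:
        raise ValueError(f"unsupported query: {query!r}")
    key, op, value = match.groups()

    m = f"{foundry}/m={key}:{value}"        # disambiguated layer (assumed name "m")
    mv = f"{foundry}/mv={key}:{value}"      # variants layer (assumed name "mv")
    mv_any = f"{foundry}/mv={key}:.*"       # any variant value for this attribute
    no_variants = f"exclude([{m}],[{mv_any}])"

    if op == "=":
        return f"[{m}]"
    if op == "~":
        return f"[{m} | {mv}]"
    if op == "==":
        return no_variants
    # Remaining case: op == "~~"
    return f"{no_variants}|[{m} & {mv}]"


if __name__ == "__main__":
    for q in ("[case=acc]", "[case~acc]", "[case==acc]", "[case~~acc]"):
        print(q, "->", rewrite(q))
```

The sketch reproduces the mapping exactly as written above, so any doubt about whether that mapping captures the intended semantics applies to the sketch as well.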
TL;DR: consider introducing the ~ family of operators that search among non-disambiguated morphological values as opposed to those disambiguated in the morphosyntactic context. The potential training ground for that is the new NKJP-SGJP dataset (being) converted from TEI to KorAP XML.
Below, I paste a longer passage from the Eureco meeting materials:
“A unique feature of Poliqarp is that it may be used for searching corpora containing, in addition to disambiguated interpretations, information about all possible morphosyntactic interpretations given by the morphological analyser. For example, the query [case~acc] finds all segments with an accusative interpretation (even if this is not the interpretation selected in a given context), while [case=acc] finds segments which were disambiguated to accusative in a given context.
Moreover, Poliqarp does not make the assumption that only one interpretation must be correct for any given segment; some examples of sentences containing an ambiguous segment which cannot be uniquely disambiguated even given unlimited context and all the linguistic and encyclopaedic knowledge are cited in (Przepiórkowski et al., 2004). In such cases, the = operator has the existential meaning, i.e., [case=acc] finds segments with at least one accusative interpretation marked as correct in the context (“disambiguated”). On the other hand, the operator == is universal, i.e., [case==acc] finds segments whose all disambiguated interpretations are accusative: segments which were truly uniquely disambiguated to one (accusative) interpretation, or segments which have many interpretations correct in the context, but all of them are accusative. For completeness, the operator ~~ is added, which universally applies to all morphosyntactic interpretations, i.e., [case~~acc] finds segments whose all interpretations as given by a morphological analyser (before disambiguation) are accusative.”
Source of the quote: https://dl.acm.org/doi/pdf/10.5555/1557769.1557795 “Poliqarp: an open source corpus indexer and search engine with syntactic extensions”, by Janus and Przepiórkowski
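The four operators described in the quote can be modelled as existential or universal checks over two sets of interpretations. The following Python sketch is not Poliqarp's implementation; the `Segment` class and the `matches` function are hypothetical names, and treating a segment with no interpretations as a non-match for the universal operators is an assumption the quote does not address.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    # Hypothetical model of one segment, reduced to a single attribute (case).
    all_cases: tuple      # values from every interpretation given by the analyser
    disamb_cases: tuple   # values from interpretations marked correct in context


def matches(segment, op, value):
    """Evaluate [case OP value] against one segment."""
    if op == "=":    # existential over disambiguated interpretations
        return any(c == value for c in segment.disamb_cases)
    if op == "==":   # universal over disambiguated interpretations
        # Assumption: an empty set does not match.
        return bool(segment.disamb_cases) and all(
            c == value for c in segment.disamb_cases
        )
    if op == "~":    # existential over all interpretations
        return any(c == value for c in segment.all_cases)
    if op == "~~":   # universal over all interpretations
        # Assumption: an empty set does not match.
        return bool(segment.all_cases) and all(
            c == value for c in segment.all_cases
        )
    raise ValueError(f"unknown operator: {op!r}")


if __name__ == "__main__":
    # A form ambiguous between nominative and accusative, disambiguated to accusative.
    seg = Segment(all_cases=("nom", "acc"), disamb_cases=("acc",))
    for op in ("=", "==", "~", "~~"):
        print(f"[case{op}acc]", matches(seg, op, "acc"))
```

For the example segment, `=`, `==` and `~` match, while `~~` does not, because the analyser also proposed a nominative reading.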
So at this point, the morphological info has two basic parts. One is the “traditional” part (<f name="lex">) with the added “translit” container, renamed to “orig”.
The new part is <f name="interps">; the name of the feature was simply taken over from the original and stands for “interpretations”, of course. It contains one or more alternatives encoded in <fs type="alt">. These are all potential values of the given token before disambiguation.
So <f name="lex"> is post-disambiguation and <f name="interps"> is pre-disambiguation, as described in the above quote from the article by Przepiórkowski and Janus. Redundantly, in the pre-disambiguation part, I have marked the eventual choices with the extra attribute n="choice", which points at the same info as what <f name="lex"> contains.
Note: