How to deal with lemmas with multiple values?

baudbaudy commented 3 months ago

Hi there! First of all, thank you for your work on BlackLab. I have a problem for which I haven't found a solution in the documentation or in the issues on GitHub. I work on French letters, and in our TEI files some words can have several lemmas (example: <w lemma="LE|À" pos="art. def.|prép.">au</w>).

In the corpus-frontend, searching by word works very well. However, searching for the lemma "LE|À" returns no results, and searching for only the lemma "LE" or "À" does not find the word "au". Do you have any suggestions for resolving this problem?

Thank you for your time and assistance.

KCMertens commented 3 months ago

You're being bitten by a feature in the simple, extended and advanced search unfortunately. Everything entered is treated verbatim, except for these 3 special cases:

| is treated as "or"
* matches zero or more characters
? matches any single character

More specifically, what you enter is converted to a regex, and these 3 are substituted in the following way:

| -> | (left alone)
* -> .*
? -> .

Unfortunately, you can't bypass this at the moment, so to find the | literally, you'll have to use the expert view and enter the regex yourself. For your example that would look like this: [lemma="LE\|À"] (note the escaping backslash \ before the pipe |).
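
To illustrate why the unescaped pipe fails, here is a minimal Python sketch of the substitution described above. It is not the corpus-frontend's actual code: `simple_to_regex` is a hypothetical helper, and Python's `re` module only stands in for BlackLab's regex engine, which may differ in details.

```python
import re


def simple_to_regex(user_input: str) -> str:
    """Hypothetical helper mimicking the substitution described above.

    '|' is left alone, '*' becomes '.*', '?' becomes '.', and every
    other character is escaped so that it matches verbatim.
    """
    parts = []
    for ch in user_input:
        if ch == "|":
            parts.append("|")      # left alone: regex alternation
        elif ch == "*":
            parts.append(".*")     # zero or more characters
        elif ch == "?":
            parts.append(".")      # any one character
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


indexed_lemma = "LE|À"  # the value stored on the token "au"

# Simple/extended/advanced search: the pipe becomes "LE or À",
# and neither alternative equals the whole indexed value.
simple_pattern = simple_to_regex("LE|À")
matches_simple = re.fullmatch(simple_pattern, indexed_lemma) is not None

# Expert view: the escaped pipe matches the literal character.
matches_expert = re.fullmatch(r"LE\|À", indexed_lemma) is not None
```

In this sketch `matches_simple` is false and `matches_expert` is true, which mirrors the behaviour reported in the question.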

Sidebar: BlackLab supports multiple values, so what you could also do is index both the full value and the individual values for the lemma and pos. The token will then match for any of the values.

You could do this as follows:

annotatedFields:
  contents:
    annotations:
      - name: lemma
        displayName: Lemma
        valuePath: "@lemma"
        multipleValues: true
        allowDuplicateValues: false
        process:
          - action: split
            separator: "\\|"
            keep: both

      - name: pos
        displayName: Part of Speech
        valuePath: "@pos"
        multipleValues: true
        allowDuplicateValues: false
        process:
          - action: split
            separator: "\\|"
            keep: both

The split process option is explained here: https://inl.github.io/BlackLab/guide/how-to-configure-indexing.html#processing-values

There is a caveat though: there are 3 values for lemma (['LE|À', 'LE', 'À']), but only the first value on any token can be shown in the UI. That first value is also what is used when sorting or grouping the results (for example, grouping on lemma would put your example word in the LE|À group only, not in the group for LE or À).
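
As a rough illustration of the values this configuration is described as producing, here is a small Python sketch. It is not BlackLab code: `split_keep_both` is a hypothetical function that only imitates the behaviour described above (keep the full value first, add the parts, drop duplicates).

```python
from typing import List


def split_keep_both(value: str, separator: str = "|") -> List[str]:
    """Hypothetical imitation of a split that keeps both the original
    value and its parts, with duplicate values removed."""
    candidates = [value] + value.split(separator)
    seen = set()
    unique = []
    for candidate in candidates:
        if candidate not in seen:   # mimics allowDuplicateValues: false
            seen.add(candidate)
            unique.append(candidate)
    return unique


lemma_values = split_keep_both("LE|À")
pos_values = split_keep_both("art. def.|prép.")

# The first value is the one the UI would show and group on.
displayed_lemma = lemma_values[0]
```

Under these assumptions a token with a single lemma keeps just that one value, because the full value and its only part are identical.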

baudbaudy commented 3 months ago

Very good, thank you for your response and advice.

INL / corpus-frontend

How to deal with lemmas with multiple values? #519