alpheios-project / alpheios-core

Alpheios Core Javascript Packages and Libraries

Remove numbers from lemma in wordlist #623

Closed monzug closed 2 years ago

monzug commented 3 years ago

Remove numbers from lemma in wordlist; see #599 for reference. Double-click on omnia in https://texts-test.alpheios.net/text/urn:cts:latinLit:phi0959.phi006.alpheios-text-lat1/passage/1.1-1.30 and the lemma is shown as omnis1, omne, omnia in the wordlist. Also, when limiting to lemma, the number is still in the lemma. (Note that the lemma in the wordlist from a lookup word does not have any number.)

Screen Shot 2021-01-21 at 2 42 06 PM

lemma in wordlist from lookup word Screen Shot 2021-01-21 at 3 27 28 PM

balmas commented 3 years ago

please make the fix in a branch off of the incr-3.3.x branch.

kirlat commented 3 years ago

This issue seems to be related to the usage of Treebank data. It does not occur when the Treebank data is not involved.

I am able to reproduce it and am working on a fix.

monzug commented 3 years ago

Yes, it's treebank related. see also #599

kirlat commented 3 years ago

The variant returned by the Treebank is omnis1, and due to how our current disambiguation works, we prefer that variant for lemma.word over the Tufts value of omnis.

The disambiguation function that decides this is https://github.com/alpheios-project/alpheios-core/blob/incr-3.3.x/packages/data-models/src/lemma.js#L221-L266. Neither word has mixed case and neither needs normalization, so we skip any logic related to that. But the otherLemma (which is the treebank's value) happens to have a digit at the end, so we prefer it over the Tufts value: https://github.com/alpheios-project/alpheios-core/blob/incr-3.3.x/packages/data-models/src/lemma.js#L258.

With lexeme.setDisambiguation() https://github.com/alpheios-project/alpheios-core/blob/incr-3.3.x/packages/data-models/src/lexeme.js#L238-L243 we replace the lemma.word value only; the lemma.principalParts field is not updated and remains the same as it is in the Tufts results (i.e. with omnis, not omnis1).

As a result of the disambiguation process, the first lexeme of the word contains the following information:

lemma.word: 'omnis1'
lemma.principalParts: `['omnis', 'omnis', 'omne']`

Why do we not see omnis1 in the popup? Because we display only the principal parts there. But we do see it in the wordlist, where the lemma.word fields are used.

This does not happen when there is no treebank data on the page: we simply take the Tufts value of omnis and the disambiguation does not occur.
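
To illustrate the mismatch described above, here is a minimal sketch. It is a simplified, hypothetical model and not the code in lemma.js or lexeme.js: the names `endsWithDigit`, `preferredWord` and `applyDisambiguation` are assumptions introduced only for this example, and the real disambiguation also handles case and normalization, which are left out here.

```javascript
// Simplified, hypothetical model of the behavior described above.
// It is NOT the actual code in lemma.js or lexeme.js.
const endsWithDigit = (word) => /\d$/.test(word)

// When only the other (treebank) spelling ends in a digit, prefer it.
function preferredWord (ownWord, otherWord) {
  return endsWithDigit(otherWord) && !endsWithDigit(ownWord) ? otherWord : ownWord
}

// Replace only `word`; the principal parts are copied over unchanged.
function applyDisambiguation (lemma, treebankWord) {
  return { ...lemma, word: preferredWord(lemma.word, treebankWord) }
}

const tuftsLemma = { word: 'omnis', principalParts: ['omnis', 'omnis', 'omne'] }
const disambiguated = applyDisambiguation(tuftsLemma, 'omnis1')

console.log(disambiguated.word) // used by the wordlist
console.log(disambiguated.principalParts.join(', ')) // used by the popup
```

In this model the wordlist value becomes omnis1 while the principal parts still start with omnis, which matches the two fields listed above.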

I'm not sure what would be the correct course of action here, especially considering that the change may affect so many use cases on potentially multiple languages. @balmas, @monzug, what algorithm would help to fix that?

monzug commented 3 years ago

Unfortunately I don't know what the algorithm could be. I thought it was related to #599, but it might not be. I cannot give much input here, except another example Screen Shot 2021-01-26 at 2 55 43 PM

monzug commented 3 years ago

and from lookup Screen Shot 2021-01-26 at 3 24 56 PM

balmas commented 3 years ago

@kirlat for #141 we introduced a displayWord method on Lemma, which produces a version of the lemma stripped of trailing digits for display purposes only. As you noted, we need to keep the original lemma with the digit intact, for matching against treebanks and dictionaries, but the digit is meaningless for most end-users.
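
As a rough illustration of "stripped of trailing digits for display purposes only", the idea can be sketched as below. `stripTrailingDigits` is a hypothetical stand-alone helper written for this example; the actual displayWord implementation on Lemma may differ.

```javascript
// Hypothetical helper for illustration; not the actual Lemma.displayWord code.
// Removes one or more digits at the very end of the word and nothing else.
function stripTrailingDigits (word) {
  return word.replace(/\d+$/, '')
}

const lemma = { word: 'omnis1' }

// The stored value is left intact for matching against treebanks and dictionaries.
console.log(lemma.word)
// Only the displayed value loses the digit.
console.log(stripTrailingDigits(lemma.word))
```

The point of keeping this as a display-time transformation is that lemma.word itself never changes.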

I think the wordlist could just use this function on lemma to display the lemma without the digit. I think we might want to make the logic slightly more sophisticated to handle the case of multiple lemmas with different digits for a single form.

For example, say domuisse had both domo1 and domo2 as lemmas; then we would probably want to keep those digits rather than stripping and deduping, because they indicate to the user that there are two distinct words.

But domo and domo1 could be deduped.

so with the following rules:

if 2 lemmas, identical except for a 1 at the end, then show only one, without the 1
if 2 lemmas, identical except for different digits at the end, then show both with the digits

Does that make sense?
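
The two rules above could be sketched roughly as follows. `wordlistLemmas` is a hypothetical helper, not part of alpheios-core. It also makes assumptions the rules do not cover: a single numbered lemma (e.g. only domo2, or domo plus domo2) is shown without its digit, and when two or more differently numbered variants exist, every variant in that group is shown as is.

```javascript
// Hypothetical helper for illustration; not part of alpheios-core.
// Takes lemma words for a single form and returns the strings to display.
function wordlistLemmas (words) {
  // Group the original spellings by their base form (trailing digits removed).
  const groups = new Map()
  for (const word of words) {
    const base = word.replace(/\d+$/, '')
    if (!groups.has(base)) groups.set(base, new Set())
    groups.get(base).add(word)
  }

  const shown = []
  for (const [base, variants] of groups) {
    const numbered = [...variants].filter((variant) => variant !== base)
    if (numbered.length > 1) {
      // Different digits mean distinct words: keep the digits.
      shown.push(...variants)
    } else {
      // At most one numbered variant: show a single entry without the digit.
      shown.push(base)
    }
  }
  return shown
}

console.log(wordlistLemmas(['domo', 'domo1']))
console.log(wordlistLemmas(['domo1', 'domo2']))
```

With these inputs, domo and domo1 collapse into one entry, while domo1 and domo2 are both kept with their digits.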

kirlat commented 3 years ago

> if 2 lemmas, identical except for a 1 at the end, then show only one, without the 1
> if 2 lemmas, identical except for different digits at the end, then show both with the digits

By saying "show" do you mean that we should adjust the display logic rather than that of a disambiguation?

Let's consider the first scenario, where one word has a 1 at the end and the other has no digits. Right now such words will be disambiguated, and we'll keep the variant with the 1. Because of that, during the display phase we cannot choose what to show, because the other value (without the digit) is lost during the disambiguation. Do you think we should keep both variants and decide which one to display? Or should we disambiguate as we do now and then strip 1 while displaying it in the wordlist? In that case we should probably mark this lexeme somehow to indicate that the digit needs to be removed during the display phase.

In the second scenario, I believe, two words will not be disambiguated and it will work as described, but I need to verify that.

balmas commented 3 years ago

Actually @kirlat this is not a question of disambiguation at all. It's about what lemmas we actually display in the wordlist. So yes to the following:

> Or should we disambiguate as we do now and then strip 1 while displaying it in the wordlist? In that case we should probably mark this lexeme somehow to indicate that the digit needs to be removed during the display phase.

I'm not sure if we need to mark the lexeme at all -- couldn't the decision be made by the display component? That's essentially what happens in the popup -- e.g. see

https://github.com/alpheios-project/alpheios-core/blob/master/packages/components/src/vue/components/morph-parts/principal-parts.vue#L10-L13

which calls the lemma.displayWord function

kirlat commented 3 years ago

Thanks for explaining! So I'll apply lemma.displayWord() to the lemmas that are displayed in the word list.

irina060981 commented 2 years ago

This is implemented. @monzug, please check this.

monzug commented 2 years ago

tested some of the above Latin words. all fixed. yeahh

Screen Shot 2021-11-29 at 3 17 14 PM