Closed monzug closed 2 years ago
please make the fix in a branch off of the incr-3.3.x branch.
This issue seems to be related to the usage of Treebank data. It does not occur when the Treebank data is not involved.
I am able to reproduce it and am working on a fix.
Yes, it's treebank related. see also #599
The variant returned by the Treebank is `omnis1`, and due to how our current disambiguation works, we prefer that variant for the `lemma.word` over the Tufts value of `omnis`.
The disambiguation function that decides this is https://github.com/alpheios-project/alpheios-core/blob/incr-3.3.x/packages/data-models/src/lemma.js#L221-L266. Neither word has mixed case and neither needs normalization, so we skip any logic related to that. But the `otherLemma` (which is the treebank's value) happens to have a digit at the end, so we prefer it over the Tufts value: https://github.com/alpheios-project/alpheios-core/blob/incr-3.3.x/packages/data-models/src/lemma.js#L258.
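The digit preference described above can be illustrated with a simplified sketch. This is not the code in `lemma.js`: `hasTrailingDigit` and `pickWord` are hypothetical names, and the mixed-case and normalization checks are left out because they do not apply to this example.

```javascript
// Hypothetical helper: true if the word ends in one or more digits.
function hasTrailingDigit (word) {
  return /\d+$/.test(word)
}

// Hypothetical, simplified stand-in for the disambiguation choice:
// if the other (treebank) word equals the original word plus a trailing
// digit, prefer the other word; otherwise keep the original word.
function pickWord (word, otherWord) {
  const stripped = otherWord.replace(/\d+$/, '')
  if (hasTrailingDigit(otherWord) && stripped === word) {
    return otherWord
  }
  return word
}

console.log(pickWord('omnis', 'omnis1')) // omnis1
```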
With `lexeme.setDisambiguation()` (https://github.com/alpheios-project/alpheios-core/blob/incr-3.3.x/packages/data-models/src/lexeme.js#L238-L243) we replace the `lemma.word` value only; the `lemma.principalParts` field is not updated and remains the same as in the Tufts results (i.e. with `omnis`, not `omnis1`).
As a result of the disambiguation process, the first lexeme of the word contains the following information:
`lemma.word`: `'omnis1'`
`lemma.principalParts`: `['omnis', 'omnis', 'omne']`
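A minimal sketch of how this mismatch arises, using plain objects rather than the real `Lemma` and `Lexeme` classes. `applyDisambiguation` is a hypothetical stand-in for the effect described above, not the actual `setDisambiguation()` implementation, and the empty `principalParts` on the treebank object is an assumption made for illustration.

```javascript
// Plain objects standing in for the real Lemma class (an assumption).
const tuftsLemma = { word: 'omnis', principalParts: ['omnis', 'omnis', 'omne'] }
const treebankLemma = { word: 'omnis1', principalParts: [] }

// Hypothetical stand-in: only `word` is replaced,
// while `principalParts` is left exactly as it was.
function applyDisambiguation (lemma, disambiguated) {
  return { ...lemma, word: disambiguated.word }
}

const result = applyDisambiguation(tuftsLemma, treebankLemma)
console.log(result.word) // omnis1
console.log(result.principalParts.join(', ')) // omnis, omnis, omne
```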
Why do we not see `omnis1` in the popup? Because we display only the principal parts there. But we do see it in the wordlist, where the `lemma.word` fields are used.
This does not happen when there is no treebank data on the page: we simply take the Tufts value of `omnis` and the disambiguation does not occur.
I'm not sure what the correct course of action would be here, especially considering that the change may affect many use cases in potentially multiple languages. @balmas, @monzug, what algorithm would help to fix that?
Unfortunately I don't know what the algorithm could be. I thought it was related to #599, but it might not be. I cannot give much input here, except another example
and one from lookup
@kirlat for #141 we introduced a `displayWord` method on Lemma, which produces a version of the lemma stripped of trailing digits for display purposes only. As you noted, we need to keep the original lemma with the digit intact, for matching against treebanks and dictionaries, but the digit is meaningless for most end-users.
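As a rough illustration of the idea, stripping trailing digits for display could look like the sketch below. `toDisplayWord` is a hypothetical standalone function, not the actual `displayWord` implementation on `Lemma`; the point is that the stored word stays untouched and only the displayed string changes.

```javascript
// Hypothetical helper: returns the word without trailing digits.
// The original string is not modified, so it can still be used
// for matching against treebanks and dictionaries.
function toDisplayWord (word) {
  return word.replace(/\d+$/, '')
}

const storedWord = 'omnis1'
console.log(toDisplayWord(storedWord)) // omnis
console.log(storedWord) // omnis1
```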
I think the wordlist could just use this function on lemma to display the lemma without the digit. I think we might want to make the logic slightly more sophisticated to handle the case of multiple lemmas with different digits for a single form.
For example, say `domuisse` had both `domo1` and `domo2` as lemmas; then we would probably want to keep those digits rather than stripping and deduping, because they indicate to the user that there are two distinct words.
But `domo` and `domo1` could be deduped.
So, with the following rules:
- if 2 lemmas are identical except for a 1 at the end, then show only one, without the 1
- if 2 lemmas are identical except for different digits at the end, then show both with the digits
Does that make sense?
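The two rules could be sketched roughly as follows. This is only an illustration under stated assumptions: `wordlistLemmas` is a hypothetical function, it treats any single numbered variant (not only a trailing 1) as strippable, and when several numbered variants share a base it keeps every distinct variant as-is. None of this is taken from the actual wordlist code.

```javascript
// Hypothetical sketch of the display rules for a list of lemma words.
function wordlistLemmas (words) {
  // Group the words by their base form (trailing digits removed).
  const groups = new Map()
  for (const word of words) {
    const base = word.replace(/\d+$/, '')
    if (!groups.has(base)) groups.set(base, new Set())
    groups.get(base).add(word)
  }

  const shown = []
  for (const [base, variants] of groups) {
    const numbered = [...variants].filter(w => w !== base)
    if (numbered.length > 1) {
      // Different digits mark distinct words: keep them as they are.
      shown.push(...variants)
    } else {
      // At most one numbered variant: show the base form once.
      shown.push(base)
    }
  }
  return shown
}

console.log(wordlistLemmas(['domo', 'domo1'])) // [ 'domo' ]
console.log(wordlistLemmas(['domo1', 'domo2'])) // [ 'domo1', 'domo2' ]
```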
> if 2 lemmas, identical except for a 1 at the end, then show only one, without the 1
> if 2 lemmas, identical except for different digits at the end, then show both with the digits
By saying "show", do you mean that we should adjust the display logic rather than the disambiguation logic?
Let's consider the first scenario, when one word has `1` at the end and the other has no digits. Right now such words will be disambiguated, and we'll keep the variant with the `1`. Because of that, during the display phase we cannot choose what to show, because the other value (without the digit) will be lost during the disambiguation. Do you think we should keep both variants and decide which one to display? Or should we disambiguate as we do now and then strip `1` while displaying it in the wordlist? In that case we should probably mark this lexeme somehow to indicate that the digit needs to be removed during the display phase.
In the second scenario, I believe, two words will not be disambiguated and it will work as described, but I need to verify that.
Actually @kirlat this is not a question of disambiguation at all. It's about what lemmas we actually display in the wordlist. So yes to the following:
> Or should we disambiguate as we do now and then strip 1 while displaying it in the wordlist? In that case we should probably mark this lexeme somehow to indicate that the digit needs to be removed during the display phase.
I'm not sure if we need to mark the lexeme at all -- couldn't the decision be made by the display component? That's essentially what happens in the popup -- e.g. see
which calls the lemma.displayWord function
Thanks for explaining! So I'll apply `lemma.displayWord()` to the lemmas that are displayed in the word list.
This is implemented - @monzug , check this please
tested some of the above Latin words. all fixed. yeahh
Remove numbers from lemma in wordlist; see #599 for reference. Double-click on omnia in https://texts-test.alpheios.net/text/urn:cts:latinLit:phi0959.phi006.alpheios-text-lat1/passage/1.1-1.30: the lemma is shown as omnis1, omne, omnia in the wordlist. Also, when limiting to lemma, the number is still in the lemma. (Note that the lemma in the wordlist from a lookup word does not have any number.)
lemma in wordlist from lookup word