UAlbertaALTLab / morphodict

Plains Cree Intelligent Dictionary
https://itwewina.altlab.app/
Apache License 2.0

Searches result in duplicate dictionary entries #374

Open aarppe opened 4 years ago

aarppe commented 4 years ago

I'm seeing at least a few cases where a search presents the same Cree dictionary entry twice, e.g. searching for tan'si shows tânisi twice:

image

Of course, the dictionary entry for tânisi should be shown only once.

However, searching with tânisi gives only one result, as expected:

aarppe commented 4 years ago

@Madoshakalaka This is one more for you.

Madoshakalaka commented 4 years ago

@aarppe sure thing :rofl:

eddieantonio commented 4 years ago

@Madoshakalaka, I think this is because results from the FST aren't being de-duplicated. See here:

$ echo "tan'si" | hfst-optimized-lookup crk-descriptive-analyzer.hfstol
tan'si  tânisi+Ipc+Err/Orth
tan'si  tânisi+Ipc

This one is tricky because, according to the descriptive analyzer, there are TWO valid analyses with different tags. @aarppe, how should we handle this situation? According to the FST, having two results for tan'si is correct: it yields two results with different analyses!
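To make the de-duplication idea concrete, here is a minimal sketch, not morphodict's actual code. It assumes the lookup output is tab-separated (surface form, analysis, optional weight), that the lemma comes first in the analysis string, and that +Err/Orth can be ignored when comparing analyses. The function names are hypothetical.

```python
# Hypothetical sketch: collapse FST analyses that differ only by an
# orthographic-error tag. None of these names come from morphodict.

IGNORED_TAGS = {"Err/Orth"}  # assumption: this tag is irrelevant for entry lookup


def parse_lookup_output(raw: str) -> list[str]:
    """Extract the analysis column from lines like 'surface<TAB>analysis[<TAB>weight]'."""
    analyses = []
    for line in raw.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2 and fields[1].strip():
            analyses.append(fields[1].strip())
    return analyses


def normalize(analysis: str) -> str:
    """Drop ignored tags; assumes the lemma is the first '+'-separated field."""
    lemma, *tags = analysis.split("+")
    return "+".join([lemma] + [t for t in tags if t not in IGNORED_TAGS])


def dedupe_analyses(analyses: list[str]) -> list[str]:
    """Return normalized analyses in first-seen order, without repeats."""
    seen = set()
    unique = []
    for analysis in analyses:
        key = normalize(analysis)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


raw_output = "tan'si\ttânisi+Ipc+Err/Orth\ntan'si\ttânisi+Ipc\n"
print(dedupe_analyses(parse_lookup_output(raw_output)))  # ['tânisi+Ipc']
```

Whether +Err/Orth should really be discarded this way is exactly the open question here; the sketch only shows what the result would look like if it were.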

aarppe commented 4 years ago

The latest FST in fact gives three analyses:

$ echo "tan'si" | hfst-lookup -q src/analyser-gt-desc.hfst
tan'si  tânisi+Ipc+Err/Orth 0.000000
tan'si  tânisi+Ipc  0.000000
tan'si  tânisi+Ipc+Interj   0.000000

This is particularly tricky because the spell-relax corrections are tagged with +Err/Orth, and swapping an apostrophe for a short i is one of the spell-relax rules.

I had previously revised the LEXC file of non-standard forms to include only those spelling deviations that cannot be handled by spell-relax rules, in order to avoid double analyses. I'll comment out the tan'si form in src/morphology/stems/non_standard.lexc.

Trickier still, there are two legitimate lemmas for tânisi: one is an interrogative/adverbial particle (the first one below), and the other is an interjection (the second one below):

tânisi  IPC how, in what way
tânisi  IPJ hello, how are you

So we'd have to use a feature pair (as in the +Ipc+Interj analysis above) to match the second one, and the lack of any additional features (an exact match) to match the first one.
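A rough sketch of that matching rule, for illustration only. It assumes IPC corresponds to exactly +Ipc and IPJ to exactly +Ipc+Interj; that mapping, the Entry class, and match_entry are all hypothetical, not taken from the dictionary code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    lemma: str
    word_class: str
    definition: str


# Assumption: each word class corresponds to one exact set of FST tags.
WORD_CLASS_TAGS = {
    "IPC": frozenset({"Ipc"}),
    "IPJ": frozenset({"Ipc", "Interj"}),
}

ENTRIES = [
    Entry("tânisi", "IPC", "how, in what way"),
    Entry("tânisi", "IPJ", "hello, how are you"),
]


def match_entry(analysis: str, entries: list[Entry]) -> Optional[Entry]:
    """Return the entry whose lemma and exact tag set equal those of the analysis."""
    lemma, *tags = analysis.split("+")
    tag_set = frozenset(tags)
    for entry in entries:
        if entry.lemma == lemma and WORD_CLASS_TAGS.get(entry.word_class) == tag_set:
            return entry
    return None


print(match_entry("tânisi+Ipc", ENTRIES).word_class)         # IPC
print(match_entry("tânisi+Ipc+Interj", ENTRIES).word_class)  # IPJ
```

Because the comparison is on the whole tag set, a bare +Ipc analysis cannot accidentally match the interjection entry, and an analysis still carrying +Err/Orth matches nothing until that tag is handled separately.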

nienna73 commented 2 years ago

This behaviour is no longer the same. In fact, the above definitions for tan'si are no longer in the dictionary.

fbanados commented 1 month ago

As part of recent fixes in crk-db, definitions are returning, and the issue remains to be addressed. If I am reading the ongoing discussion in UAlbertaALTLab/crk-db#119 and the last comment by @aarppe correctly, there should not be repeated definitions as presented in the issue image, but there should still be two entries: one for IPC and one for IPJ. That would leave the sense "how, in what way" from CW in its own entry (IPC) and the "how are you" sense in a separate entry (IPJ). As the data in the MD dictionary stands, the MD definition would be merged into the IPC entry, as follows:

Screenshot 2024-07-08 at 2 45 34 PM

My gut feeling is that this is not the expected match for the MD entry, and that it should go in the other (IPJ) entry instead. To achieve that, it would be sufficient to change the FST analysis for the entry in Maskwacis.tsv to add the +Interj tag, provided the discussion in UAlbertaALTLab/crk-db#119 is resolved as IPJ == +Ipc+Interj.
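The effect of that change can be sketched as follows, assuming senses from different sources are grouped by their full FST analysis. The row layout, the placeholder MD sense text, and merge_senses are illustrative assumptions; this is not the real Maskwacis.tsv format or the importer's code.

```python
from collections import defaultdict

# Hypothetical rows: (source, FST analysis, sense text).
CW_ROWS = [
    ("CW", "tânisi+Ipc", "how, in what way"),
    ("CW", "tânisi+Ipc+Interj", "hello, how are you"),
]


def merge_senses(rows):
    """Group (source, sense) pairs under the analysis they were tagged with."""
    merged = defaultdict(list)
    for source, analysis, sense in rows:
        merged[analysis].append((source, sense))
    return dict(merged)


# MD row tagged +Ipc: its sense lands in the IPC entry.
before = merge_senses(CW_ROWS + [("MD", "tânisi+Ipc", "<MD sense>")])

# MD row retagged with +Interj: its sense lands in the IPJ entry.
after = merge_senses(CW_ROWS + [("MD", "tânisi+Ipc+Interj", "<MD sense>")])

print([source for source, _ in before["tânisi+Ipc"]])        # ['CW', 'MD']
print([source for source, _ in after["tânisi+Ipc+Interj"]])  # ['CW', 'MD']
```

Either way there are still two entries; only the tag on the MD row decides which one receives its sense.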

fbanados commented 1 month ago

Updates to merging that use the actual analysis help fix the sense-matching issue:

Screenshot 2024-07-09 at 12 31 20 PM