Yomguithereal / talisman

Straightforward fuzzy matching, information retrieval and NLP building blocks for JavaScript.
https://yomguithereal.github.io/talisman/
MIT License
FONEM Rule V-10 #175

Open gewy opened 4 years ago

gewy commented 4 years ago

Hi, Rule V-10 seems to be incorrect. The paper says: "Replace Y by I except if Y is between two vowels". TYOU and YOU should give TIOU and IOU instead of being left unchanged. Regards
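For illustration, the rule as quoted from the paper can be sketched without any regex. This is a minimal sketch, not talisman's actual code: `applyV10` is a hypothetical helper, and treating Y itself as a vowel is an assumption borrowed from the regexes proposed in this thread.

```javascript
// Hypothetical helper, not talisman's actual implementation.
// Assumption: the vowel set includes Y, as in the regexes from this thread.
const VOWELS = new Set(['A', 'E', 'I', 'O', 'U', 'Y']);

function applyV10(name) {
  const chars = name.toUpperCase().split('');

  return chars
    .map((char, i) => {
      if (char !== 'Y') return char;

      // A word boundary counts as "not a vowel".
      const vowelBefore = i > 0 && VOWELS.has(chars[i - 1]);
      const vowelAfter = i < chars.length - 1 && VOWELS.has(chars[i + 1]);

      // Y is kept only when it sits between two vowels.
      return vowelBefore && vowelAfter ? char : 'I';
    })
    .join('');
}

console.log(applyV10('TYOU'));   // TIOU
console.log(applyV10('YOU'));    // IOU
console.log(applyV10('POUYEZ')); // POUYEZ
```

Neighbours are read from the original string, so a replacement made earlier in the word does not influence a later decision.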

Yomguithereal commented 4 years ago

Hello @gewy. Would you have some time to open a PR on the subject along with a unit test?

Yomguithereal commented 4 years ago

Hello @gewy. I just pushed a commit fixing rule V-10. I had to interpret some details of the paper to make this work because the way the algorithm is described is not completely sound. What do you think of the solution?

gewy commented 4 years ago

Hi, my implementation in Java: new Rule("V-10", "(?<=^|[^aeiouy])y|y(?=[^aeiouy]|$)", "I"); Testing for vowels is not necessary IMHO. Having a consonant (or ^ / $) on one side is enough to prove that we don't have vowels on both sides.

BTW I will check, but I am not sure that C-27 and C-28 are correct either.

Yomguithereal commented 4 years ago

new Rule("V-10", "(?<=^|[^aeiouy])y|y(?=[^aeiouy]|$)", "I");

Unfortunately JavaScript does not support lookbehind assertions in regex (at least not all engines, since lookbehinds were added recently to the specs).
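Whether a given engine accepts lookbehinds can be checked at runtime. The snippet below is a minimal sketch; `supportsLookbehind` is a hypothetical helper name, and the result depends on the engine running it.

```javascript
// Hypothetical helper: returns true when the current engine can compile
// a lookbehind assertion, false when the pattern is rejected.
function supportsLookbehind() {
  try {
    // Built from a string so that an unsupported engine throws at runtime
    // instead of failing to parse the whole script.
    new RegExp('(?<=a)b');
    return true;
  } catch (error) {
    return false;
  }
}

console.log(supportsLookbehind());
```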

BTW I will check, but I am not sure that C-27 and C-28 are correct either.

Fair enough. Tell me when you know and I'll make the required changes on my side.

gewy commented 4 years ago

Does new Rule("V-10", "(^|[^aeiouy])y|y([^aeiouy]|$)", "$1I$2"); not work in JS?
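As a quick check, the capture-group pattern from the comment above can be tried directly with `String.prototype.replace`. This is only a sketch: `applyV10WithGroups` is a hypothetical name, and uppercasing the input first is an assumption made for readability.

```javascript
// Same pattern as in the comment above, written as a JavaScript regex literal.
// Assumption: input is uppercased first; the g flag replaces every match.
const V10 = /(^|[^AEIOUY])Y|Y([^AEIOUY]|$)/g;

function applyV10WithGroups(name) {
  // A group that did not take part in the match expands to an empty string,
  // so '$1I$2' works for both alternatives.
  return name.toUpperCase().replace(V10, '$1I$2');
}

console.log(applyV10WithGroups('TYOU'));    // TIOU
console.log(applyV10WithGroups('YOU'));     // IOU
console.log(applyV10WithGroups('RAYMOND')); // RAIMOND
console.log(applyV10WithGroups('POUYEZ'));  // POUYEZ
```

The neighbouring character is consumed and written back through `$1` or `$2`, which is what stands in for the lookbehind and lookahead of the Java version.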

gewy commented 4 years ago

C-27: the document says Z with a vowel BEFORE it, and your regex is Z(?=${V})
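The difference between the two readings can be seen on the names discussed in this thread. This is a sketch under assumptions: the vowel class `V` includes Y as in V-10, the Z to S replacement is inferred from the POUYEZ -> POUYES example, and both function names are hypothetical. Only this single rule is applied, so the outputs are not full FONEM codes.

```javascript
// Assumptions: vowel class as in V-10, and Z -> S as the replacement.
const V = '[AEIOUY]';

// Reading 1: Z followed by a vowel (lookahead), as in the regex quoted above.
function c27FollowedByVowel(name) {
  return name.replace(new RegExp(`Z(?=${V})`, 'g'), 'S');
}

// Reading 2: Z preceded by a vowel. The vowel is captured and written back,
// so no lookbehind is needed.
function c27PrecededByVowel(name) {
  return name.replace(new RegExp(`(${V})Z`, 'g'), '$1S');
}

for (const name of ['OZOUADE', 'POUYEZ']) {
  console.log(name, c27FollowedByVowel(name), c27PrecededByVowel(name));
}
// OZOUADE OSOUADE OSOUADE
// POUYEZ POUYEZ POUYES
```

The two readings agree when Z sits between vowels and only diverge for a Z at the end of a word or before a consonant.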

gewy commented 4 years ago

C-28 excludes SS between vowels; your regex checks the right side only (cf. V-10)

Yomguithereal commented 4 years ago

I have simplified rule V-10 as per your suggestion. Concerning C-27, I have a question of interpretation: should OZOUADE end up as OSWADE then? (I am fine with this.) But should POUYEZ become POUYES as per C-27? (I am less fine with this.) Sorry if this is obvious, but I have not read this paper in a very long time.

Yomguithereal commented 4 years ago

I have updated rule C-28.

gewy commented 4 years ago

Same feeling about the rules. Anyway, I was using uppercase and lowercase to easily identify the applied rules in my testing. Then I added the CASE_INSENSITIVE property to the matcher object.
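The uppercase/lowercase trick described above could look roughly like this in JavaScript. It is a sketch, not either author's code: `RULES` and `encodeWithTrace` are hypothetical names, and only V-10 is listed because the other rule patterns are not given in this thread.

```javascript
// Hypothetical rule table: only V-10 is included, using the capture-group
// pattern from this thread. Patterns are case-insensitive and replacements
// are uppercase, so letters written by a rule stand out in the trace.
const RULES = [
  ['V-10', /(^|[^aeiouy])y|y([^aeiouy]|$)/gi, '$1I$2']
];

function encodeWithTrace(name) {
  let code = name.toLowerCase();
  const trace = [];

  for (const [id, pattern, replacement] of RULES) {
    const next = code.replace(pattern, replacement);

    // Record the rule only when it actually changed something.
    if (next !== code) trace.push(`(${id})${next}`);
    code = next;
  }

  return {code: code.toUpperCase(), trace};
}

const result = encodeWithTrace('RAYMOND');
console.log(result.trace.join(' ')); // (V-10)raImond
console.log(result.code);            // RAIMOND
```

Because the patterns ignore case, a letter produced by one rule can still be matched by a later rule, while the trace keeps showing which letters were rewritten.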

On Wed, 2 Sep 2020 at 17:22, Guillaume Plique wrote:

Also, I rely on some weird ad hoc rule ordering because the paper's rules were not carefully thought out, but you seem to rely on an uppercase/lowercase trick to do the same. Do you find it easier that way?

Yomguithereal commented 4 years ago

So what did you choose regarding C-27? Do you get POUYES?

gewy commented 4 years ago

Well, with every phonetic algorithm for family names I have tested, I found counterexamples. If you change a rule to fix one case, you will probably trigger other weird cases.

gewy commented 4 years ago

Yes, I try to apply the rules strictly as they are in the document (or as I understand them...). Anyway, I am more disturbed by these cases:

MAINARD -> MINAR
MENNAR -> MENAR
MEINNART -> MEINAR
RAIMOND -> RINON
RAYMOND -> RAIMON

This may be linked to the rule order:

(V-18)rINond[rINon]RAIMOND -> RINON
(V-10)raImond[raImon]RAYMOND -> RAIMON

If I put V-10 before V-18:

(V-18)rINond[rINon]RAIMOND -> RINON
(V-10)raImondrINond[rINon]RAYMOND -> RINON

Anyway:

REIMON -> REIMON
(C-28)remont[remon]REMMONT -> REMON
REMON -> REMON

Anyway, I still have not validated the choice to use this algorithm.

Yomguithereal commented 4 years ago

Yes, this algorithm is not very good outside of its original goal of matching names from Saguenay etc. I am working on a personal algorithm for French that is way better but is geared toward keeping vocalization.