UAlbertaALTLab / morphodict

The Language Independent Intelligent Dictionary
https://morphodict.readthedocs.io/
Apache License 2.0

morphodict: specify interface for "wordform normalization" #477

Open eddieantonio opened 4 years ago

eddieantonio commented 4 years ago

strip_cree_diacritics currently exists in many places; we'll need to find where it is used, what its interface is, and split it in two places:

aarppe commented 4 years ago

Shouldn't we consider "undoing/removing/stripping" the diacritics simply one component of fuzzy search? That is, treat the conversion of characters with a macron/circumflex to ones without them (or vice versa) as edits, with a cost that is lower than normal? Then this can be considered alongside other edits (e.g. ch -> c, u(h) -> a/â, ee -> î), each with its own cost.

If so, might it make sense to make use of a weighted (H)FST based on weighted rewrite rules? This is relatively simple to make (I've got a trial version). I'm not sure one would necessarily want to compose that directly with the actual FST - this could be a case for virtual composition.
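To make the "diacritics as cheap edits" idea concrete, here is a minimal sketch in plain Python rather than a weighted (H)FST. It is not the trial version mentioned above: the cost values and the names `DIACRITIC_COST`, `substitution_cost`, and `weighted_edit_distance` are assumptions chosen for illustration. It only handles single-character edits, so multi-character rewrites such as ch -> c or ee -> î are not covered.

```python
import unicodedata

# Hypothetical costs, chosen only for illustration.
DIACRITIC_COST = 0.1  # adding or removing a macron/circumflex
DEFAULT_COST = 1.0    # any other insertion, deletion, or substitution


def strip_diacritics(text: str) -> str:
    """Remove combining marks, e.g. 'â' and 'ā' both become 'a'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def substitution_cost(a: str, b: str) -> float:
    """Cost of replacing character a with character b."""
    if a == b:
        return 0.0
    if strip_diacritics(a) == strip_diacritics(b):
        return DIACRITIC_COST
    return DEFAULT_COST


def weighted_edit_distance(source: str, target: str) -> float:
    """Levenshtein distance with a reduced cost for diacritic changes."""
    # Compose first so that each accented letter is a single character.
    source = unicodedata.normalize("NFC", source)
    target = unicodedata.normalize("NFC", target)

    previous = [j * DEFAULT_COST for j in range(len(target) + 1)]
    for i, s in enumerate(source, start=1):
        current = [i * DEFAULT_COST]
        for j, t in enumerate(target, start=1):
            current.append(
                min(
                    previous[j] + DEFAULT_COST,               # delete s
                    current[j - 1] + DEFAULT_COST,            # insert t
                    previous[j - 1] + substitution_cost(s, t),  # substitute
                )
            )
        previous = current
    return previous[-1]
```

With these assumed costs, a query that differs from a wordform only by a missing circumflex ranks far closer than one that differs by an unrelated letter, which is the behaviour the weighted rewrite rules would encode.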

eddieantonio commented 4 years ago

> Shouldn't we consider "undoing/removing/stripping" the diacritics simply one component of fuzzy search? That is, treat the conversion of characters with a macron/circumflex to ones without them (or vice versa) as edits, with a cost that is lower than normal? Then this can be considered alongside other edits (e.g. ch -> c, u(h) -> a/â, ee -> î), each with its own cost.
>
> If so, might it make sense to make use of a weighted (H)FST based on weighted rewrite rules? This is relatively simple to make (I've got a trial version). I'm not sure one would necessarily want to compose that directly with the actual FST - this could be a case for virtual composition.

This is not relevant to the issue opened here. This issue concerns splitting the existing functionality between a language-agnostic socket and the existing language-specific code. A refactoring should not affect functionality or implementation.
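As a rough illustration of the split described here, the sketch below separates a language-agnostic "socket" from a language-specific implementation plugged into it. All names (`WordformNormalizer`, `IdentityNormalizer`, `CreeDiacriticStripper`, `search_key`) are hypothetical and do not reflect morphodict's actual code, and the stripping logic is a generic stand-in, not the existing `strip_cree_diacritics` implementation.

```python
import unicodedata
from typing import Protocol


class WordformNormalizer(Protocol):
    """Language-agnostic socket: anything with a normalize() method fits."""

    def normalize(self, wordform: str) -> str:
        ...


class IdentityNormalizer:
    """Default for languages that need no normalization."""

    def normalize(self, wordform: str) -> str:
        return wordform


class CreeDiacriticStripper:
    """Language-specific plug (stand-in logic, assumed for illustration)."""

    def normalize(self, wordform: str) -> str:
        decomposed = unicodedata.normalize("NFD", wordform)
        return "".join(
            ch for ch in decomposed if not unicodedata.combining(ch)
        )


def search_key(wordform: str, normalizer: WordformNormalizer) -> str:
    """Language-agnostic caller: it never mentions Cree or diacritics."""
    return normalizer.normalize(wordform).lower()
```

Under this arrangement the generic code depends only on the `normalize()` signature, so the existing behaviour can move behind it without changing what callers observe.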