LuteOrg / lute-v3

LUTE = Learning Using Texts: learn languages through reading. Python/Flask.
MIT License

Possibly ignore word accents when saving terms in DB #2

Open jzohrab opened 1 year ago

jzohrab commented 1 year ago

Notes from a slack chat:

I will attempt to articulate what I think the deal breaker could potentially be without getting into the weeds of how Ancient Greek actually works. You may have noticed that the words contain accented characters. There are various features of the language that cause those accents to change without changing anything about the meaning of the word. For example, I would have to define γὰρ twice because it can appear as either γὰρ OR γάρ. That's one of the most common words in the language meaning something like "for" or "since" or "because". Now, it only comes in those two flavors but I think you could see how quickly it would become tedious to define words over and over just because of diacritics.

With respect to this question of accents, I have noticed that Chrome is accent-agnostic when it does its "find in page" search. If I search for γὰρ, it will highlight γάρ and even γαρ. Perhaps Lute could have a language-specific option to ignore accents as well?

Let's continue with the example of γὰρ. When I am reading the text, I would still see the orthography displayed as the author intended, but behind the scenes in the database, as far as Lute is concerned, γὰρ, γάρ, and γαρ would share the same entry.

I know the original LWT let you do character substitutions, but it actually just hot-swapped one character for another, and that was reflected in the actual text you were reading. Basically, it would see the character set as consisting of only 24 characters (not accounting for uppercase): the unaccented Greek alphabet.

My thoughts:

Rendered TextTokens (i.e., words shown in the reading pane) would include the accents, but Terms (stored in the db) would be without accents, and the rendered TextTokens would be associated to Terms w/o accents.

No idea at the moment if this would be tough or not!
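A rough sketch of the TextToken/Term split described above, using only Python's standard `unicodedata` module. The names `strip_accents`, `term_key`, and the in-memory `terms` dict are hypothetical stand-ins for illustration, not Lute's actual code or schema. The assumption is that a Term would be stored under an accent-free, lowercased key while the reading pane keeps showing the original token.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks (accents, breathings, macrons) from text."""
    # NFD splits each accented character into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() is non-zero for combining marks, so keep the rest.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def term_key(token: str) -> str:
    """Hypothetical lookup key: accent-free and lowercased."""
    return strip_accents(token).lower()


# Hypothetical stand-in for the Terms table: key -> definition.
terms = {}
terms[term_key("γὰρ")] = "for, since, because"

# The displayed tokens keep their accents; only the lookup key is normalized.
for token in ["γὰρ", "γάρ", "γαρ"]:
    print(token, "->", terms.get(term_key(token)))
```

With a key like this, all three spellings would resolve to the same entry. Whether the stripping should apply to every language or only when a per-language option is enabled is left open here.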

firion1234 commented 1 year ago

There is a consideration that does, in fact, complicate the implementation of such a feature. I suspect the overwhelming majority of Ancient Greek dictionaries will not know what to do with unaccented words. Thus you are faced with the opposite of the problem that Latin presents: most Latin dictionaries will reject the macrons. How do you specify which word to send to the dictionary? The accented one or the unaccented one?

jzohrab commented 1 year ago

Some misc notes only:

For sending accented vs unaccented - no idea - maybe the code that cycles through the dictionaries could send one and then the other. Not clear though.
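One way the "send one and then the other" idea could look, as a sketch only. Dictionaries are modelled here as plain callables that return a definition or `None`; that interface, and the names `lookup_forms` and `first_hit`, are assumptions made for illustration and do not reflect how Lute actually talks to dictionaries.

```python
import unicodedata
from typing import Callable, Iterator, Optional, Sequence, Tuple

# Hypothetical dictionary interface: takes a word, returns a definition or None.
Lookup = Callable[[str], Optional[str]]


def strip_accents(text: str) -> str:
    """Remove combining marks (accents, macrons) from text."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def lookup_forms(term: str) -> Iterator[str]:
    """Yield the term as written, then its accent-free form if that differs."""
    yield term
    bare = strip_accents(term)
    if bare != term:
        yield bare


def first_hit(term: str, dictionaries: Sequence[Lookup]) -> Optional[Tuple[str, str]]:
    """Try each dictionary with each form; return (form_sent, definition)."""
    for lookup in dictionaries:
        for form in lookup_forms(term):
            result = lookup(form)
            if result is not None:
                return form, result
    return None


# Toy dictionaries: the Greek one only knows the accented word,
# the Latin one only knows the word without its macron.
GREEK_ENTRIES = {"γάρ": "for, since, because"}
LATIN_ENTRIES = {"amo": "I love"}
greek = GREEK_ENTRIES.get
latin = LATIN_ENTRIES.get

print(first_hit("γάρ", [greek, latin]))
print(first_hit("amō", [greek, latin]))
```

Trying the accented form first suits the Greek case, and falling back to the bare form suits the Latin one. A dictionary that only indexes γάρ would still miss a query for γὰρ, so this fallback alone would not cover accent variants of the same word.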

M-Biggles commented 5 months ago

#203, I believe, would take care of this problem for the user and would hopefully be less of a change in other ways. The diacritics could be kept, and the variants treated as different words and linked to each other.

I just saw this older issue and thought it might have an easier solution.