geajack / Wordology

A WebExtensions browser extension for aiding language learning.
MIT License
20 stars 9 forks

My various ideas #2

Closed: marbuljon closed this issue 5 years ago

marbuljon commented 6 years ago

Hi! I'm having a problem where most words can't be recognized. First I'd like to note that this is an issue that does affect ALL "European" languages, I'll post examples below.

English is a language where you can, technically, put every single word into the dictionary one-by-one and use it that way because people almost never make up new words (as in, English's compound words are sort of set in stone). Still, it'd help the user and speed up the dictionary-making process greatly if you could put in prefix and suffix meanings separately (ex. "magically" = the three "words" magic-al-ly, but you need an option to note that "al" and "ly" won't ever come at the beginning of a word. "Language" = "langu-age", you'd again need to note "age" is always a suffix so it won't be confused with the verb "to age" etc). The same feature can be used to note what verb endings, plural endings etc. mean, ex. "eaten" would be "eat-en" and whenever you forgot what that "en" meant you could hover over it, so I think it'd be really useful.

But let's pretend I have the Icelandic (a cousin of German and Old English) word lögreglumaðurinn, it's actually 4 different words: lög (law), reglu (order), maður (human), inn (the), which ends up meaning "the policeman". I need it to be able to recognize all of these 4 separate words which I've put into the "dictionary" separately, because Icelandic constantly makes up new words on the fly from a very small original wordstock, their words aren't set in stone like English's are. Esperanto and Swedish/Norwegian/Danish are more or less exactly the same.
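The request above amounts to splitting a written word into pieces that are all already in the user's dictionary. A minimal sketch of that idea follows; `split_compound` is a hypothetical helper, not part of Wordology, and it ignores the vowel changes that some languages apply inside compounds.

```python
from functools import lru_cache


def split_compound(word, dictionary):
    """Return one split of `word` into dictionary entries, or None if there is none."""
    entries = frozenset(dictionary)

    @lru_cache(maxsize=None)
    def solve(start):
        # Reaching the end of the word means everything before it was matched.
        if start == len(word):
            return ()
        # Try longer pieces first so that long entries win over short fragments.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in entries:
                rest = solve(end)
                if rest is not None:
                    return (piece,) + rest
        return None

    result = solve(0)
    return list(result) if result is not None else None


parts = split_compound("lögreglumaðurinn", {"lög", "reglu", "maður", "inn"})
```

With the four entries from the example, `parts` comes back as the four pieces in order. A word with no complete split returns `None`, so the caller can fall back to ordinary whole-word matching.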

Within this, there are languages that change a vowel or remove a consonant when words are put together into compounds. For example, in "lögreglumaðurinn", actually that "reglu" is originally "regla" (lögreglamaðurinn). Since you already have the "guessing system" for words, if that system could work like normal when it recognizes a word within a word like this, it shouldn't be necessary to have a new feature. Otherwise it would be good to be able to define a conjugation type and then mark a word as belonging to that particular type or something, so the system can recognize the words better. Languages that can change sound in compound words like this are for example Finnish, Greenlandic, Icelandic, Swedish, Indonesian.

Indonesian is a language that has almost no prefixes and suffixes, and it doesn't compound words in the way Icelandic/Swedish/Esperanto does, but those prefixes/suffixes essentially make up the entirety of Indonesian grammar so they get used a lot. Let's pretend I have the Indonesian word "lihat" (see), it usually gets used with some prefix like "melihat" ("see-verb" = sees). I need it to recognize "lihat" even when "me" is there, as "me" is actually a separate word. It's the same concept as having "watch" in the dictionary and needing the thing to recognize the words "rewatch" and "unwatch" automatically. There are so few Indonesian prefixes that I don't really have to put them in the dictionary, so I tried adding the prefixes to the prefix blacklist in the hopes that it'd recognize the main word within, but it didn't seem to do anything.
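One way to read this request is "strip a known prefix, then look up what is left". The sketch below assumes a user-supplied prefix list; `find_stem` is a hypothetical name and this is not how Wordology's prefix blacklist behaves. It also does not handle prefixes that change the first sound of the stem.

```python
def find_stem(word, dictionary, prefixes):
    """Return the dictionary entry hidden behind an optional prefix, or None."""
    if word in dictionary:
        return word
    # Check longer prefixes first so "mem" would be tried before "me".
    for prefix in sorted(prefixes, key=len, reverse=True):
        if word.startswith(prefix):
            stem = word[len(prefix):]
            if stem in dictionary:
                return stem
    return None


indonesian = find_stem("melihat", {"lihat"}, {"me", "di"})
english = find_stem("rewatch", {"watch"}, {"re", "un"})
```

Here `indonesian` is `"lihat"` and `english` is `"watch"`, while a word whose remainder is not in the dictionary yields `None`.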

It would also be good if you could choose to select (or just type in yourself) two words, or two words with a hyphen in-between, and add them to the dictionary because there's a lot of phrases people need to learn. I don't know if that's already possible.
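Matching phrases could be sketched as a longest-match scan over the page's words, where a dictionary entry may contain spaces. The function name, the `max_words` limit, and the English sample entries are all assumptions made for illustration.

```python
def match_phrases(tokens, dictionary, max_words=3):
    """Scan tokens left to right, preferring the longest dictionary entry at each position."""
    matches = []
    i = 0
    while i < len(tokens):
        for size in range(min(max_words, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + size])
            if candidate in dictionary:
                matches.append(candidate)
                i += size  # skip past the words the phrase consumed
                break
        else:
            i += 1  # nothing in the dictionary starts at this token
    return matches


entries = {"give": "hand over", "give up": "stop trying"}
found = match_phrases(["never", "give", "up", "now"], entries)
```

Because longer candidates are tried first, "give up" is reported as one entry instead of a lone "give".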

It would also be great if you could mark a word as a spelling mistake / variation of another word. For example, Chinook Jargon has none of the problems above (no compound words written in one word) but it has no standardized spelling, so you get 10 variations of the same single word. Instead of having to search for and type in the same definition all the time, it'd be great if you could just link the word to another word (by filling in the standard spelling, which you've already put into the dictionary, for example) and have it automatically fill in the same definition. For example, you could have "cough", "kaf" and "caf", all being the same word but just spelled differently.
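The linking idea could be modelled as a small alias table next to the definitions, so each variant spelling points at one headword. This is a sketch only: the class name and the placeholder definition text are invented for illustration.

```python
class VariantDictionary:
    """Definitions keyed by headword, plus alternative spellings that point at a headword."""

    def __init__(self):
        self.definitions = {}
        self.aliases = {}

    def add(self, headword, definition):
        self.definitions[headword] = definition

    def link(self, variant, headword):
        # Only allow links to words that already have a definition.
        if headword not in self.definitions:
            raise KeyError(headword)
        self.aliases[variant] = headword

    def lookup(self, word):
        headword = self.aliases.get(word, word)
        return self.definitions.get(headword)


words = VariantDictionary()
words.add("cough", "(definition typed in once)")
words.link("kaf", "cough")
words.link("caf", "cough")
```

Looking up "kaf" or "caf" then returns the definition stored under "cough", and editing that one definition updates every spelling at once.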

Finally, a lot of people are learning more than one language, so it'd be good if you could somehow switch modes in order to have dictionaries for separate languages (or have one big dictionary but the words be marked as to which language they belong to).

On top of using Wordology for normal language studies, I plan to use it (or a similar tool — I've already tried many times but the various similar Firefox addons keep getting too outdated to be able to use) to actually make and publish dictionaries for minority languages I know that don't already have dictionaries, or for languages like Icelandic and Greenlandic that have severely incorrect dictionaries. I just have to use a tool like this to "collect all the words + meanings in the language in the wild" if you know what I mean.

If it's kinda impossible to have the add-on work for compound words like lögreglumaðurinn, my idea is that you could do something else to split up the word before the tool parses it. For example inserting ' between words, like "lög'reglu'maður'inn", and then you'd need to be able to fix the splitting if it were incorrect. For Chinese and Japanese, I have a relatively simple solution, just insert a space or split before every Chinese letter (kanji/hanzi): the Japanese 女装好きな男 (basically "aguywholikesfemaleclothes") would become 女'装'好きな'男 or 女 装 好きな 男 ("a guy wholikes female clothes") for example. It again wouldn't be perfect but would be an incredibly huge help for anyone learning those, and then you'd be able to say that your add-on works for some Asian languages as well ; D
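The "split before every Chinese letter" rule is simple enough to sketch directly. The version below assumes that checking the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) is good enough; rarer characters in the extension blocks would need extra ranges, and `split_before_han` is a hypothetical name.

```python
def split_before_han(text, separator="'"):
    """Insert `separator` before every Han character except one at the very start."""
    pieces = []
    for index, char in enumerate(text):
        is_han = "\u4e00" <= char <= "\u9fff"
        if is_han and index > 0:
            pieces.append(separator)
        pieces.append(char)
    return "".join(pieces)


marked = split_before_han("女装好きな男")
spaced = split_before_han("女装好きな男", separator=" ")
```

The two calls reproduce the apostrophe and space variants given in the example; the kana after 好 stay attached to it because only Han characters trigger a split.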

geajack commented 6 years ago

So, the basic problem here is that different languages work differently. I knew this would be an issue from the beginning, but I need lots of user feedback and, eventually, testing, in order to figure everything out, because of course I only speak a few of the world's languages. First some general thoughts on how I see the app developing:

Re: Highly synthetic languages (like Icelandic)

The idea here is to basically match words within words, rather than only considering a word to be something between two spaces. You could certainly imagine a "synthetic language mode" for languages like this. What concerns me here is the amount of processing that would be needed. Remember, when you hit the add-on button, Wordology has to process all the words on the page at once. Every single one has to be matched against your entire dictionary. If it had to look inside each word and basically try all possible ways of breaking it apart to see if anything matches, I'm worried it would end up taking like 20 seconds to process a longish page. But it would have to be tested to be sure.
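As a rough way to size this concern: a word of n letters can be cut up in 2^(n-1) different ways, but it contains only n(n+1)/2 distinct start/end pieces. A splitter that remembers which positions it has already solved needs at most that many dictionary lookups per word. The helper below just counts those pieces; its name and the optional length cap are assumptions, and it says nothing about real timings, which would still need testing.

```python
def count_candidate_pieces(word_length, max_piece_length=None):
    """Count the substrings a splitter would have to look up in the dictionary."""
    if max_piece_length is None:
        max_piece_length = word_length
    # From each start position, a piece can extend to the end of the word
    # or to the length cap, whichever comes first.
    return sum(
        min(max_piece_length, word_length - start)
        for start in range(word_length)
    )


uncapped = count_candidate_pieces(16)
capped = count_candidate_pieces(16, max_piece_length=6)
```

For a 16-letter word this gives 136 lookups without a cap, compared with 2^15 = 32768 ways of cutting the word, and capping the piece length lowers the count further.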

Re: Indonesian and prefix-based languages

In the languages I know, inflection is done mainly using suffixes. Thus, the core of the "smart matching" feature is to only consider two words to match if they start with the same letters. The reason why adding "me-" to the blacklisted prefixes didn't do anything is that the prefix blacklist is actually there to prevent matches between words with the same prefix. For example, in Polish a lot of unrelated verbs start with "prze-", because "prze-" is a general sort of intensifier prefix for verbs. Of course, some languages do things the other way around! They use prefixes to inflect words. Probably you need separate "prefix-based matching" and "suffix-based matching" modes. Something like this will very likely eventually be added.
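The two modes and the blacklist can be illustrated with a toy matcher. This is not Wordology's actual algorithm: the function names, the `anchor` switch, and the `min_shared` threshold of three letters are assumptions chosen only to show the behaviour described above.

```python
def shared_length(a, b):
    """Number of leading characters that two strings have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def smart_match(a, b, anchor="start", min_shared=3, prefix_blacklist=()):
    """Guess whether two words are forms of the same word.

    anchor="start" compares beginnings (languages that inflect with suffixes);
    anchor="end" compares endings (languages that inflect with prefixes).
    """
    # A blacklisted prefix shared by both words is ignored before comparing.
    for prefix in prefix_blacklist:
        if a.startswith(prefix) and b.startswith(prefix):
            a, b = a[len(prefix):], b[len(prefix):]
            break
    if anchor == "end":
        a, b = a[::-1], b[::-1]
    return shared_length(a, b) >= min_shared


naive = smart_match("przepisać", "przejechać")
filtered = smart_match("przepisać", "przejechać", prefix_blacklist=("prze",))
indonesian = smart_match("melihat", "lihat", anchor="end")
```

Without the blacklist the two Polish verbs match on "prze" alone; with it they no longer match. Comparing endings lets "melihat" match "lihat", which the start-anchored mode would miss.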

Re: Adding phrases

I don't really plan on adding this. From the beginning, Wordology's been intended as a tool with a pretty specific purpose: learning words. Even when I use it myself I run into this issue: in Polish, verbs often have a reflexive and non-reflexive form, depending on whether the particle "się" ("self") is put next to them. For example: wydać = to give, wydać się = to seem. I just get around this by setting a translation like "to give, rflx: to seem".

Re: Alternative spellings

I'm not sure I quite understand this one. If you know that "kaf" is a different spelling of "cough", would manually linking it to the word "cough" really be less work than just typing in the definition a second time?

Thanks for the detailed feedback in any case. Of course the app has only been out for a day, so I won't be rushing to add new features just yet, but I'm listening to the ideas.

marbuljon commented 6 years ago

Hi, thanks for your reply. Yeah, in most languages it won't be a problem, but in the main language I'll be using this add-on for, a single word can even have 20 different spellings (this includes hyphens and spaces). For example, it can be "muckamuck", "múkamúk", "muqmuq", "much much", "mak-mak", "məkymək", "mək-i-mək", "muckity much" and so on. In general the same language has a lot of hyphens and spaces that don't actually denote a different word, it's just how the person spelled it (as an English example, it'd be like writing "spell-ed" or "spe ll ed" instead of "spelled").

geajack commented 6 years ago

The hyphen thing is an interesting point and could be patched by just allowing the user to "whitelist" certain word-break characters, so you could set hyphens or apostrophes to not be considered word breaks.
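A whitelist like this could be sketched as a tokenizer whose set of "not a word break" characters is configurable. The function and parameter names are hypothetical, and the sketch treats any run of letters, digits, and whitelisted characters as one word.

```python
import re


def tokenize(text, non_breaking=""):
    """Split text into words, keeping any character in `non_breaking` inside words."""
    pattern = "[\\w" + re.escape(non_breaking) + "]+"
    tokens = re.findall(pattern, text)
    # Drop tokens made only of whitelisted punctuation, such as a lone "-".
    return [token for token in tokens if re.search(r"\w", token)]


default_words = tokenize("mak-mak and muckamuck")
hyphen_words = tokenize("mak-mak and muckamuck", non_breaking="-")
```

By default the hyphen splits "mak-mak" into two tokens; with the hyphen whitelisted it stays one token and can be matched against a single dictionary entry.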

firion1234 commented 6 years ago

I made a GitHub account just to implore you to reconsider the addition of phrases as a feature. You strike me as being an especially avid language learner so you're no doubt familiar with the importance of collocations in the acquisition process. I feel that highlighting (in this case literally) the fact that two words frequently occur together or when occurring together mean something completely different can help lighten some of the cognitive load of vocabulary acquisition. In other words, without having a phrase feature, the learner has to maintain a cognitive database of collocations instead of outsourcing the task to the add-on to keep track of. Repeatedly seeing two or more words inside the same highlight will eventually drive the point home but until then, the learner is left wondering "Have I seen this combination before?" Whereupon, he or she may dismiss the feeling that there is a connection.

I have spent years dreaming about what features I would include in a Learning With Texts style software but I am hopeless at formal languages and have tried and failed many times to teach myself several of them. I could likely fill a book with ideas I have on the subject, especially regarding interlinear text software. I digress; however, I say that to say this: I would gladly discuss with you the particulars of your project and provide as much feedback as you deem would be helpful to your endeavors. Thank you for reading and thank you for Wordology.