UAlbertaALTLab / morphodict

Plains Cree Intelligent Dictionary
https://itwewina.altlab.app/
Apache License 2.0

Review workings/mappings of Cree-to-English and English-to-Cree phrase translation #1005

Open aarppe opened 2 years ago

aarppe commented 2 years ago

Currently, when typing in some inflected Cree word-forms or English phrases, which should yield an English phrase or a Cree word-form, respectively, the translation appears to work only partially. For instance:

  1. Cree word-forms with certain preverbs, e.g. kikî-wâpamitin, kiwî-wâpamitin, kika-wâpamitin, do not get an English phrase translation, whereas the form without a tense preverb, kiwâpamitin, does.

  2. English phrases with some inflected forms of verbs, e.g. I saw you or I helped him, do not appear to be analyzed into a Cree word-form, even though the FST tool does work for them, whereas non-inflected variants such as I see you and I help him do get matched with a Cree word-form.

  3. English phrases with certain multi-part arguments, e.g. you and we or you and us, do not appear to be analyzed properly, even though in the FST tool they get the expected analyses, whereas others, such as you all, do work.

nienna73 commented 2 years ago

As far as I can tell, the auto-translations only appear for entries that exist in the database. So even if an entry is analyzable, it won't get an auto-translation unless it was in the database at import-time. Auto-translations are generated at time of import. I'm trying to import the latest changes to the db now to see if I can fix this somehow.
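To illustrate the behaviour described here, a minimal sketch follows. It is not morphodict's actual code: `import_dictionary`, `lookup_auto_translation` and the toy translation table are hypothetical names made up for this example. It only shows why a form that the phrase generator could handle still gets no auto-translation if it was not part of the import set.

```python
# Hypothetical sketch: auto-translations are built once, at import time.
# None of these names come from morphodict itself.

def import_dictionary(wordforms, translate):
    """Pre-generate a translation for every wordform present at import."""
    return {form: translate(form) for form in wordforms}


def lookup_auto_translation(store, form):
    """At search time only the pre-generated store is consulted."""
    return store.get(form)  # None if the form was not imported


# Toy stand-in for the phrase generator; it can handle both forms below.
toy_translations = {
    "kiwâpamitin": "I see you",
    "kikî-wâpamitin": "I saw you",
}

# Only the form without a tense preverb is in the import set.
store = import_dictionary(["kiwâpamitin"], toy_translations.get)

print(lookup_auto_translation(store, "kiwâpamitin"))     # I see you
print(lookup_auto_translation(store, "kikî-wâpamitin"))  # None
```

Under this assumption, the fix is either to include the preverbed forms in the import set or to fall back to generating at search time.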

aarppe commented 2 years ago

Ok. This would correspond to a recollection that I had about how the English translations were originally implemented, i.e. that they were pre-generated rather than generated on the fly.

As noted on several earlier occasions, there was a time when we generated the English phrases for all the word-forms in the paradigms, which included the most common grammatical preverbed forms, i.e. those with kî- and wî- (for both Ind and Cnj), ka- for Independent only (as Future Definite), plus the ta-/ka- Infinitive forms (conjunct only).

The idea with the pregeneration was that it would take a while, but not too long, which is acceptable since it wouldn't be done often.

As we've discussed earlier, the preverbed verbal word-forms would add perhaps three times as many forms as we currently generate (i.e. adding 8 subpanes to the current 3), so generation should take roughly 4-5 times as long as it does now; the time should scale as O(n), not O(n^2).
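The estimate can be checked with a quick back-of-the-envelope calculation. This is only a sketch: it assumes every subpane costs about the same to generate, and `linear_scale_factor` is a name invented for the example.

```python
# Back-of-the-envelope estimate, assuming import time grows linearly
# with the number of generated subpanes (the O(n) assumption above).

def linear_scale_factor(current_subpanes, added_subpanes):
    """Ratio of new total work to current work under linear scaling."""
    return (current_subpanes + added_subpanes) / current_subpanes


factor = linear_scale_factor(current_subpanes=3, added_subpanes=8)
print(f"{factor:.2f}x")  # 3.67x
```

A pure subpane count gives a little under 4x, so the 4-5x figure leaves some headroom if the preverbed subpanes turn out to be somewhat more expensive than the existing ones.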

The new transcriptor generating English verb phrases should be smaller and faster and more scalable.

This is probably also one of those implementations that we need to work through comprehensively once, so that what remains afterwards is ideally just updating.

nienna73 commented 2 years ago

I finally found the portion of code that was ruling out preverb generation. If I add it back in, the functionality comes back, which is what we want. It does take significantly longer to import the dictionary (around 30 minutes on my machine, so likely an hour on the server), but since we aren't doing that often, I don't see it being a problem.

nienna73 commented 2 years ago

Here are the stats from importing the whole dictionary while generating auto-definitions on my machine:

100%|██████████| 23367/23367 [22:59<00:00, 16.94it/s]
Translation stats:
  wordforms_examined: 2,892,178
  definitions_created: 2,799,787
  no_translation_count: 0
  no_phrase_analysis_count: 89,663
  multiple_phrase_analyses_count: 2,728
  preverb_form: 0
  unknown_tags_during_auto_translation:
Building definition vectors
100%|██████████| 32244/32244 [00:35<00:00, 912.59it/s]
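As a sanity check on these figures, the short sketch below adds up the outcome counters and computes the share of wordforms that received a definition. It assumes the four counters are mutually exclusive outcomes per wordform; the log does not say so explicitly, but the arithmetic is consistent with that reading.

```python
# Figures copied from the import log above.
stats = {
    "wordforms_examined": 2_892_178,
    "definitions_created": 2_799_787,
    "no_translation_count": 0,
    "no_phrase_analysis_count": 89_663,
    "multiple_phrase_analyses_count": 2_728,
}

outcomes = (
    stats["definitions_created"]
    + stats["no_translation_count"]
    + stats["no_phrase_analysis_count"]
    + stats["multiple_phrase_analyses_count"]
)
# The four outcome counters add up exactly to the wordforms examined.
assert outcomes == stats["wordforms_examined"]

coverage = stats["definitions_created"] / stats["wordforms_examined"]
print(f"coverage: {coverage:.1%}")  # coverage: 96.8%
```

So roughly 3% of the examined wordforms ended up without a definition, almost all of them because no phrase analysis was found.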

aarppe commented 2 years ago

Great! Are you using the newest English phrase generator transcriptor for verbs?

nienna73 commented 2 years ago

I don't think I am. Would I find that in the ALTLab repo?

aarppe commented 2 years ago

It's in the GiellaLT repo for crk, i.e. https://github.com/giellalt/lang-crk, specifically here: https://github.com/giellalt/lang-crk/blob/main/src/transcriptions/transcriptor-cw-eng-verb-entry2inflected-phrase-w-flags-and-templates.xfscript - but I'd need to check that everything is up to date. I should note that the transcriptor relies on a number of files for the final FST (in https://github.com/giellalt/lang-crk/tree/main/src/transcriptions).