UAlbertaALTLab / morphodict

The Language Independent Intelligent Dictionary
https://morphodict.readthedocs.io/
Apache License 2.0

Update crkeng.xml #469

Closed by aarppe 4 years ago

aarppe commented 4 years ago

Following up on the fix in #465, we would next need to import that into itwêwina for the new/corrected inflectional categories to take effect.

The new version can be found in the ALTLab GIT repo:

altlab/crk/dicts/crkeng.xml

aarppe commented 4 years ago

@Madoshakalaka Do you have access to the ALTLab repo mentioned above?

@eddieantonio In order for us to make the most of the recent improvement showing inflectional categories etc., we need to update our Cree-to-English dictionary source to conform with the latest AEW coding, as is now fixed in the crkeng.xml file.

eddieantonio commented 4 years ago

> @Madoshakalaka Do you have access to the ALTLab repo mentioned above?
>
> @eddieantonio In order for us to make the most of the recent improvement showing inflectional categories etc., we need to update our Cree-to-English dictionary source to conform with the latest AEW coding, as is now fixed in the crkeng.xml file.

:+1: okay. Matt usually does the DB updates! But I'll see if I can do it.

eddieantonio commented 4 years ago

@aarppe: a few things to note in this version of crkeng.xml:

<t> has empty content in entry

<e>
  <lg>
    <l pos="N">ohpinikêwin</l>
    <lc>NI-1</lc>
    <stem>ohpinikêwin-</stem>
  </lg>
  <mg>
    <tg xml:lang="eng">
      <t pos="N" sources="MD" />
    </tg>
  </mg>
  <mg>
    <tg xml:lang="eng">
      <t pos="N" sources="CW">weightlifting; act of lifting things</t>
    </tg>
  </mg>
</e>
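A check like this can be approximated outside the importer. The following is a minimal sketch, not the project's actual diagnostic code: it assumes entries are `<e>` elements as in the snippet above, and the `<r>` wrapper element and the function name `find_empty_translations` are hypothetical.

```python
import xml.etree.ElementTree as ET

# The entry from this thread, wrapped in a hypothetical <r> root element.
SAMPLE = """<r>
<e>
  <lg>
    <l pos="N">ohpinikêwin</l>
    <lc>NI-1</lc>
    <stem>ohpinikêwin-</stem>
  </lg>
  <mg>
    <tg xml:lang="eng">
      <t pos="N" sources="MD" />
    </tg>
  </mg>
  <mg>
    <tg xml:lang="eng">
      <t pos="N" sources="CW">weightlifting; act of lifting things</t>
    </tg>
  </mg>
</e>
</r>"""


def find_empty_translations(root):
    """Return (lemma, sources) pairs for every <t> element without text."""
    empty = []
    for entry in root.iter("e"):
        lemma = (entry.findtext("lg/l") or "").strip()
        for t in entry.iter("t"):
            # t.text is None for a self-closing tag such as <t ... />.
            if not (t.text or "").strip():
                empty.append((lemma, t.get("sources")))
    return empty


root = ET.fromstring(SAMPLE)
print(find_empty_translations(root))  # → [('ohpinikêwin', 'MD')]
```

For the full file, `ET.parse(path).getroot()` would replace `ET.fromstring(SAMPLE)`, assuming the file fits in memory.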

There are 1078 (lemma, pos, ic) that the fst can not give any analyses.
There are 173 (lemma, pos, ic) that do not have proper lemma analysis by fst
There are 13 (lemma, pos, ic) that have ambiguous lemma analyses
These words will be label 'as-is', meaning their lemmas are undetermined.

Thanks @Madoshakalaka for implementing these diagnostic messages!

eddieantonio commented 4 years ago

Done!

[Screenshot: Screen Shot 2020-06-22 at 9 28 00 AM]

aarppe commented 4 years ago

@eddieantonio Great! Was ohpinikêwin the only entry for which the <t> field is missing?

Also, are the results of the diagnostics available somewhere? I.e. to check why some forms are not analyzed, or incorrectly analyzed?

eddieantonio commented 4 years ago

> @eddieantonio Great! Was ohpinikêwin the only entry for which the <t> field is missing?

It's the only one that the diagnostics reported, yes.

> Also, are the results of the diagnostics available somewhere? I.e. to check why some forms are not analyzed, or incorrectly analyzed?

Nope :/ We could make that a thing we log, but currently, the database is generated on our local machines, then pushed to Sapir.

Madoshakalaka commented 4 years ago

@aarppe Every time we rebuild the database, a detailed log of these diagnostics is recorded. I could send you one if you'd like. Part of the log looks like this:

2020-06-16 16:22:52,889 — DatabaseManager.xml_entry_lemma_finder — DEBUG — xml entry mêstan-pîwayân with pos N ic NI-1 can not be analyzed by fst strict analyzer
2020-06-16 16:22:52,889 — DatabaseManager.xml_entry_lemma_finder — DEBUG — xml entry mêstâciwatêw with pos V ic VII-v can not be analyzed by fst strict analyzer
2020-06-16 16:22:52,889 — DatabaseManager.xml_entry_lemma_finder — DEBUG — xml entry mêsti- with pos Ipc ic IPV can not be analyzed by fst strict analyzer
2020-06-16 16:22:52,890 — DatabaseManager.xml_entry_lemma_finder — DEBUG — xml entry micakisîs with pos N ic NI-1 have analyses by fst strict analyzer. Yet all analyses conflict with the pos/ic in xml file
2020-06-16 16:22:52,890 — DatabaseManager.xml_entry_lemma_finder — DEBUG — xml entry micimôtâw with pos V ic VTI-2 can not be analyzed by fst strict analyzer
2020-06-16 16:22:52,891 — DatabaseManager.xml_entry_lemma_finder — DEBUG — xml entry miciyawêsiw with pos V ic VAI-v can not be analyzed by fst strict analyzer
2020-06-16 16:22:52,891 — DatabaseManager.xml_entry_lemma_finder — DEBUG — xml entry mihko- with pos Ipc ic IPV can not be analyzed by fst strict analyzer
2020-06-16 16:22:52,891 — DatabaseManager.xml_entry_lemma_finder — DEBUG — xml entry mihkopêmak with pos N ic NA-3 can not be analyzed by fst strict analyzer
2020-06-16 16:22:52,892 — DatabaseManager.xml_entry_lemma_finder — DEBUG — xml entry mihkowi- with pos Ipc ic IPN can not be analyzed by fst strict analyzer
2020-06-16 16:22:52,892 — DatabaseManager.xml_entry_lemma_finder — DEBUG — xml entry mihkwaskîwakâhk with pos N ic INM have analyses by fst strict analyzer. Yet all analyses conflict with the pos/ic in xml file
2020-06-16 16:22:52,892 — DatabaseManager.xml_entry_lemma_finder — DEBUG — xml entry mihyawê- with pos Ipc ic IPV can not be analyzed by fst strict analyzer
2020-06-16 16:22:52,892 — DatabaseManager.xml_entry_lemma_finder — DEBUG — xml entry mihyawê- with pos Ipc ic IPN can not be analyzed by fst strict analyzer
2020-06-16 16:22:52,892 — DatabaseManager.xml_entry_lemma_finder — DEBUG — xml entry mikisiwacîhk with pos N ic INM have analyses by fst strict analyzer. Yet all analyses conflict with the pos/ic in xml file
2020-06-16 16:22:52,893 — DatabaseManager.xml_entry_lemma_finder — DEBUG — xml entry mikisiwi- with pos Ipc ic IPN can not be analyzed by fst strict analyzer
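To see which inflectional categories fail most often, a log like this can be tallied with a short script. This is a rough sketch that assumes every diagnostic message follows the "xml entry ... with pos ... ic ..." wording shown above; the category labels `no-analysis` and `pos-ic-conflict` and the function names are made up for illustration.

```python
import re
from collections import Counter

# Matches the message part of a log line; the timestamp and logger name
# in front of it are skipped because the pattern is applied with search().
MESSAGE = re.compile(
    r"xml entry (?P<lemma>.+?) with pos (?P<pos>\S+) ic (?P<ic>\S+) (?P<rest>.+)$"
)


def classify(rest):
    """Map the tail of a message to a short, made-up category label."""
    if rest.startswith("can not be analyzed"):
        return "no-analysis"
    if "conflict with the pos/ic" in rest:
        return "pos-ic-conflict"
    return "other"


def summarize(lines):
    """Count diagnostics per (category, ic) and collect the parsed entries."""
    counts = Counter()
    entries = []
    for line in lines:
        match = MESSAGE.search(line)
        if match is None:
            continue  # not a diagnostic line
        kind = classify(match.group("rest"))
        counts[(kind, match.group("ic"))] += 1
        entries.append((match.group("lemma"), match.group("pos"), match.group("ic"), kind))
    return counts, entries


# Message parts copied from the log excerpt above.
SAMPLE_LINES = [
    "xml entry mêstan-pîwayân with pos N ic NI-1 can not be analyzed by fst strict analyzer",
    "xml entry mêsti- with pos Ipc ic IPV can not be analyzed by fst strict analyzer",
    "xml entry micakisîs with pos N ic NI-1 have analyses by fst strict analyzer. "
    "Yet all analyses conflict with the pos/ic in xml file",
    "xml entry mihko- with pos Ipc ic IPV can not be analyzed by fst strict analyzer",
]

counts, entries = summarize(SAMPLE_LINES)
for (kind, ic), n in counts.most_common():
    print(kind, ic, n)
```

Grouping by `ic` makes it easier to spot whether failures cluster in one category (for example the preverbs tagged IPV) or are spread across the dictionary.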

> Do you have access to the ALTLab repo mentioned above?

I tried the other day with @eddieantonio, but there seem to be problems with SSH authentication. I'll figure things out and try to access that repo again.

aarppe commented 4 years ago

Added the missing English translation from MD for ohpinikêwin to crkeng.xml, so that should be fine in subsequent iterations, until we have a proper dictionary database. Some <stem> fields still contain unnecessary information (namely inflectional category codes), which would need to be removed.
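One possible way to locate such <stem> fields, sketched under a strong assumption: that the stray code in a <stem> is the same string as the sibling <lc> value. The thread does not say what the extra information looks like, so the second sample entry below is invented purely for illustration, as are the `<r>` wrapper and the function name.

```python
import xml.etree.ElementTree as ET

# First <lg> is from this thread; the second is a HYPOTHETICAL dirty entry.
SAMPLE = """<r>
<e><lg><l pos="N">ohpinikêwin</l><lc>NI-1</lc><stem>ohpinikêwin-</stem></lg></e>
<e><lg><l pos="N">example</l><lc>NI-1</lc><stem>example- NI-1</stem></lg></e>
</r>"""


def stems_containing_ic(root):
    """Return (lemma, stem, ic) for stems that contain their own <lc> code."""
    flagged = []
    for lg in root.iter("lg"):
        lemma = (lg.findtext("l") or "").strip()
        stem = (lg.findtext("stem") or "").strip()
        ic = (lg.findtext("lc") or "").strip()
        # Assumption: the unwanted text is exactly the <lc> value.
        if ic and ic in stem:
            flagged.append((lemma, stem, ic))
    return flagged


print(stems_containing_ic(ET.fromstring(SAMPLE)))
```

If the stray codes turn out to differ from the <lc> value, the membership test would need to be replaced with a pattern for the codes actually found in the file.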