apertium / apertium-tat

Apertium linguistic data for Tatar
GNU General Public License v3.0

Loanwords after marking them #29

Closed mansayk closed 5 years ago

mansayk commented 5 years ago

Hello!

I made two fairly large commits, https://github.com/apertium/apertium-tat/commit/185c32570735be0fbf5a520bc1df2d3e48c70098 and https://github.com/apertium/apertium-tat/commit/b5db00faafd6ec1ecef9c75148d65c1d4f3a01f4, in which I marked about 3800 loanwords. Since then, those words are no longer processed in analysis. Could you tell me what is wrong there?

jonorthwash commented 5 years ago

As far as I can tell, they are getting marked in the analysis.

$ echo абитуриент | apertium -d . tat-morph
^абитуриент/абитуриент<n><attr>/абитуриент<n><nom>/абитуриент<n><nom>+и<cop><aor><p3><pl>/абитуриент<n><nom>+и<cop><aor><p3><sg>$^./.<sent>$

$ echo Курил | apertium -d . tat-morph
^Курил/Курил<np><top><attr>/Курил<np><top><nom>/Курил<np><top><attr><err_orth>/Курил<np><top><nom><err_orth>/Курил<np><top><nom>+и<cop><aor><p3><pl>/Курил<np><top><nom>+и<cop><aor><p3><sg>/Курил<np><top><nom>+и<cop><aor><p3><pl><err_orth>/Курил<np><top><nom>+и<cop><aor><p3><sg><err_orth>$^./.<sent>$

Or else, maybe I don't understand the problem.
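
To check a long word list rather than reading the output by eye, the tat-morph output can be split into lexical units and tested for unknown words. Below is a minimal sketch, assuming the usual Apertium stream format (`^surface/analysis/...$`, with unknown words shown as a single analysis starting with `*`). The function names are made up for illustration, and escaped `/`, `^` or `$` characters inside a unit are not handled.

```python
import re

# One lexical unit looks like ^surface/analysis1/analysis2$ .
# Assumption: no escaped '^', '$' or '/' occurs inside a unit.
UNIT = re.compile(r"\^.*?\$")


def split_units(line: str) -> list[str]:
    """Return every ^...$ lexical unit found in one line of analyser output."""
    return UNIT.findall(line)


def parse_lexical_unit(lu: str) -> tuple[str, list[str]]:
    """Split one unit into its surface form and its list of analyses."""
    if not (lu.startswith("^") and lu.endswith("$")):
        raise ValueError(f"not a lexical unit: {lu!r}")
    surface, *analyses = lu[1:-1].split("/")
    return surface, analyses


def is_unknown(lu: str) -> bool:
    """True if the only 'analysis' is the surface form prefixed with '*'."""
    _, analyses = parse_lexical_unit(lu)
    return len(analyses) == 1 and analyses[0].startswith("*")


if __name__ == "__main__":
    # Shortened version of the абитуриент output shown above.
    line = "^абитуриент/абитуриент<n><attr>/абитуриент<n><nom>$^./.<sent>$"
    for unit in split_units(line):
        surface, analyses = parse_lexical_unit(unit)
        status = "unknown" if is_unknown(unit) else f"{len(analyses)} analyses"
        print(surface, status)
```

Piping `apertium -d . tat-morph` output for the whole loanword list through something like this would show quickly whether any of the marked words really end up unanalysed.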

Also, note that you don't need %{☭%} on the right side of a given entry if it's categorised as N1-RUS. That is, you should change a line like

абитуриент:абитуриент%{☭%} N1-RUS ; ! ""

to just

абитуриент:абитуриент N1-RUS ; ! ""

The reason is that the N1-RUS definition already contains %{☭%}, so repeating it in the entry results in two %{☭%}s in the lexc transducer, e.g.,

$ echo "абитуриент<n><dat>" | hfst-lookup .deps/tat.LR.lexc.hfst 
hfst-lookup: warning: It is not possible to perform fast lookups with OpenFST, std arc, tropical semiring format automata.
Using HFST basic transducer format and performing slow lookups
> абитуриент<n><dat>    абитуриент{☭}{☭}>{G}{A} 0.000000

This has the potential to break a certain amount of phonology.
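
With about 3800 entries, removing the extra marker by hand is tedious. Here is a rough sketch of a script that drops %{☭%} from the right side of an entry only when its continuation class is one known to add the marker itself. The function name and the `classes` parameter are hypothetical. It assumes simple one-line `upper:lower CLASS ;` entries without escaped spaces, and that lines are passed without a trailing newline. Review the resulting diff before committing.

```python
import re

MARKER = "%{☭%}"

# upper:lower%{☭%} CLASS ; optional comment
# Assumption: upper and lower contain no whitespace (no "% " escapes).
ENTRY = re.compile(
    r"^(?P<upper>[^:!\s]+):(?P<lower>\S+?)"
    + re.escape(MARKER)
    + r"(?P<gap>\s+)(?P<cls>\S+)(?P<tail>\s*;.*)$"
)


def strip_redundant_marker(line: str, classes=frozenset({"N1-RUS"})) -> str:
    """Remove the marker from an entry whose class already supplies it.

    `classes` lists the continuation classes assumed to contain the marker;
    entries pointing anywhere else are returned unchanged.
    """
    m = ENTRY.match(line)
    if m is None or m["cls"] not in classes:
        return line
    return f"{m['upper']}:{m['lower']}{m['gap']}{m['cls']}{m['tail']}"


if __name__ == "__main__":
    before = 'абитуриент:абитуриент%{☭%} N1-RUS ; ! ""'
    print(strip_redundant_marker(before))
```

Keeping the class set explicit means an entry that points at plain N1 keeps its hard-coded marker, which it still needs.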

mansayk commented 5 years ago

It seems that adjectives don't have an A1-RUS category. How should I mark them? And what about NP-TOP, NP-ANT-M, NP-COG-OB, and so on? Maybe it is better to keep the following form?

абитуриент:абитуриент%{☭%} N1 ; ! ""

jonorthwash commented 5 years ago

I would say it's better to use N1-RUS for nouns. For other parts of speech you can either make separate categories in the same way as N1-RUS or hard-code them like you have them.

One big advantage of having a separate category (besides not having to type or copy %{☭%} over and over) is that it makes it much easier to implement <err_orth> tags for forms that are spelled as if the words were not from Russian (like абитуриентне). In fact, we could simply add the following line to N1-RUS to achieve this:

N1 ; ! Err/Orth

On the other hand, perhaps not all words in this category are misspelled that way consistently, so it's possible we'd want to exclude them from getting <err_orth> tags. We could then either do everything manually or make a separate N1-RUS-ALWAYS category or similar. I favour more categories over hard-coding the phonology on a word-by-word basis.
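
To judge whether a blanket Err/Orth line would be too broad, it may help to see which surface forms are accepted only as <err_orth>. The sketch below works on tat-morph output units under the same stream-format assumptions as the analyser output earlier in the thread. The helper names are hypothetical, and escaped slashes are not handled.

```python
ERR_TAG = "<err_orth>"


def split_analyses(lu: str) -> tuple[str, list[str]]:
    """Split one ^surface/analysis/...$ unit into surface form and analyses."""
    lu = lu.strip()
    if not (lu.startswith("^") and lu.endswith("$")):
        raise ValueError(f"not a lexical unit: {lu!r}")
    surface, *analyses = lu[1:-1].split("/")
    return surface, analyses


def has_err_orth(lu: str) -> bool:
    """True if at least one analysis carries the <err_orth> tag."""
    _, analyses = split_analyses(lu)
    return any(ERR_TAG in a for a in analyses)


def only_err_orth(lu: str) -> bool:
    """True if every analysis carries <err_orth>, i.e. the spelling is
    recognised only as a non-standard one."""
    _, analyses = split_analyses(lu)
    return bool(analyses) and all(ERR_TAG in a for a in analyses)


if __name__ == "__main__":
    # Shortened version of the Курил output shown earlier in the thread.
    unit = "^Курил/Курил<np><top><nom>/Курил<np><top><nom><err_orth>$"
    print(has_err_orth(unit), only_err_orth(unit))
```

Forms where `only_err_orth` is true are the ones a reviewer would want to look at when deciding between a single N1-RUS category and something like N1-RUS-ALWAYS.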

mansayk commented 5 years ago

I did that: I added -RUS variants of many categories, for example A1-RUS and NP-TOP-RUS. Please take a look. I hope everything is correct.