Closed mansayk closed 5 years ago
As far as I can tell, they are getting marked in the analysis.
$ echo абитуриент | apertium -d . tat-morph
^абитуриент/абитуриент<n><attr>/абитуриент<n><nom>/абитуриент<n><nom>+и<cop><aor><p3><pl>/абитуриент<n><nom>+и<cop><aor><p3><sg>$^./.<sent>$
$ echo Курил | apertium -d . tat-morph
^Курил/Курил<np><top><attr>/Курил<np><top><nom>/Курил<np><top><attr><err_orth>/Курил<np><top><nom><err_orth>/Курил<np><top><nom>+и<cop><aor><p3><pl>/Курил<np><top><nom>+и<cop><aor><p3><sg>/Курил<np><top><nom>+и<cop><aor><p3><pl><err_orth>/Курил<np><top><nom>+и<cop><aor><p3><sg><err_orth>$^./.<sent>$
Or else, maybe I don't understand the problem.
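To check this kind of thing over many words at once, the analyser output can be split into lexical units and the analyses carrying a given tag counted. This is only a rough sketch: the helper names are made up for illustration, and it assumes the simple ^surface/analysis/...$ layout seen above, with no escaped slashes or carets inside a unit.

```python
import re

def parse_lexical_units(stream):
    """Split Apertium stream-format text into (surface, [analyses]) pairs.

    Assumes plain ^surface/analysis1/analysis2$ units without escaped
    delimiters, which is a simplification of the real format.
    """
    units = []
    for body in re.findall(r"\^(.*?)\$", stream):
        surface, *analyses = body.split("/")
        units.append((surface, analyses))
    return units

def has_tag(analysis, tag):
    """Return True if the analysis string contains <tag>."""
    return f"<{tag}>" in analysis

# Shortened version of the Курил output above (hypothetical sample input).
output = (
    "^Курил/Курил<np><top><attr>/Курил<np><top><nom>"
    "/Курил<np><top><attr><err_orth>/Курил<np><top><nom><err_orth>$^./.<sent>$"
)

for surface, analyses in parse_lexical_units(output):
    flagged = [a for a in analyses if has_tag(a, "err_orth")]
    print(surface, len(analyses), "analyses,", len(flagged), "with <err_orth>")
```

Piping the output of apertium -d . tat-morph into something like this would show quickly which words get <err_orth> readings and which get none.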
Also, note that you don't need %{☭%} on the right side of a given entry if it's categorised as N1-RUS. That is, you should change a line like
абитуриент:абитуриент%{☭%} N1-RUS ; ! ""
to just
абитуриент:абитуриент N1-RUS ; ! ""
The reason is that the N1-RUS definition already contains %{☭%}. Keeping it in the entry as well results in two %{☭%}s in the lexc transducer, e.g.,
$ echo "абитуриент<n><dat>" | hfst-lookup .deps/tat.LR.lexc.hfst
hfst-lookup: warning: It is not possible to perform fast lookups with OpenFST, std arc, tropical semiring format automata.
Using HFST basic transducer format and performing slow lookups
> абитуриент<n><dat> абитуриент{☭}{☭}>{G}{A} 0.000000
This has the potential to break a certain amount of phonology.
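With thousands of entries, the redundant marker could be stripped mechanically rather than by hand. The following is a minimal sketch, not a tested migration script: the function name and the regular expression are made up for illustration, and it assumes that every continuation class ending in -RUS already supplies %{☭%} itself, which is only confirmed above for N1-RUS.

```python
import re

MARKER = "%{☭%}"

# Matches a simple lexc entry of the form  lemma:surface CONTCLASS ; comment
# (assumption: no spaces or escaped colons inside the lemma or surface side).
ENTRY = re.compile(
    r"^(?P<upper>\S+?):(?P<lower>\S+)\s+(?P<cont>\S+)\s*;(?P<rest>.*)$"
)

def strip_redundant_marker(line):
    """Drop a trailing %{☭%} from the right side of entries in a -RUS class.

    Lines that are not simple entries, or that point to a class not ending
    in -RUS, are returned unchanged.
    """
    m = ENTRY.match(line.strip())
    if not m or not m.group("cont").endswith("-RUS"):
        return line
    lower = m.group("lower")
    if not lower.endswith(MARKER):
        return line
    lower = lower[: -len(MARKER)]
    return f"{m.group('upper')}:{lower} {m.group('cont')} ;{m.group('rest')}"

print(strip_redundant_marker('абитуриент:абитуриент%{☭%} N1-RUS ; ! ""'))
print(strip_redundant_marker('абитуриент:абитуриент%{☭%} N1 ; ! ""'))
```

The second call leaves the line alone, since plain N1 does not add the marker and the entry still needs it there.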
It seems that adjectives don't have an A1-RUS form; how should I mark them? And what about NP-TOP, NP-ANT-M, NP-COG-OB? Maybe it is better to keep the following form?
абитуриент:абитуриент%{☭%} N1 ; ! ""
I would say it's better to use N1-RUS for nouns. For other parts of speech you can either make separate categories in the same way as N1-RUS or hard-code them like you have them.
One big advantage of having a separate category (besides not having to type/copy %{☭%} a lot) is that it will make it a lot easier to implement <err_orth> tags for forms that are spelled as if the words were not from Russian (like абитуриентне). In fact, we could simply add the following line to N1-RUS to achieve this:
N1 ; ! Err/Orth
On the other hand, perhaps not all words in this category are misspelled that way consistently, so it's possible we'd want to exclude them from getting <err_orth> tags. We could then either do everything manually or make a separate N1-RUS-ALWAYS category or similar. I favour more categories over hard-coding the phonology on a word-by-word basis.
I did that: I added -RUS to many categories, for example A1-RUS and NP-TOP-RUS. Please take a look. I hope everything is correct.
Hello!
I made quite a big commit: https://github.com/apertium/apertium-tat/commit/185c32570735be0fbf5a520bc1df2d3e48c70098 https://github.com/apertium/apertium-tat/commit/b5db00faafd6ec1ecef9c75148d65c1d4f3a01f4 where I marked about 3800 loanwords. After that, they are no longer processed in analysis. Could you tell me what is wrong there?