Open snomos opened 3 years ago
The following works fine without divvun-normaliser:
echo 'Man vuoras: 23' | hfst-tokenise -g tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst \
| vislcg3 -g tools/tokenisers/mwe-dis.bin | cg-mwesplit
"<Man>"
"Man" N Prop Sem/Plc Sg Nom <W:0.0>
"Man" N Prop Sem/Sur Sg Nom <W:0.0>
"man" Adv <W:0.0>
"mij" Pron Interr Sg Gen <W:0.0>
"mij" Pron Interr Sg Ill Attr <W:0.0>
"mij" Pron Interr Sg Ine Attr <W:0.0>
"mij" Pron Rel Sg Gen <W:0.0>
"mij" Pron Rel Sg Ill Attr <W:0.0>
"mij" Pron Rel Sg Ine Attr <W:0.0>
:
"<vuoras>"
"vuoras" A Attr <W:0.0>
"vuoras" A Sg Nom <W:0.0>
"vuoras" Err/Orth A Attr <W:0.0>
"vuoras" Err/Orth A Sg Nom <W:0.0>
"vuorrat" Ex/V IV Der/st V Ind Prs Err/Orth Sg3 <W:0.0>
"vuorrat" Ex/V IV Der/st V Ind Prs Sg3 <W:0.0>
"<:>"
":" CLB <W:0.0>
:
"<23>"
"23" A Arab Ord Attr CLBfinal <W:0.0>
"23" Num Arab Sg Ela Attr <W:0.0>
"23" Num Arab Sg Gen <W:0.0>
"23" Num Arab Sg Ill Attr <W:0.0>
"23" Num Arab Sg Ine Attr <W:0.0>
"23" Num Arab Sg Nom <W:0.0>
"23" Num Sem/ID <W:0.0>
:\n
But with divvun-normaliser I get a libdivvun error (and not the expected output format):
echo 'Man vuoras: 23' | hfst-tokenise -g tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst \
| vislcg3 -g tools/tokenisers/mwe-dis.bin \
| cg-mwesplit \
| divvun-normaliser -a src/analyser-gt-desc.hfst -n tools/tts/transcriptor-gt-desc.hfst -g src/generator-gt-norm.hfst
libdivvun: ERROR: HfstException.
"<Man>"
:
"<vuoras>"
"<:>"
:
"<23>"
:\n
It seems I didn't manage to set the default for -t tags, so it didn't print anything; now it should copy the input if no tags are set to be expanded.
I pushed a bit more debugging output; it seems we need hfstol's for lookup_fd:
echo 'Man vuoras: 23' | hfst-tokenise -g ~/github/giellalt/lang-smj/tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst \
| vislcg3 -g ~/github/giellalt/lang-smj/tools/tokenisers/mwe-dis.bin \
| cg-mwesplit \
| src/divvun-normaliser -a ~/github/giellalt/lang-smj/src/analyser-gt-desc.hfstol -n ~/github/giellalt/lang-smj/tools/tts/transcriptor-gt-desc.hfstol -g ~/github/giellalt/lang-smj/src/generator-gt-norm.hfstol --tags Arab -v
libdivvun: ERROR: HfstException: Exception: NotTransducerStreamException: transducer type not recognised in file: HfstInputStream.cc on line: 1088
Read /home/flammie/github/giellalt/lang-smj/tools/tts/transcriptor-gt-desc.hfstol, /home/flammie/github/giellalt/lang-smj/src/generator-gt-norm.hfstol, /home/flammie/github/giellalt/lang-smj/src/analyser-gt-desc.hfstol
"<Man>"
"Man" N Prop Sem/Plc Sg Nom <W:0.0>
"Man" N Prop Sem/Sur Sg Nom <W:0.0>
"man" Adv <W:0.0>
"mij" Pron Interr Sg Gen <W:0.0>
"mij" Pron Interr Sg Ill Attr <W:0.0>
"mij" Pron Interr Sg Ine Attr <W:0.0>
"mij" Pron Rel Sg Gen <W:0.0>
"mij" Pron Rel Sg Ill Attr <W:0.0>
"mij" Pron Rel Sg Ine Attr <W:0.0>
:
"<vuoras>"
"vuoras" A Attr <W:0.0>
"vuoras" A Sg Nom <W:0.0>
"vuoras" Err/Orth A Attr <W:0.0>
"vuoras" Err/Orth A Sg Nom <W:0.0>
"vuorrat" Ex/V IV Der/st V Ind Prs Err/Orth Sg3 <W:0.0>
"vuorrat" Ex/V IV Der/st V Ind Prs Sg3 <W:0.0>
"<:>"
":" CLB <W:0.0>
:
"<23>"
"guaktalåkgålmmå" <W:0.0> "guaktalåkgålmmå"phon
"23" A Arab Ord Attr CLBfinal <W:0.0>
"guaktalåkgålmmå" <W:0.0> "guaktalåkgålmmå"phon
"23" Num Arab Sg Ela Attr <W:0.0>
"guaktalåkgålmmå" <W:0.0> "guaktalåkgålmmå"phon
"23" Num Arab Sg Gen <W:0.0>
"guaktalåkgålmmå" <W:0.0> "guaktalåkgålmmå"phon
"23" Num Arab Sg Ill Attr <W:0.0>
"guaktalåkgålmmå" <W:0.0> "guaktalåkgålmmå"phon
"23" Num Arab Sg Ine Attr <W:0.0>
"guaktalåkgålmmå" <W:0.0> "guaktalåkgålmmå"phon
"23" Num Arab Sg Nom <W:0.0>
"23" Num Sem/ID <W:0.0>
:\n
Nice progress 🙂
@unhammer are there any CG syntax restrictions on the transcribed string, "guaktalåkgålmmå"phon, in the test case above? We modelled it after the divvun-cgspell output, but that one has only one letter after the actual string. Just asking to avoid major changes later 🙂
"guaktalåkgålmmå"phon is a valid CG tag, though it is not considered a textual tag (not that I think that matters for you). The rule is that if a tag starts with ", it includes everything up to the next ", and from there everything up to the next whitespace. This avoids much unnecessary escaping.
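For illustration, the quoting rule described above can be sketched in a few lines of Python. This is only a reading of the rule as stated in this thread, not vislcg3's actual parser; the function name split_cg_tags is hypothetical, and escaping or other CG-3 syntax is not handled.

```python
def split_cg_tags(reading: str) -> list[str]:
    """Split one CG reading line into tags (hypothetical helper).

    Rule as described in the thread: a tag that starts with a double
    quote runs to the next double quote, and from there to the next
    whitespace. Any other tag simply runs to the next whitespace.
    """
    tags = []
    i, n = 0, len(reading)
    while i < n:
        if reading[i].isspace():
            i += 1
            continue
        start = i
        if reading[i] == '"':
            close = reading.find('"', i + 1)
            # Assumption: with no closing quote, fall back to
            # plain whitespace splitting for this tag.
            i = close + 1 if close != -1 else i + 1
        while i < n and not reading[i].isspace():
            i += 1
        tags.append(reading[start:i])
    return tags


line = '"guaktalåkgålmmå" <W:0.0> "guaktalåkgålmmå"phon'
print(split_cg_tags(line))
```

Under this reading, the suffix after the closing quote (phon here) can be of any length, since the tag only ends at whitespace.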
A case we haven't considered: dynamic compounds, i.e. cohorts with sub-readings. There are two considerations:
echo 1800-lågon | ./tools/tts/modes/smj-txt2ipa.mode
"<1800-lågon>"
"lågos" N Sem/Dummytag Ess <W:0.0> @HNOUN #1->0 "1800-lɔkon"phon
"1800" Num Cmp/Hyph Cmp <W:0.0> #1->0 "1800-lɔkon"phon
"låhko" N Sem/Amount Sg Ine <W:0.0> @HNOUN #1->0 "1800-lɔkon"phon
"1800" Num Cmp/Hyph Cmp <W:0.0> #1->0 "1800-lɔkon"phon
"lågos" N Sem/Dummytag Ess <W:0.0> @HNOUN #1->0 "1800-lɔkon"phon
"1800" Num Cmp/OblHyph Cmp <W:0.0> #1->0 "1800-lɔkon"phon
"låhko" N Sem/Amount Sg Ine <W:0.0> @HNOUN #1->0 "1800-lɔkon"phon
"1800" Num Cmp/OblHyph Cmp <W:0.0> #1->0 "1800-lɔkon"phon
:\n
If we could normalize 1800- independently of the rest of the compound, we would solve a lot of corner cases.
Perhaps the best solution would be to not change the basic cohort structure at all, i.e. that we do NOT add the original lemma as a subreading. Instead I suggest that we store the original in a tag string along the lines of the "abc"phon string, something like "1800-"orig or "1800-"olemma. The main purpose of retaining the original lemma is for debugging, and changing the cohort structure seems to cost too much.
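As a rough sketch of what this proposal could look like on a single reading line (Python, purely illustrative): the helper add_orig_tag and the suffix orig are assumptions taken from the suggestion above, not existing divvun-normaliser behaviour.

```python
def add_orig_tag(reading: str, original: str, suffix: str = "orig") -> str:
    """Append a hypothetical "<original>"<suffix> tag to a CG reading line.

    The cohort structure is left untouched; only one extra tag is added.
    """
    if '"' in original:
        # A quoted CG tag ends at the next double quote, so the stored
        # original must not contain one (no escaping in this sketch).
        raise ValueError("original form must not contain a double quote")
    return f'{reading.rstrip()} "{original}"{suffix}'


reading = '\t\t"1800" Num Cmp/Hyph Cmp <W:0.0> #1->0 "1800-lɔkon"phon'
print(add_orig_tag(reading, "1800-"))
```

Because the original form travels as an ordinary tag, downstream tools that ignore unknown tags would see the same readings and sub-readings as before.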
@flammie could you have a look at this? I added the new tasks to the task list in the initial comment.
Draft specification here.
Tasks: