Normalize each CG sub-reading separately, like phonemisation

divvun / libdivvun

lib for running gramcheck and other pipelines + cli; modules for CG→spelling, CG→feedback, tagging blanks

https://giellalt.github.io/proof/gramcheck/GrammarCheckerDocumentation.html

GNU General Public License v3.0


Normalize each CG sub-reading separately, like phonemisation #58

Open snomos opened 1 year ago

snomos commented 1 year ago

Cf https://github.com/divvun/libdivvun/issues/44#issuecomment-1098740800

See also the following example:

Ulmme lij gehtjadit gåktu 25 jahkebuolva 15-jagágij lidjin jåhtålam 23 sáme-vuona rabdaguovlo suohkanijs.

where in 15-jagágij 15 is not transcribed.

flammie commented 1 year ago

Currently the transcriptor is set up to look up the nearest surface form. For sub-readings without a surface-form tag or a similar tag, it falls back to the full word form 15-jagágij, which is not in the transcriptor. Maybe using the lemma makes sense for transcription, though.
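A minimal sketch of the fallback order described above, with the lemma tried before the full word form. Everything here is an assumption for illustration: the `"..."SURF` tag name, the function names, and the toy `TRANSCRIPTIONS` table are hypothetical, and the real transcriptor is an FST lookup inside libdivvun, not a dictionary.

```python
import re

# Hypothetical toy table standing in for the transcriptor FST.
TRANSCRIPTIONS = {"15": "lågenanvihtta"}

# The lemma is the first quoted string on a CG reading line.
LEMMA_RE = re.compile(r'^\s*"([^"]*)"')
# Assumed shape of a surface-form tag; the tag name SURF is a placeholder.
SURFACE_RE = re.compile(r'"([^"]*)"SURF(?!\w)')


def lookup_key(reading: str, wordform: str) -> str:
    """Choose the string to transcribe for one (sub-)reading line."""
    surface = SURFACE_RE.search(reading)
    if surface:
        return surface.group(1)   # 1. explicit surface-form tag
    lemma = LEMMA_RE.match(reading)
    if lemma:
        return lemma.group(1)     # 2. lemma of this sub-reading
    return wordform               # 3. last resort: the whole word form


def transcribe(reading: str, wordform: str):
    """Return the transcription for a reading, or None if not found."""
    return TRANSCRIPTIONS.get(lookup_key(reading, wordform))
```

With this order, the sub-reading `"15" Num Cmp/Hyph Cmp` is looked up as 15 rather than as 15-jagágij, so the numeral can be transcribed on its own.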

flammie commented 1 year ago

I see the other bug now. Yeah, it would possibly be much easier not to mess with more subreadings here...

flammie commented 1 year ago

"<15-jagágij>"
    "jahke" Ex/N Sem/Time Der/k A <smj> Pl Com <W:0.0> @<ADVL
        "lågenanvihtta" Num Sg Nom "lågenanvihtta"phon "15"oldlemma
    "jahke" Ex/N Sem/Time Der/k A <smj> Sg Ill <W:0.0> @<ADVL
        "lågenanvihtta" Num Sg Nom "lågenanvihtta"phon "15"oldlemma

This is the current output after normalise.

snomos commented 1 year ago

Looks good to me. What do you think, @ilm024 ?

what would be the full compound output?

snomos commented 1 year ago

With newest divvun-normalise I get the following:

"<15-jagágij>"
    "jahke" Ex/N Sem/Time Der/k A Pl Com "15-#»jagág9>ij"MIDTAPE <W:0.0> @<ADVL #7->3
        "15" Num Cmp/Hyph Cmp "15-#»jagág9>ij"MIDTAPE <W:0.0> #7->3
    "jahke" Ex/N Sem/Time Der/k A Sg Ill "15-#»jagág9>ij"MIDTAPE <W:0.0> @<ADVL #7->3
        "15" Num Cmp/Hyph Cmp "15-#»jagág9>ij"MIDTAPE <W:0.0> #7->3

What is missing to get what you get?

flammie commented 1 year ago

Probably version differences. The midtapes would confuse the normalise lookup, and I don't get them with my hfst as it is now. So the output of smj-normaliser6-cg.mode is just:

"<15-jagágij>"
    "jahke" Ex/N Sem/Time Der/k A Pl Com <W:0.0> @<ADVL #7->3
        "15" Num Cmp/Hyph Cmp <W:0.0> #7->3
    "jahke" Ex/N Sem/Time Der/k A Sg Ill <W:0.0> @<ADVL #7->3
        "15" Num Cmp/Hyph Cmp <W:0.0> #7->3

snomos commented 1 year ago

Ok. What is the input and the command you used to get the desired output?

flammie commented 1 year ago

E.g.

echo Ulmme lij gehtjadit gåktu 25 jahkebuolva 15-jagágij lidjin jåhtålam 23 sáme-vuona rabdaguovlo suohkanijs | $GTLANGS/lang-smj/tools/tts/modes/smj-normaliser6-cg.mode

etc. I'm not sure why I don't get midtapes. Debugging with --verbose, like

echo 15-jagágij | ~/github/hfst/hfst/tools/src/hfst-tokenize -g '/home/flammie/github/giellalt/lang-smj/tools/tokenisers/tokeniser-tts-cggt-desc.pmhfst.tmp' -v

just shows no results for lookups on midtapes.

flammie commented 1 year ago

I commented the midtape reading out; not sure if it made sense in the normalising step or was a copy-paste from the phonemiser.

snomos commented 1 year ago

I am not sure either whether we need midtape in the normaliser process, but we definitely need to retain midtape strings for later IPA conversion.

IIRC the idea was to have an option for "deep analysis" that would generate the midtape stuff for normalised input.
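One way to reconcile the two concerns (midtape tags confusing the lookup, but being needed later for IPA conversion) would be to set the tag aside during normalisation and re-attach it afterwards. The following is only a sketch under that assumption; `split_midtape` and `restore_midtape` are hypothetical helpers, not part of libdivvun, and re-attaching at the end of the line is an arbitrary choice.

```python
import re

# A MIDTAPE tag is a quoted string immediately followed by MIDTAPE.
MIDTAPE_RE = re.compile(r'\s*"([^"]*)"MIDTAPE(?!\w)')


def split_midtape(line: str):
    """Return (line without the MIDTAPE tag, midtape string or None)."""
    match = MIDTAPE_RE.search(line)
    if match is None:
        return line, None
    cleaned = line[:match.start()] + line[match.end():]
    return cleaned, match.group(1)


def restore_midtape(line: str, midtape):
    """Re-append a previously removed MIDTAPE tag (position is arbitrary)."""
    if midtape is None:
        return line
    return f'{line} "{midtape}"MIDTAPE'
```

The cleaned line can then go through the normalise lookup, while the saved string stays available for the later IPA step.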

flammie commented 1 year ago

well, midtape is kind-of retained now if it gets used by phon:

"<15-jagágij>"
    "jahke" Ex/N Sem/Time Der/k A Pl Com "15-#»jagág9>ij"MIDTAPE <W:0.0> @<ADVL #7->3
        "lågenanvihtta" Num Sg Nom "lågenanvihtta"phon "15"oldlemma
    "jahke" Ex/N Sem/Time Der/k A Sg Ill "15-#»jagág9>ij"MIDTAPE <W:0.0> @<ADVL #7->3
        "lågenanvihtta" Num Sg Nom "lågenanvihtta"phon "15"oldlemma
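
To check output like the above mechanically, a small reader can group sub-readings under their main reading and pull out the quoted tags. This is a rough sketch, not the libdivvun parser: it assumes sub-readings are simply indented deeper than main readings, and it only recognises the three tag names that appear in this thread (`phon`, `oldlemma`, `MIDTAPE`).

```python
import re

# Quoted strings followed directly by one of the tag names seen above.
TAGGED_RE = re.compile(r'"([^"]*)"(phon|oldlemma|MIDTAPE)(?!\w)')

EXAMPLE = '''"<15-jagágij>"
    "jahke" Ex/N Sem/Time Der/k A Pl Com "15-#»jagág9>ij"MIDTAPE <W:0.0> @<ADVL #7->3
        "lågenanvihtta" Num Sg Nom "lågenanvihtta"phon "15"oldlemma
'''


def tagged_strings(line: str) -> dict:
    """Map tag name to quoted value for one reading line."""
    return {name: value for value, name in TAGGED_RE.findall(line)}


def parse_cohort(text: str):
    """Return (wordform, readings); each reading carries its sub-readings."""
    wordform, readings, base_indent = None, [], None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith('"<'):
            wordform = line.strip()[2:-2]      # drop the "< and >" wrappers
            continue
        indent = len(line) - len(line.lstrip())
        if base_indent is None:
            base_indent = indent               # indent of main readings
        entry = {"line": line.strip(), "tags": tagged_strings(line), "subs": []}
        if indent > base_indent and readings:
            readings[-1]["subs"].append(entry)  # deeper indent: sub-reading
        else:
            readings.append(entry)
    return wordform, readings
```

Calling `parse_cohort(EXAMPLE)` yields one main reading whose `tags` hold the MIDTAPE string and whose single sub-reading holds the `phon` and `oldlemma` values.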

snomos commented 1 year ago

ok, good 🙂

snomos commented 1 year ago

we probably have to use the deep analyser thing to get a full MIDTAPE representation, if we need that