cdli-gh / mtaac_work

MTAAC work packages
https://cdli-gh.github.io/mtaac/

Test tools for automated lemmatization #1

Open khoidt opened 7 years ago

khoidt commented 7 years ago

Summary

For the project, we need to test tools for automated lemmatization, stemming and morphology induction. Please try to get them running on, and evaluate them against, the ETCSL CoNLL data. If you run into difficulties, you can ask Kathrin for help (she also prefers Python); if that doesn't help, try the next tool on the list. Note that our top priority is to eliminate morphological variation, not so much to produce linguistically plausible lemmas.

Tasks

Notes

CSTlemmatizer, Affixtrain and Marmot are supervised, train on WORD + LEMMA (+ POS + MORPH).
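As a starting point, a minimal sketch of extracting such WORD + LEMMA + POS triples from the ETCSL CoNLL files into a tab-separated training file (the column indices and file paths are assumptions; adjust them to the actual CoNLL layout):

```python
# Sketch: collect WORD<TAB>LEMMA<TAB>POS training triples from CoNLL files.
# Column positions and paths are assumptions -- adjust to the real layout.
import glob

FORM_COL, LEMMA_COL, POS_COL = 1, 2, 3  # hypothetical column indices

def read_conll(path):
    """Yield (form, lemma, pos) triples, skipping blank and comment lines."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue
            cols = line.split("\t")
            if len(cols) > POS_COL:
                yield cols[FORM_COL], cols[LEMMA_COL], cols[POS_COL]

with open("training_data.tsv", "w", encoding="utf-8") as out:
    for path in glob.glob("etcsl_conll/*.conll"):
        for form, lemma, pos in read_conll(path):
            out.write(f"{form}\t{lemma}\t{pos}\n")
```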

Roadmap Data

🗓 Start Date: 08-01-2017

🗓 Expected Date: 11-15-2017

💪 Label: wp

📈 Progress (0-1): 0.4

See Gantt: http://cdli-dev.org/gantt/mtaac_work/

khoidt commented 7 years ago

Obstacles for matching a lemma: study cases

This comment lists cases (with examples) where a simple match between a form and a lemma is not possible. The list makes no claim to completeness, but tentatively covers the most frequent cases.

1. Phonetic component (graphic reduplication of consonants):

{d}asar-ri: asari
gab2-bu-zu: gabu2

2. Syllabic spellings and/or usage of different signs:

ku-še: kuš2
gu2-un: gun2
u8-a: u5-a
{gi}ub4-zal-zu: ub-zal
{ŋiš}ildag3: ildag2
esir-bi: esir2
a-ah: ah6

3. Different / dialectal forms:

{d}mu-ul-lil2: en-lil2
{d}na-zi: nanše
šu-ru-ug-ga: šarag
{ŋiš}i-ri9-na-zu: erina8

4. Defective spellings / incorrect forms:

lu2-lul: lu2-lul-la

5. Determinatives:

{d}meš3-ki-aŋ2-{d}nanna: meš3-ki-aŋ2-nanna

6. Numerals:

li-mu-um-ta-am3: 1000

7. Suppletion:

im-ma-ni-ŋa2-ŋa2: ŋar
na-ma-tum3: de6
dur2-ru-na-ba: tuš

8. Inconsistent transliteration:

šar2-ra-ab-du: šar2-ra-ab-DU
niŋ2-ur2-4-e: niŋ2-ur2-limmu2


While in cases 1-2 and 5 the token's lemma can easily be identified after merging the signs or removing the determinatives, the other cases pose a more complex problem. A possible solution would be a lexicon to match variants and suppletion (esp. cases 3, 6 and 7) and a sign list to detect transliteration inconsistencies (esp. case 8).
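As a first pass, a minimal sketch of such a pipeline (the determinative and sign-boundary conventions follow the examples above; the lexicon entries are illustrative only, and a real lexicon would be built from the ePSD or project glossaries):

```python
# Sketch: strip determinatives and sign boundaries, collapse graphic
# reduplication, then consult a small variant lexicon (cases 3, 6 and 7).
# The lexicon entries here are illustrative only.
import re

VARIANT_LEXICON = {
    "mulil": "enlil",     # {d}mu-ul-lil2 -> en-lil2 (dialectal form)
    "limumtam": "1000",   # syllabically written numeral
}

def normalize(token):
    token = re.sub(r"\{[^}]*\}", "", token)   # case 5: drop determinatives
    token = token.replace("-", "")            # cases 1-2: merge signs
    token = re.sub(r"\d+", "", token)         # drop sign indices
    token = re.sub(r"(.)\1", r"\1", token)    # case 1: graphic reduplication
    return token

def lemma_candidate(token):
    norm = normalize(token)
    return VARIANT_LEXICON.get(norm, norm)

print(lemma_candidate("{d}asar-ri"))     # asari
print(lemma_candidate("{d}mu-ul-lil2"))  # enlil
```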

epageperron commented 7 years ago

I have a feeling this type of problem will be much rarer in our administrative corpus. As for the sign list, we have some sign lists and glossaries, and we have also been discussing using the ePSD. I will follow up ASAP on this and get back to you.

Follow-up: See issue #7

khoidt commented 7 years ago

Test 1: CSTlemmatizer in the "deprecated" mode

Update:

See, most recently, Summary 1, and the earlier Update 1 and Update 2 below.

Training data: Four sets of forms supplemented with lemma and POS information; entries are unique.

The form and the lemma are "normalized" Sumerian without sign boundaries and without graphic reduplication of consonants and vowels. Sign indices are removed (1a and 2a) or escaped with Unicode characters replacing the vowel in the sign (1b and 2b).

Testing data: The testing data consists of 10% of the total material of each set.

Evaluation: Results yielding multiple lemmata as variants are considered correct if they include the correct lemma.
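For illustration, a sketch of the two index treatments (mapping index 2 to an acute and index 3 to a grave accent follows the common Assyriological convention; that this is exactly the escaping used in sets 1b and 2b is an assumption):

```python
# Sketch of the two sign-index treatments described above. Mapping index 2
# to an acute and index 3 to a grave accent follows the common Assyriological
# convention; whether sets 1b/2b use exactly this escaping is an assumption.
import re

ACCENTS = {"2": {"a": "á", "e": "é", "i": "í", "u": "ú"},
           "3": {"a": "à", "e": "è", "i": "ì", "u": "ù"}}

def strip_indices(sign):
    """Sets 1a/2a: drop the trailing index, e.g. du3 -> du."""
    return re.sub(r"\d+$", "", sign)

def escape_index(sign):
    """Sets 1b/2b: replace the vowel, e.g. du3 -> dù; higher indices kept."""
    m = re.match(r"(.*?)(\d+)$", sign)
    if not m or m.group(2) not in ACCENTS:
        return sign
    base, idx = m.groups()
    for vowel, accented in ACCENTS[idx].items():
        if vowel in base:
            return base.replace(vowel, accented, 1)
    return base

print(strip_indices("du3"), escape_index("du3"))  # du dù
```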

| Set | Total forms | Correct | Incorrect |
|-----|-------------|---------|-----------|
| 1a | 3109 | 1185 (38.12 %) | 1924 (61.88 %) |
| 1b | 3109 | 1056 (33.97 %) | 2053 (66.03 %) |
| 2a | 691 | 342 (49.49 %) | 349 (50.51 %) |
| 2b | 691 | 305 (44.14 %) | 386 (55.86 %) |

Issues: I was not able to get CSTlemmatizer to include POS information in the results.

khoidt commented 7 years ago

Test 1: CSTlemmatizer in the "deprecated" mode (UPDATE)

See description above.

Update:

See, most recently, Summary 1, and the earlier Update 2 below.

Changes: the data set for testing now contains non-unique entries; stem evaluation and granularity have been added.

Evaluation:

| Set | Total forms | Token gran. | Stem gran. | Correct tokens | Incorrect tokens | Correct stems | Incorrect stems |
|-----|-------------|-------------|------------|----------------|------------------|---------------|-----------------|
| 1a | 3089 | 1.942 | 1.693 | 1050 (33.99 %) | 2039 (66.01 %) | 1147 (37.13 %) | 1942 (62.87 %) |
| 1b | 708 | 1.058 | 0.609 | 344 (48.59 %) | 364 (51.41 %) | 440 (62.15 %) | 268 (37.85 %) |
| 2a | 708 | 1.058 | 0.609 | 344 (48.59 %) | 364 (51.41 %) | 440 (62.15 %) | 268 (37.85 %) |
| 2b | 708 | 1.159 | 0.740 | 328 (46.33 %) | 380 (53.67 %) | 407 (57.49 %) | 301 (42.51 %) |

khoidt commented 6 years ago

Test 1: CSTlemmatizer in the "deprecated" mode (UPDATE 2)

Update:

See Summary 1 below, the description above, and the first update.

Changes:

  1. Add 'sign and determinative' normalization (s_and_d, sets 3a-b): form and lemma are represented only by the first sign and the first determinative (if present); indices are removed, as in norm. See the sketch after this list.
  2. Exclude suppletion and variation between lemma and form (i.e. the form always includes the lemma).
  3. Add 'suggestions' as a parameter (i.e. the average number of lemma variants per form: in total / for forms detected as correct / for forms detected as incorrect).
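A minimal sketch of the s_and_d reduction (assuming the determinative, when present, precedes the first sign, as in the study cases above):

```python
# Sketch: 'sign and determinative' (s_and_d) normalization -- keep only the
# first determinative (if present) and the first sign, stripping the index.
# Assumes the determinative, when present, precedes the first sign.
import re

def s_and_d(token):
    m = re.match(r"(\{[^}]*\})?([^-\s]+)", token)
    if not m:
        return token
    det = m.group(1) or ""
    first_sign = re.sub(r"\d+$", "", m.group(2))
    return det + first_sign

print(s_and_d("{d}mu-ul-lil2"))  # {d}mu
print(s_and_d("gab2-bu-zu"))     # gab
```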

Evaluation:

For each set, the evaluated file is `test_result_<data>` and the control file is `testing_full_data_<data>`.

| Set | Data | Total forms | Token gran. | Stem gran. | Correct tokens | Incorrect tokens | Correct stems | Incorrect stems | Av. sugg. | Av. sugg. (correct) | Av. sugg. (incorrect) |
|-----|------|-------------|-------------|------------|----------------|------------------|---------------|-----------------|-----------|---------------------|-----------------------|
| 1a | norm | 1465 | 1.421 | 1.151 | 605 (41.30 %) | 860 (58.70 %) | 681 (46.48 %) | 784 (53.52 %) | 1.018 | 1.013 | 1.021 |
| 1b | primary_norm | 367 | 0.902 | 0.617 | 193 (52.59 %) | 174 (47.41 %) | 227 (61.85 %) | 140 (38.15 %) | 1.016 | 1.021 | 1.011 |
| 2a | norm_u | 1465 | 1.438 | 1.200 | 601 (41.02 %) | 864 (58.98 %) | 666 (45.46 %) | 799 (54.54 %) | 1.020 | 1.012 | 1.025 |
| 2b | primary_norm_u | 367 | 0.854 | 0.589 | 198 (53.95 %) | 169 (46.05 %) | 231 (62.94 %) | 136 (37.06 %) | 1.008 | 1.005 | 1.012 |
| 3a | s_and_d | 1465 | 0.537 | 0.477 | 953 (65.05 %) | 512 (34.95 %) | 992 (67.71 %) | 473 (32.29 %) | 7.823 | 7.961 | 7.566 |
| 3b | primary_s_and_d | 367 | 0.349 | 0.335 | 272 (74.11 %) | 95 (25.89 %) | 275 (74.93 %) | 92 (25.07 %) | 4.823 | 3.438 | 8.789 |

Remarks:

Note the significantly better results compared to the previous versions of the test. However, the 'sign and determinative' normalization, which shows the highest rate of correct forms, also has a very high rate of ambiguity ('suggestions').

khoidt commented 6 years ago

Test 2: Affixtrain + CSTlemmatizer

Update:

See summary 1 below.

Training data:

Six sets of forms supplemented with lemma and POS information; entries are non-unique. Suppletion and variation between lemma and form are excluded (i.e. the form always includes the lemma).

The form and the lemma are "normalized" Sumerian without sign boundaries and without graphic reduplication of consonants and vowels. Sign indices are removed (1a-b) or escaped with Unicode characters replacing the vowel in the sign (2a-b). Another two sets with 'sign and determinative' normalization (3a-b) have form and lemma represented only by the first sign and the first determinative (if present); indices are removed, as in 1a-b.

Testing data:

The testing data consists of 10% of the total material of each set.

Evaluation:

Results yielding multiple lemmata as variants are considered correct if the correct lemma is among the choices. Tests include token evaluation (forms checked against lemmata), stem evaluation (i.e. lemmata as forms checked against themselves), and the average number of suggestions per token.

For each set, the evaluated file is `test_result_stem_<data>` and the control file is `testing_full_data_<data>`.

| Set | Data | Total forms | Token gran. | Stem gran. | Correct tokens | Incorrect tokens | Correct stems | Incorrect stems | Av. sugg. | Av. sugg. (correct) | Av. sugg. (incorrect) |
|-----|------|-------------|-------------|------------|----------------|------------------|---------------|-----------------|-----------|---------------------|-----------------------|
| 1a | norm | 1465 | 5.029 | 0.002 | 243 (16.59 %) | 1222 (83.41 %) | 1462 (99.80 %) | 3 (0.20 %) | 1.061 | 1.358 | 1.002 |
| 1b | primary_norm | 367 | 2.670 | 0.000 | 100 (27.25 %) | 267 (72.75 %) | 367 (100.00 %) | 0 (0.00 %) | 1.016 | 1.060 | 1.000 |
| 2a | norm_u | 1465 | 4.980 | 0.001 | 245 (16.72 %) | 1220 (83.28 %) | 1463 (99.86 %) | 2 (0.14 %) | 1.061 | 1.355 | 1.002 |
| 2b | primary_norm_u | 367 | 2.707 | 0.000 | 99 (26.98 %) | 268 (73.02 %) | 367 (100.00 %) | 0 (0.00 %) | 1.005 | 1.020 | 1.000 |
| 3a | s_and_d | 1465 | 0.469 | 0.000 | 997 (68.05 %) | 468 (31.95 %) | 1465 (100.00 %) | 0 (0.00 %) | 4.287 | 5.829 | 1.000 |
| 3b | primary_s_and_d | 367 | 0.203 | 0.000 | 305 (83.11 %) | 62 (16.89 %) | 367 (100.00 %) | 0 (0.00 %) | 2.719 | 3.069 | 1.000 |

Remarks:

Note the unexpected drop in correct token forms alongside excellent stem-test results (around 100 % in all tests). The results for sets 3a-b are significantly better than in Test 1 overall and, importantly, show a slightly lower average number of suggestions.

jayanthkmr commented 6 years ago

@epageperron @khoidt Morfessor FlatCat: work in progress.

Setup and training on the small annotation set are done:

```
flatcat-train experiments/akk-data/segmentation.txt --perplexity-threshold 100 --save-binary-model model.pickled --statsfile stats.pickled --stats-annotations experiments/akk-data/annotations.txt
Initializing from segmentation...
INFO:flatcat.io:Reading segmentations from 'experiments/akk-data/segmentation.txt'...
INFO:morfessor.io:Detected utf-8 encoding
INFO:flatcat.io:Done.
INFO:flatcat.categorizationscheme:Setting perplexity-threshold to 100.0
INFO:flatcat.flatcat:Iteration  0 (reestimate_with_unchanged_segmentation).  1/15
INFO:flatcat.flatcat:Segmentation differences:  3 (limit  0). Cost difference: 0.0
INFO:flatcat.flatcat:Iteration  0 (reestimate_with_unchanged_segmentation).  2/15
INFO:flatcat.flatcat:Segmentation differences:  2 (limit  0). Cost difference: -2.1993396743
INFO:flatcat.flatcat:Iteration  0 (reestimate_with_unchanged_segmentation).  3/15
INFO:flatcat.flatcat:Segmentation differences:  0 (limit  0). in iteration  3    (Converged).
INFO:flatcat.io:Reading annotations from 'experiments/akk-data/annotations.txt'...
INFO:flatcat.io:Done.
INFO:flatcat.flatcat:epoch      1/4           Cost:  459.8622.
INFO:flatcat.flatcat:Epoch  1, operation  0 (split), max  1 iteration(s).
INFO:flatcat.flatcat:iteration  1/1           Cost:  459.8622.
.
INFO:flatcat.flatcat:Cost difference    0.0000 (limit 0.025) in iteration  1/1  (converged)
INFO:flatcat.flatcat:final iteration (converged). Cost: 459.862210484
INFO:flatcat.flatcat:Epoch  1, operation  1 (join), max  1 iteration(s).
INFO:flatcat.flatcat:iteration  1/1           Cost:  459.8622.
.
INFO:flatcat.flatcat:Cost difference  -60.6505 (limit 0.025) in iteration  1/1  
INFO:flatcat.flatcat:final iteration (max iterations reached). Cost: 399.211731719
INFO:flatcat.flatcat:Epoch  1, operation  2 (resegment), max  2 iteration(s).
INFO:flatcat.flatcat:iteration  1/2           Cost:  399.2117.

INFO:flatcat.flatcat:Before iteration update. Cost: 399.211731719
INFO:flatcat.flatcat:Cost difference    0.0000 (limit 0.025) in iteration  1/2  (converged)
INFO:flatcat.flatcat:final iteration (converged). Cost: 399.211731719
INFO:flatcat.flatcat:Cost difference  -60.6505 (limit None) in epoch      1/4  (fixed number of epochs)
INFO:flatcat.flatcat:epoch      2/4           Cost:  399.2117.
INFO:flatcat.flatcat:Epoch  2, operation  0 (split), max  1 iteration(s).
INFO:flatcat.flatcat:iteration  1/1           Cost:  399.2117.
.
INFO:flatcat.flatcat:Cost difference    0.0000 (limit 0.025) in iteration  1/1  (converged)
INFO:flatcat.flatcat:final iteration (converged). Cost: 399.211731719
INFO:flatcat.flatcat:Epoch  2, operation  1 (join), max  1 iteration(s).
INFO:flatcat.flatcat:iteration  1/1           Cost:  399.2117.
.
INFO:flatcat.flatcat:Cost difference  -27.9765 (limit 0.025) in iteration  1/1  
INFO:flatcat.flatcat:final iteration (max iterations reached). Cost: 371.235273963
INFO:flatcat.flatcat:Epoch  2, operation  2 (resegment), max  2 iteration(s).
INFO:flatcat.flatcat:iteration  1/2           Cost:  371.2353.

INFO:flatcat.flatcat:Before iteration update. Cost: 371.235273963
INFO:flatcat.flatcat:Cost difference    0.0000 (limit 0.025) in iteration  1/2  (converged)
INFO:flatcat.flatcat:final iteration (converged). Cost: 371.235273963
INFO:flatcat.flatcat:Cost difference  -27.9765 (limit None) in epoch      2/4  (fixed number of epochs)
INFO:flatcat.flatcat:epoch      3/4           Cost:  371.2353.
INFO:flatcat.flatcat:Epoch  3, operation  0 (split), max  1 iteration(s).
INFO:flatcat.flatcat:iteration  1/1           Cost:  371.2353.
.
INFO:flatcat.flatcat:Cost difference   -1.6536 (limit 0.025) in iteration  1/1  
INFO:flatcat.flatcat:final iteration (max iterations reached). Cost: 369.581687673
INFO:flatcat.flatcat:Epoch  3, operation  1 (join), max  1 iteration(s).
INFO:flatcat.flatcat:iteration  1/1           Cost:  369.5817.
.
INFO:flatcat.flatcat:Cost difference    0.0000 (limit 0.025) in iteration  1/1  (converged)
INFO:flatcat.flatcat:final iteration (converged). Cost: 369.581687673
INFO:flatcat.flatcat:Epoch  3, operation  2 (resegment), max  2 iteration(s).
INFO:flatcat.flatcat:iteration  1/2           Cost:  369.5817.

INFO:flatcat.flatcat:Before iteration update. Cost: 369.581687673
INFO:flatcat.flatcat:Cost difference    0.0000 (limit 0.025) in iteration  1/2  (converged)
INFO:flatcat.flatcat:final iteration (converged). Cost: 369.581687673
INFO:flatcat.flatcat:Cost difference   -1.6536 (limit None) in epoch      3/4  (fixed number of epochs)
INFO:flatcat.flatcat:epoch      4/4           Cost:  369.5817.
INFO:flatcat.flatcat:Epoch  4, operation  0 (split), max  1 iteration(s).
INFO:flatcat.flatcat:iteration  1/1           Cost:  369.5817.
.
INFO:flatcat.flatcat:Cost difference    0.0000 (limit 0.025) in iteration  1/1  (converged)
INFO:flatcat.flatcat:final iteration (converged). Cost: 369.581687673
INFO:flatcat.flatcat:Epoch  4, operation  1 (join), max  1 iteration(s).
INFO:flatcat.flatcat:iteration  1/1           Cost:  369.5817.
.
INFO:flatcat.flatcat:Cost difference    0.0000 (limit 0.025) in iteration  1/1  (converged)
INFO:flatcat.flatcat:final iteration (converged). Cost: 369.581687673
INFO:flatcat.flatcat:Epoch  4, operation  2 (resegment), max  2 iteration(s).
INFO:flatcat.flatcat:iteration  1/2           Cost:  369.5817.

INFO:flatcat.flatcat:Before iteration update. Cost: 369.581687673
INFO:flatcat.flatcat:Cost difference    0.0000 (limit 0.025) in iteration  1/2  (converged)
INFO:flatcat.flatcat:final iteration (converged). Cost: 369.581687673
INFO:flatcat.flatcat:Cost difference    0.0000 (limit None) in epoch      4/4  (fixed number of epochs)
Final cost: 369.581687673
Training time: 0.338s
Saving binary model...
INFO:morfessor.io:Saving model to 'model.pickled'...
INFO:morfessor.io:Done.
Done.
```

Sample seen Input:

```
ezida
bamul
lubauke
harasigsig
mundadage
gurece
namjurucju
imajajane
gacananajen
ninjirsukada
```

Sample seen Output:

```
ezida/STM
ba/STM mul/STM
lubau/STM ke/SUF
hara/STM sig/STM sig/STM
mun/STM dadag/STM e/SUF
gur ece/STM
namjuruc/STM ju/SUF
ima/STM jar jane/STM
gaca/STM inana  jen/STM
ninjirsu/STM kada/STM
```

Sample Unseen Input:

```
nubcibgigia
humucingigi
jectugara
jalajuce
muntemga
nusigace
anziba
buluj
mininhule
humuninrig
```

Sample Unseen Output:

```
nubcib/STM gigia/STM
humu/STM cingigi/SUF
jectug/STM ara/STM
jala/STM juce/STM
mun/STM temga/STM
nusiga/STM ce/SUF
anziba/STM
buluj/STM
minin/STM hule/STM
humunin/STM rig/STM
```

To do: Evaluation.
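A possible first step toward that evaluation: a sketch that collapses a FlatCat segmentation line (format as in the samples above) into a candidate stem for comparison against gold lemmata; keeping untagged morphs as stem material is an assumption.

```python
# Sketch: collapse a FlatCat segmentation line into a candidate stem by
# concatenating the STM morphs. Untagged morphs (e.g. 'gur' above) are
# kept as stem material here, which is an assumption.
def flatcat_stem(line):
    morphs = []
    for token in line.split():
        morph, _, tag = token.partition("/")
        if tag in ("STM", ""):
            morphs.append(morph)
    return "".join(morphs)

print(flatcat_stem("lubau/STM ke/SUF"))         # lubau
print(flatcat_stem("mun/STM dadag/STM e/SUF"))  # mundadag
```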

jayanthkmr commented 6 years ago

@epageperron @khoidt I have completed the full data training as well as testing for Morfessor FlatCat. I am still not clear on evaluation, though, as I don't have gold-standard segmentations for the testing dataset. Check your email and Drive in the morning for the experiment code and testing results, and let's set up a meeting to work out how to evaluate the almost perfect segmentations from Morfessor FlatCat. :)

khoidt commented 6 years ago

@jayanthjaiswal Great job! As for the evaluation, the following is crucial:

- Stem granularity: (total number of unique gold (control set) lemmata / total number of unique predicted (test result) lemmata) - 1
- Correct token forms: total number of correctly predicted lemmata (i.e. matching the gold standard): count and percentage
- Incorrect token forms: total number of incorrectly predicted lemmata (i.e. not matching the gold standard): count and percentage
- Correct stem forms: total number of correctly predicted lemmata in the stem test (i.e. the lemma given as the form): count and percentage
- Incorrect stem forms: total number of incorrectly predicted lemmata in the stem test (i.e. the lemma given as the form): count and percentage
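A minimal sketch of these metrics in Python (assuming, per token, a list of suggested lemmata and the corresponding gold lemma; a prediction with several variants counts as correct when the gold lemma is among them, as in the tests above):

```python
# Sketch of the metrics above: `predicted` holds, per token, the list of
# suggested lemmata; `gold` holds the corresponding control-set lemma.
def evaluate(predicted, gold):
    total = len(gold)
    correct = sum(1 for variants, g in zip(predicted, gold) if g in variants)
    incorrect = total - correct
    flat = [v for variants in predicted for v in variants]
    # granularity = unique gold lemmata / unique predicted lemmata - 1
    granularity = len(set(gold)) / len(set(flat)) - 1
    return {
        "total forms": total,
        "correct": (correct, 100 * correct / total),
        "incorrect": (incorrect, 100 * incorrect / total),
        "stem granularity": granularity,
    }

print(evaluate([["enlil"], ["gar", "jar"]], ["enlil", "gar"]))
```

The stem-test figures come from the same routine, run with the gold lemmata fed in as forms.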

If you have any questions on the evaluation, write me a PM on Slack. I'm available most of the time.

khoidt commented 6 years ago

Test 3: Marmot

Update:

See summary 1 below.

Training data:

Six sets of forms supplemented with lemma and POS information; entries are non-unique. Suppletion and variation between lemma and form are excluded (i.e. the form always includes the lemma).

The form and the lemma are "normalized" Sumerian without sign boundaries and without graphic reduplication of consonants and vowels. Sign indices are removed (1a-b) or escaped with Unicode characters replacing the vowel in the sign (2a-b). Another two sets with 'sign and determinative' normalization (3a-b) have form and lemma represented only by the first sign and the first determinative (if present); indices are removed, as in 1a-b.

Testing data:

The testing data consists of 10% of the total material of each set.

Evaluation:

Tests include token evaluation (forms checked against lemmata) and stem evaluation (i.e. lemmata as forms checked against themselves).
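For reference, a sketch of deriving the stem-test input from the token-test data (the tab-separated form/lemma layout and the column order are assumptions; file names follow the sets below):

```python
# Sketch: derive stem-test input from the token-test file by feeding each
# gold lemma back in as the form ("lemmata checked against themselves").
# The tab-separated form<TAB>lemma layout is an assumption.
with open("testing_full_data_norm", encoding="utf-8") as src, \
        open("stem_test_norm", "w", encoding="utf-8") as dst:
    for line in src:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 2:
            lemma = cols[1]
            dst.write(f"{lemma}\t{lemma}\n")
```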

For each set, the evaluated file is `test_result_<data>` and the control file is `testing_full_data_<data>`.

| Set | Data | Total forms | Stem gran. | Correct tokens | Incorrect tokens | Correct stems | Incorrect stems |
|-----|------|-------------|------------|----------------|------------------|---------------|-----------------|
| 1a | norm | 1465 | 2.661 | 603 (41.16 %) | 862 (58.84 %) | 860 (58.70 %) | 605 (41.30 %) |
| 1b | primary_norm | 367 | 1.305 | 107 (29.16 %) | 260 (70.84 %) | 214 (58.31 %) | 153 (41.69 %) |
| 2a | norm_u | 1465 | 2.699 | 608 (41.50 %) | 857 (58.50 %) | 858 (58.57 %) | 607 (41.43 %) |
| 2b | primary_norm_u | 367 | 1.341 | 106 (28.88 %) | 261 (71.12 %) | 213 (58.04 %) | 154 (41.96 %) |
| 3a | s_and_d | 1465 | 0.579 | 855 (58.36 %) | 610 (41.64 %) | 919 (62.73 %) | 546 (37.27 %) |
| 3b | primary_s_and_d | 367 | 0.378 | 258 (70.30 %) | 109 (29.70 %) | 274 (74.66 %) | 93 (25.34 %) |

Remarks:

The results appear somewhat worse than those obtained with CSTlemmatizer.

khoidt commented 6 years ago

Summary 1. CSTLemma and Marmot Tests: Corrections and Summary

Changes:

Data: The attached evaluation_cstlemma_marmot.zip contains:

Remarks: @jayanthjaiswal, it would be great if you could use the same JSON data structure, so that your evaluations can be integrated easily when they are ready. You might also find useful some of the updates in the evaluation code of the previous tests (the last one, Marmot, would be the best example) on our Google Drive.

epageperron commented 6 years ago

Are we done with this issue?

khoidt commented 6 years ago

I still have to update it with the new standard parallel corpus.

epageperron commented 5 years ago

Update? @khoidt

epageperron commented 5 years ago

@khoidt @jayanthjaiswal is it possible to update this issue? Thanks!