This comment lists cases (with examples) in which a simple match between a form and a lemma is not possible. The list makes no claim to completeness, but tentatively covers the most frequent cases:
1. {d}asar-ri: asari; gab2-bu-zu: gabu2
2. ku-še: kuš2; gu2-un: gun2; u8-a: u5-a; {gi}ub4-zal-zu: ub-zal; {ŋiš}ildag3: ildag2; esir-bi: esir2; a-ah: ah6
3. {d}mu-ul-lil2: en-lil2; {d}na-zi: nanše; šu-ru-ug-ga: šarag; {ŋiš}i-ri9-na-zu: erina8
4. lu2-lul: lu2-lul-la
5. {d}meš3-ki-aŋ2-{d}nanna: meš3-ki-aŋ2-nanna
6. li-mu-um-ta-am3: 1000
7. im-ma-ni-ŋa2-ŋa2: ŋar; na-ma-tum3: de6; dur2-ru-na-ba: tuš
8. šar2-ra-ab-du: šar2-ra-ab-DU; niŋ2-ur2-4-e: niŋ2-ur2-limmu2
While in cases 1-2 and 5 the token's lemma can easily be identified after merging the signs or removing the determinatives, the others pose a more complex problem. A possible solution would be a lexicon to match variants and suppletion (esp. cases 3, 6 and 7) and a sign list to detect transliteration inconsistencies (esp. case 8); a toy sketch follows below.
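A minimal sketch of how such lookups could be layered (the function names and the toy lexicon are illustrative, not project code; the lexicon entries are taken from the examples above):

```python
import re

# Toy lexicon for variants and suppletion (cases 3, 6 and 7);
# entries are illustrative, taken from the examples above.
VARIANT_LEXICON = {
    "mu-ul-lil2": "en-lil2",    # Emesal variant (case 3)
    "li-mu-um-ta-am3": "1000",  # number word (case 6)
    "na-ma-tum3": "de6",        # verbal suppletion (case 7)
}

def strip_determinatives(form):
    """Remove determinatives such as {d}, {gi}, {ŋiš} (case 5)."""
    return re.sub(r"\{[^}]*\}", "", form)

def merge_signs(form):
    """Join sign-separated writings into one string (cases 1-2)."""
    return form.replace("-", "")

def guess_lemma(form):
    base = strip_determinatives(form)
    if base in VARIANT_LEXICON:
        return VARIANT_LEXICON[base]
    # e.g. {d}asar-ri -> "asarri"; a fuzzy match against a lemma list
    # would still be needed to collapse the doubled consonant.
    return merge_signs(base)

print(guess_lemma("{d}mu-ul-lil2"))  # en-lil2
print(guess_lemma("{d}asar-ri"))     # asarri (≈ asari)
```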
I have a feeling this type of problem will be much rarer in our administrative corpus. As for the sign list, we have some sign lists and glossaries, and we have also been discussing using the ePSD. I will follow up ASAP on this and get back to you.
Follow-up: See issue #7
See, most recently, summary 1, as well as the earlier update 1 and update 2 below.
Training data: Four sets of forms supplemented with lemma and POS information; entries are unique.
The form and the lemma are "normalized" Sumerian without sign boundaries and without graphic reduplication of consonants and vowels. Sign indices are either removed (1a and 2a) or escaped with Unicode characters replacing the vowel in the sign (1b and 2b); see the sketch after this list.
Training/testing split: the training data consists of 10% of the total material of each set.
Evaluation: results yielding multiple lemmata as variants are considered correct if they include the correct lemma.
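A rough sketch of the two treatments of sign indices (the concrete escape characters used in the actual data preparation are not documented in this thread, so the mapping below is purely illustrative):

```python
import re

# Hypothetical escape table: encode the sign index into the sign's
# vowel so that e.g. gu2 and gu remain distinct without the digit.
INDEX_ESCAPES = {("u", "2"): "ú", ("u", "3"): "ù", ("e", "2"): "é"}

def strip_indices(form):
    """Sets 1a and 2a: drop sign indices entirely (gu2-un -> gu-un)."""
    return re.sub(r"\d+", "", form)

def escape_indices(form):
    """Sets 1b and 2b: replace the vowel of an indexed sign with an
    index-specific Unicode character, then drop the digit."""
    def repl(match):
        sign, idx = match.group(1), match.group(2)
        for i in reversed(range(len(sign))):
            if sign[i] in "aeiu":
                esc = INDEX_ESCAPES.get((sign[i], idx), sign[i])
                return sign[:i] + esc + sign[i + 1:]
        return sign
    return re.sub(r"([^\s\d-]+)(\d+)", repl, form)

print(strip_indices("gu2-un"))   # gu-un
print(escape_indices("gu2-un"))  # gú-un
```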
| Set | Total forms | Correct | Incorrect |
| --- | --- | --- | --- |
| 1a | 3109 | 1185 (38.12 %) | 1924 (61.88 %) |
| 1b | 3109 | 1056 (33.97 %) | 2053 (66.03 %) |
| 2a | 691 | 342 (49.49 %) | 349 (50.51 %) |
| 2b | 691 | 305 (44.14 %) | 386 (55.86 %) |
Issues: I was not able to get CSTlemmatizer to include POS information in the results.
See description above.
See, most recently, summary 1 and the earlier update 2 below.
Changes: the data set for testing contains non-unique entries; stem evaluation and granularity have been added.
Evaluation:

| Set | Total forms | Token gran. | Stem gran. | Correct tokens | Incorrect tokens | Correct stems | Incorrect stems |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1a | 3089 | 1.94 | 1.69 | 1050 (33.99 %) | 2039 (66.01 %) | 1147 (37.13 %) | 1942 (62.87 %) |
| 1b | 708 | 1.06 | 0.61 | 344 (48.59 %) | 364 (51.41 %) | 440 (62.15 %) | 268 (37.85 %) |
| 2a | 708 | 1.06 | 0.61 | 344 (48.59 %) | 364 (51.41 %) | 440 (62.15 %) | 268 (37.85 %) |
| 2b | 708 | 1.16 | 0.74 | 328 (46.33 %) | 380 (53.67 %) | 407 (57.49 %) | 301 (42.51 %) |
See summary 1 below, the description above, and the first update.
Changes: added a new 'sign and determinative' normalization (`s_and_d`, sets 3a-b): form and lemma are represented only by the first sign and the first determinative (if present); indices are removed, as in `norm`.
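A minimal sketch of this reduction, assuming it is applied to the hyphenated transliteration before sign boundaries are removed (the function name is mine):

```python
import re

def sign_and_determinative(form):
    """Reduce a form to its first determinative (if present) plus its
    first sign, with indices removed as in the `norm` sets."""
    det = ""
    match = re.match(r"\{[^}]*\}", form)
    if match:
        det = match.group(0)
        form = form[match.end():]
    first_sign = re.sub(r"\d+", "", form.split("-")[0])
    return det + first_sign

print(sign_and_determinative("{ŋiš}ildag3"))  # {ŋiš}ildag
print(sign_and_determinative("gu2-un"))       # gu
```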
Evaluation (control files are the corresponding `testing_full_data_*` files):

| Set | Evaluated file | Total forms | Token gran. | Stem gran. | Correct tokens | Incorrect tokens | Correct stems | Incorrect stems | Av. sugg. | Av. sugg. (correct) | Av. sugg. (incorrect) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1a | test_result_norm | 1465 | 1.42 | 1.15 | 605 (41.30 %) | 860 (58.70 %) | 681 (46.48 %) | 784 (53.52 %) | 1.018 | 1.013 | 1.021 |
| 1b | test_result_primary_norm | 367 | 0.90 | 0.62 | 193 (52.59 %) | 174 (47.41 %) | 227 (61.85 %) | 140 (38.15 %) | 1.016 | 1.021 | 1.011 |
| 2a | test_result_norm_u | 1465 | 1.44 | 1.20 | 601 (41.02 %) | 864 (58.98 %) | 666 (45.46 %) | 799 (54.54 %) | 1.020 | 1.012 | 1.025 |
| 2b | test_result_primary_norm_u | 367 | 0.85 | 0.59 | 198 (53.95 %) | 169 (46.05 %) | 231 (62.94 %) | 136 (37.06 %) | 1.008 | 1.005 | 1.012 |
| 3a | test_result_s_and_d | 1465 | 0.54 | 0.48 | 953 (65.05 %) | 512 (34.95 %) | 992 (67.71 %) | 473 (32.29 %) | 7.823 | 7.961 | 7.566 |
| 3b | test_result_primary_s_and_d | 367 | 0.35 | 0.33 | 272 (74.11 %) | 95 (25.89 %) | 275 (74.93 %) | 92 (25.07 %) | 4.823 | 3.438 | 8.789 |
Remarks:
Note the significantly better results in comparison to the previous versions of the test. However, the 'sign and determinative' normalization, which shows the highest rate of correct forms, also has a very high rate of ambiguity ('suggestions').
See summary 1 below.
Six sets of forms supplemented with lemma and POS information; entries are non-unique. Suppletion and variation between lemma and form are excluded (i.e. the form always includes the lemma).
The form and the lemma are "normalized" Sumerian without sign boundaries and without graphic reduplication of consonants and vowels. Sign indices are removed (1a-b) or escaped with Unicode characters replacing the vowel in the sign (2a-b). Another two sets with 'sign and determinative' normalization (3a-b) have form and lemma represented only by the first sign and the first determinative (if present); indices are removed, as in 1a-b.
Training data consists of 10% of the total material of each set.
Results yielding multiple lemmata as variants are considered correct if they have the correct lemma among the choices. Tests include token evaluation (forms checked against lemmata), stem evaluation (lemmata fed in as forms and checked against themselves), and the average number of suggestions per token; a sketch of the scoring rule follows.
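A sketch of that scoring rule and of the suggestion count (helper names are mine; each prediction is assumed to be a list of candidate lemmata):

```python
def is_correct(suggested_lemmata, gold_lemma):
    """A result with several variant lemmata counts as correct as
    soon as the gold lemma is among the choices."""
    return gold_lemma in suggested_lemmata

def average_suggestions(predictions):
    """Average number of suggested lemmata per token."""
    return sum(len(p) for p in predictions) / len(predictions)

# e.g. two tokens, the first with two suggested variants:
preds = [["kuš2", "kuš"], ["gun2"]]
gold = ["kuš2", "gun2"]
print(sum(is_correct(p, g) for p, g in zip(preds, gold)))  # 2
print(average_suggestions(preds))                          # 1.5
```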
Control files are the corresponding `testing_full_data_*` files.

| Set | Evaluated file | Total forms | Token gran. | Stem gran. | Correct tokens | Incorrect tokens | Correct stems | Incorrect stems | Av. sugg. | Av. sugg. (correct) | Av. sugg. (incorrect) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1a | test_result_stem_norm | 1465 | 5.03 | 0.002 | 243 (16.59 %) | 1222 (83.41 %) | 1462 (99.80 %) | 3 (0.20 %) | 1.061 | 1.358 | 1.002 |
| 1b | test_result_stem_primary_norm | 367 | 2.67 | 0.0 | 100 (27.25 %) | 267 (72.75 %) | 367 (100.0 %) | 0 (0.0 %) | 1.016 | 1.060 | 1.0 |
| 2a | test_result_stem_norm_u | 1465 | 4.98 | 0.001 | 245 (16.72 %) | 1220 (83.28 %) | 1463 (99.86 %) | 2 (0.14 %) | 1.061 | 1.355 | 1.002 |
| 2b | test_result_stem_primary_norm_u | 367 | 2.71 | 0.0 | 99 (26.98 %) | 268 (73.02 %) | 367 (100.0 %) | 0 (0.0 %) | 1.005 | 1.020 | 1.0 |
| 3a | test_result_stem_s_and_d | 1465 | 0.47 | 0.0 | 997 (68.05 %) | 468 (31.95 %) | 1465 (100.0 %) | 0 (0.0 %) | 4.287 | 5.829 | 1.0 |
| 3b | test_result_stem_primary_s_and_d | 367 | 0.20 | 0.0 | 305 (83.11 %) | 62 (16.89 %) | 367 (100.0 %) | 0 (0.0 %) | 2.719 | 3.069 | 1.0 |
Note the unexpected drop in correct token forms alongside excellent stem test results (around 100 % in all tests). The results for sets 3a-b are significantly better than in test 1 overall and, importantly, show a slightly lower average number of suggestions.
@epageperron @khoidt Morfessor: work in progress.
Setup and training on the small annotation set are done:
```
flatcat-train experiments/akk-data/segmentation.txt --perplexity-threshold 100 --save-binary-model model.pickled --statsfile stats.pickled --stats-annotations experiments/akk-data/annotations.txt
```

```
Initializing from segmentation...
INFO:flatcat.io:Reading segmentations from 'experiments/akk-data/segmentation.txt'...
INFO:morfessor.io:Detected utf-8 encoding
INFO:flatcat.io:Done.
INFO:flatcat.categorizationscheme:Setting perplexity-threshold to 100.0
INFO:flatcat.flatcat:Iteration 0 (reestimate_with_unchanged_segmentation). 1/15
INFO:flatcat.flatcat:Segmentation differences: 3 (limit 0). Cost difference: 0.0
INFO:flatcat.flatcat:Iteration 0 (reestimate_with_unchanged_segmentation). 2/15
INFO:flatcat.flatcat:Segmentation differences: 2 (limit 0). Cost difference: -2.1993396743
INFO:flatcat.flatcat:Iteration 0 (reestimate_with_unchanged_segmentation). 3/15
INFO:flatcat.flatcat:Segmentation differences: 0 (limit 0). in iteration 3 (Converged).
INFO:flatcat.io:Reading annotations from 'experiments/akk-data/annotations.txt'...
INFO:flatcat.io:Done.
INFO:flatcat.flatcat:epoch 1/4 Cost: 459.8622.
INFO:flatcat.flatcat:Epoch 1, operation 0 (split), max 1 iteration(s).
INFO:flatcat.flatcat:iteration 1/1 Cost: 459.8622.
.
INFO:flatcat.flatcat:Cost difference 0.0000 (limit 0.025) in iteration 1/1 (converged)
INFO:flatcat.flatcat:final iteration (converged). Cost: 459.862210484
INFO:flatcat.flatcat:Epoch 1, operation 1 (join), max 1 iteration(s).
INFO:flatcat.flatcat:iteration 1/1 Cost: 459.8622.
.
INFO:flatcat.flatcat:Cost difference -60.6505 (limit 0.025) in iteration 1/1
INFO:flatcat.flatcat:final iteration (max iterations reached). Cost: 399.211731719
INFO:flatcat.flatcat:Epoch 1, operation 2 (resegment), max 2 iteration(s).
INFO:flatcat.flatcat:iteration 1/2 Cost: 399.2117.
INFO:flatcat.flatcat:Before iteration update. Cost: 399.211731719
INFO:flatcat.flatcat:Cost difference 0.0000 (limit 0.025) in iteration 1/2 (converged)
INFO:flatcat.flatcat:final iteration (converged). Cost: 399.211731719
INFO:flatcat.flatcat:Cost difference -60.6505 (limit None) in epoch 1/4 (fixed number of epochs)
INFO:flatcat.flatcat:epoch 2/4 Cost: 399.2117.
INFO:flatcat.flatcat:Epoch 2, operation 0 (split), max 1 iteration(s).
INFO:flatcat.flatcat:iteration 1/1 Cost: 399.2117.
.
INFO:flatcat.flatcat:Cost difference 0.0000 (limit 0.025) in iteration 1/1 (converged)
INFO:flatcat.flatcat:final iteration (converged). Cost: 399.211731719
INFO:flatcat.flatcat:Epoch 2, operation 1 (join), max 1 iteration(s).
INFO:flatcat.flatcat:iteration 1/1 Cost: 399.2117.
.
INFO:flatcat.flatcat:Cost difference -27.9765 (limit 0.025) in iteration 1/1
INFO:flatcat.flatcat:final iteration (max iterations reached). Cost: 371.235273963
INFO:flatcat.flatcat:Epoch 2, operation 2 (resegment), max 2 iteration(s).
INFO:flatcat.flatcat:iteration 1/2 Cost: 371.2353.
INFO:flatcat.flatcat:Before iteration update. Cost: 371.235273963
INFO:flatcat.flatcat:Cost difference 0.0000 (limit 0.025) in iteration 1/2 (converged)
INFO:flatcat.flatcat:final iteration (converged). Cost: 371.235273963
INFO:flatcat.flatcat:Cost difference -27.9765 (limit None) in epoch 2/4 (fixed number of epochs)
INFO:flatcat.flatcat:epoch 3/4 Cost: 371.2353.
INFO:flatcat.flatcat:Epoch 3, operation 0 (split), max 1 iteration(s).
INFO:flatcat.flatcat:iteration 1/1 Cost: 371.2353.
.
INFO:flatcat.flatcat:Cost difference -1.6536 (limit 0.025) in iteration 1/1
INFO:flatcat.flatcat:final iteration (max iterations reached). Cost: 369.581687673
INFO:flatcat.flatcat:Epoch 3, operation 1 (join), max 1 iteration(s).
INFO:flatcat.flatcat:iteration 1/1 Cost: 369.5817.
.
INFO:flatcat.flatcat:Cost difference 0.0000 (limit 0.025) in iteration 1/1 (converged)
INFO:flatcat.flatcat:final iteration (converged). Cost: 369.581687673
INFO:flatcat.flatcat:Epoch 3, operation 2 (resegment), max 2 iteration(s).
INFO:flatcat.flatcat:iteration 1/2 Cost: 369.5817.
INFO:flatcat.flatcat:Before iteration update. Cost: 369.581687673
INFO:flatcat.flatcat:Cost difference 0.0000 (limit 0.025) in iteration 1/2 (converged)
INFO:flatcat.flatcat:final iteration (converged). Cost: 369.581687673
INFO:flatcat.flatcat:Cost difference -1.6536 (limit None) in epoch 3/4 (fixed number of epochs)
INFO:flatcat.flatcat:epoch 4/4 Cost: 369.5817.
INFO:flatcat.flatcat:Epoch 4, operation 0 (split), max 1 iteration(s).
INFO:flatcat.flatcat:iteration 1/1 Cost: 369.5817.
.
INFO:flatcat.flatcat:Cost difference 0.0000 (limit 0.025) in iteration 1/1 (converged)
INFO:flatcat.flatcat:final iteration (converged). Cost: 369.581687673
INFO:flatcat.flatcat:Epoch 4, operation 1 (join), max 1 iteration(s).
INFO:flatcat.flatcat:iteration 1/1 Cost: 369.5817.
.
INFO:flatcat.flatcat:Cost difference 0.0000 (limit 0.025) in iteration 1/1 (converged)
INFO:flatcat.flatcat:final iteration (converged). Cost: 369.581687673
INFO:flatcat.flatcat:Epoch 4, operation 2 (resegment), max 2 iteration(s).
INFO:flatcat.flatcat:iteration 1/2 Cost: 369.5817.
INFO:flatcat.flatcat:Before iteration update. Cost: 369.581687673
INFO:flatcat.flatcat:Cost difference 0.0000 (limit 0.025) in iteration 1/2 (converged)
INFO:flatcat.flatcat:final iteration (converged). Cost: 369.581687673
INFO:flatcat.flatcat:Cost difference 0.0000 (limit None) in epoch 4/4 (fixed number of epochs)
Final cost: 369.581687673
Training time: 0.338s
Saving binary model...
INFO:morfessor.io:Saving model to 'model.pickled'...
INFO:morfessor.io:Done.
Done.
```
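The samples below presumably come from applying the saved model to word lists; with the FlatCat command-line tools, segmenting new words would look roughly like the following (invocation from memory and the input file name is hypothetical; check `flatcat-segment --help` for the exact options):

```
flatcat-segment model.pickled experiments/akk-data/test_words.txt -o segmented.txt
```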
Sample input (seen words):

```
ezida
bamul
lubauke
harasigsig
mundadage
gurece
namjurucju
imajajane
gacananajen
ninjirsukada
```

Sample output (seen words):

```
ezida/STM
ba/STM mul/STM
lubau/STM ke/SUF
hara/STM sig/STM sig/STM
mun/STM dadag/STM e/SUF
gur ece/STM
namjuruc/STM ju/SUF
ima/STM jar jane/STM
gaca/STM inana jen/STM
ninjirsu/STM kada/STM
```

Sample input (unseen words):

```
nubcibgigia
humucingigi
jectugara
jalajuce
muntemga
nusigace
anziba
buluj
mininhule
humuninrig
```

Sample output (unseen words):

```
nubcib/STM gigia/STM
humu/STM cingigi/SUF
jectug/STM ara/STM
jala/STM juce/STM
mun/STM temga/STM
nusiga/STM ce/SUF
anziba/STM
buluj/STM
minin/STM hule/STM
humunin/STM rig/STM
```

To do: evaluation.
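For the evaluation step, the tagged output can be parsed back into (morph, category) pairs; concatenating the STM morphs gives a crude lemma candidate (that heuristic is my suggestion, not an established rule):

```python
def parse_analysis(line):
    """'lubau/STM ke/SUF' -> [('lubau', 'STM'), ('ke', 'SUF')]."""
    pairs = []
    for chunk in line.split():
        morph, _, category = chunk.partition("/")
        pairs.append((morph, category or "?"))  # untagged morphs get '?'
    return pairs

def stem_guess(line):
    """Concatenate all STM morphs as a rough lemma candidate."""
    return "".join(m for m, c in parse_analysis(line) if c == "STM")

print(stem_guess("lubau/STM ke/SUF"))          # lubau
print(stem_guess("hara/STM sig/STM sig/STM"))  # harasigsig
```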
@epageperron @khoidt I have completed the full data training as well as testing for Morfessor FlatCat. I am still not clear on the evaluation, though, as I don't have gold-standard segmentations for the testing dataset. Check your email and Drive in the morning for the experiment code and testing results, and let's set up a meeting to wrap up how to evaluate the almost perfect segmentations from Morfessor FlatCat. :)
@jayanthjaiswal Great job! As for the evaluation, the following is crucial:
- Stem granularity: (total number of unique gold (control set) lemmata / total number of unique predicted (test result) lemmata) - 1
- Correct token forms: total number of correctly predicted lemmata (i.e. corresponding with the gold standard), as count and percentage
- Incorrect token forms: total number of incorrectly predicted lemmata (i.e. not corresponding with the gold standard), as count and percentage
- Correct stem forms: total number of correctly predicted lemmata in the stem test (i.e. the lemma given as the form), as count and percentage
- Incorrect stem forms: total number of incorrectly predicted lemmata in the stem test (i.e. the lemma given as the form), as count and percentage
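A sketch of these metrics in code (names are mine; `gold` is a list of gold lemmata, `predicted` a parallel list of suggestion lists):

```python
def stem_granularity(gold, predicted):
    """(unique gold lemmata / unique predicted lemmata) - 1."""
    unique_predicted = {lemma for choices in predicted for lemma in choices}
    return len(set(gold)) / len(unique_predicted) - 1

def token_scores(gold, predicted):
    """Counts and percentages of correct/incorrect predictions; the
    stem test uses the same function with lemmata fed in as forms."""
    correct = sum(g in p for g, p in zip(gold, predicted))
    total = len(gold)
    return {
        "correct": (correct, 100.0 * correct / total),
        "incorrect": (total - correct, 100.0 * (total - correct) / total),
    }
```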
If you have any questions on the evaluation, write me a PM on Slack. I'm available most of the time.
See summary 1 below.
Six sets of forms supplemented with lemma and POS information; entries are non-unique. Suppletion and variation between lemma and form are excluded (i.e. the form always includes the lemma).
The form and the lemma are "normalized" Sumerian without sign boundaries and without graphic reduplication of consonants and vowels. Sign indices are removed (1a-b) or escaped with Unicode characters replacing the vowel in the sign (2a-b). Another two sets with 'sign and determinative' normalization (3a-b) have form and lemma represented only by the first sign and the first determinative (if present); indices are removed, as in 1a-b.
Training data consists of 10% of the total material of each set.
Tests include token evaluation (forms checked against lemmata) and stem evaluation (lemmata fed in as forms and checked against themselves).
Control files are the corresponding `testing_full_data_*` files.

| Set | Evaluated file | Total forms | Stem gran. | Correct tokens | Incorrect tokens | Correct stems | Incorrect stems |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1a | test_result_norm | 1465 | 2.66 | 603 (41.16 %) | 862 (58.84 %) | 860 (58.70 %) | 605 (41.30 %) |
| 1b | test_result_primary_norm | 367 | 1.30 | 107 (29.16 %) | 260 (70.84 %) | 214 (58.31 %) | 153 (41.69 %) |
| 2a | test_result_norm_u | 1465 | 2.70 | 608 (41.50 %) | 857 (58.50 %) | 858 (58.57 %) | 607 (41.43 %) |
| 2b | test_result_primary_norm_u | 367 | 1.34 | 106 (28.88 %) | 261 (71.12 %) | 213 (58.04 %) | 154 (41.96 %) |
| 3a | test_result_s_and_d | 1465 | 0.58 | 855 (58.36 %) | 610 (41.64 %) | 919 (62.73 %) | 546 (37.27 %) |
| 3b | test_result_primary_s_and_d | 367 | 0.38 | 258 (70.30 %) | 109 (29.70 %) | 274 (74.66 %) | 93 (25.34 %) |
The results appear somewhat worse than those obtained with CSTlemmatizer.
Changes: added `reports.py` to independently import (JSON), print, and export reports (CSV and JSON).
Data: the attached evaluation_cstlemma_marmot.zip contains:
- `eval_reports_all`
- `reports.py`
- `printed_output.txt` (yes, I should finally admit that at this point this issue contains far too much plainly posted evaluation data :see_no_evil:)
Remarks: @jayanthjaiswal, it would be great if you could use the same JSON data structure to easily integrate your evaluations when they are ready. You might also find useful some of the updates in the evaluation code of the previous tests (the last one, Marmot, would be the best example) on our Google Drive.
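The description suggests a structure roughly like the following (a sketch, not the actual reports.py; the JSON layout, a dict of per-set metric dicts, is my assumption):

```python
import csv
import json

def load_reports(path):
    """Import evaluation reports from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def print_reports(reports):
    """Print each evaluation set with its metrics."""
    for name, metrics in reports.items():
        print(name)
        for key, value in metrics.items():
            print(f"  {key}: {value}")

def export_csv(reports, path):
    """Flatten the reports into one CSV row per evaluation set."""
    fieldnames = ["set"] + sorted({k for m in reports.values() for k in m})
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for name, metrics in reports.items():
            writer.writerow({"set": name, **metrics})
```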
Are we done with this issue?
I still have to update it with the new standard parallel corpus.
Update? @khoidt
@khoidt @jayanthjaiswal is it possible to update this issue? Thanks!
Summary
For the project, we need to test tools for automated lemmatization, stemming and morphology induction. Please try to get them running on the ETCSL CoNLL data and evaluate them against it. If you run into difficulties, you can ask Kathrin for help (she also prefers Python); if that doesn't help, try the next tool on the list. Note that our top priority is to eliminate morphological variation, not so much to produce linguistically plausible lemmas.
Tasks
Notes
CSTlemmatizer, Affixtrain and Marmot are supervised; they train on WORD + LEMMA (+ POS + MORPH) tuples, as illustrated below.
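For reference, such supervised input is typically a tab-separated list of word, lemma and optional POS/morphology columns, one token per line; an illustrative example (the tags are invented for the illustration, and each tool's documentation prescribes its own exact format):

```
lugal-e	lugal	N	ERG
mu-na-du3	du3	V	3-SG
```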
Roadmap Data
🗓 Start Date: 08-01-2017
🗓 Expected Date: 11-15-2017
💪 Label: wp
📈 Progress (0-1): 0.4
See Gantt: http://cdli-dev.org/gantt/mtaac_work/