PonteIneptique / latin-lasla-models

Repository for LASLA Latin models
Mozilla Public License 2.0
4 stars 0 forks source link

Pie Latin LASLA Models

DOI

Repository for LASLA Latin models: the models were fine-tuned by Thibault Clérice, data are based on LASLA data but some adaptation might be found.

Download models

Check latest release, under assets

Information about the model

Note: the model is currently being fine-tuned in the context of my PhD. I'll fill this part when it will be done.

The training set is roughly 1.5M tokens, dev test roughly 10k and test 169822. This is not counting punctuation, as LASLA data are lacking punctuation.

Scores

Model LASLA

More details:

lemma

accuracy precision recall support
all 0.9734 0.8216 0.8196 169822
known-tokens 0.9785 0.907 0.907 161674
unknown-tokens 0.8716 0.7172 0.7153 8148
ambiguous-tokens 0.9292 0.7114 0.7171 41561
unknown-targets 0.4775 0.3136 0.3115 1131

pos

accuracy precision recall support
all 0.9651 0.8794 0.8669 169822
known-tokens 0.9672 0.8808 0.8703 161674
unknown-tokens 0.9232 0.6979 0.6511 8148
ambiguous-tokens 0.91 0.8234 0.784 52129

Gend

accuracy precision recall support
all 0.965 0.9166 0.9203 169822
known-tokens 0.9673 0.9198 0.9248 161674
unknown-tokens 0.9201 0.8673 0.8543 8148
ambiguous-tokens 0.868 0.8652 0.8747 34690

Numb

accuracy precision recall support
all 0.9719 0.9705 0.9697 169822
known-tokens 0.9731 0.9716 0.9705 161674
unknown-tokens 0.9482 0.9224 0.9358 8148
ambiguous-tokens 0.9042 0.9013 0.8979 38122

Case

accuracy precision recall support
all 0.9219 0.8811 0.8177 169822
known-tokens 0.9244 0.8865 0.8237 161674
unknown-tokens 0.8719 0.6896 0.6738 8148
ambiguous-tokens 0.8296 0.8196 0.7667 63352

Deg

accuracy precision recall support
all 0.9813 0.9694 0.971 169822
known-tokens 0.9832 0.9711 0.9746 161674
unknown-tokens 0.9434 0.9345 0.9149 8148
ambiguous-tokens 0.9186 0.906 0.9258 27870

Mood_Tense_Voice

accuracy precision recall support
all 0.9831 0.7845 0.7355 169822
known-tokens 0.9868 0.8039 0.7632 161674
unknown-tokens 0.91 0.6172 0.5863 8148
ambiguous-tokens 0.924 0.6879 0.675 16963

Person

accuracy precision recall support
all 0.9971 0.9875 0.9772 169822
known-tokens 0.9978 0.989 0.9814 161674
unknown-tokens 0.9834 0.9762 0.9536 8148
ambiguous-tokens 0.9768 0.9391 0.9068 10040

Dis

accuracy precision recall support
all 0.9727 0.879 0.8797 169822
known-tokens 0.9738 0.8803 0.8823 161674
unknown-tokens 0.9519 0.6651 0.5761 8148
ambiguous-tokens 0.912 0.8603 0.8649 41821

Score on other corpora

Glaise, Part 2

Definition to come. POS has no NOMcom vs. NOMpro (needs to be fixed in a later training.)

Task Accuracy Accuracy on V != _
lemma 94.73 94.73
Deg 97.64 93.46
Numb 96.84 96.28
Person 99.85 99.75
Mood_Tense_Voice 98.18 93.77
Case 93.05 88.96
Gend 96.29 89.93
pos 65.11 65.11

Model LASLA Plus

The model LASLA+ is trained on additionnal data, creating some noise in the original dataset and resulting in apparently worse results on classical data (approxim. -0.3%). It's results are detailed in LASLA-plus.md.

More details:

task score All - - Known tokens - - Unknown tokens - - Ambiguous tokens - -
task score Acc Pre Rec Acc Pre Rec Acc Pre Rec Acc Pre Rec
Case 3.29167 94.64 90.38 88.82 94.86 91.05 89.56 90.18 74.51 70.96 88.53 87.46 86.19
Deg 1.58333 98.41 97.61 97.83 98.59 97.78 98.1 94.72 94.2 93.23 93.57 93.78 94.69
Dis 2.54167 97.93 89.08 93.73 97.98 89.15 93.85 97.05 70.66 67.94 92.97 87.68 92.04
Gend 1.5 97.44 94.29 93.99 97.63 94.55 94.39 93.5 90.09 88.15 91.71 92.12 92.14
Mood_Tense_Voice 1.08333 98.9 90.21 86.81 99.07 90.88 88.88 95.37 81.31 83.74 93.91 83.72 82.34
Numb 2.29167 98.12 98.1 97.92 98.26 98.24 98.06 95.3 93.68 93.33 93.87 93.83 93.33
Person 1.5 99.77 99.12 98.2 99.83 99.25 98.64 98.49 98.05 95.65 98.29 96.08 93.33
lemma 1.66667 97.32 84.41 84.09 97.72 90.53 90.52 89.23 76.31 75.96 92.56 69.62 70.3
pos 1.95833 96.8 95.29 95.27 97.08 95.9 95.5 91.22 73.59

Score on other corpora

Glaise, Part 2

Definition to come

Task Accuracy Accuracy on V != _
lemma 95.59 95.59
Deg 97.79 93.35
Numb 96.68 95.88
Person 99.9 99.62
Mood_Tense_Voice 98.39 94.72
Case 92.89 88.42
Gend 95.83 89.15
pos 93.75 93.75

Credits

The web application and its maintenance is done by Thibault Clérice ( @ponteineptique ). To learn how to cite this repository, go check our releases.

Information about the model

LASLA Logo

The model is based on the LASLA data.