centre-for-humanities-computing / odyCy

A general-purpose NLP pipeline for Ancient Greek
https://centre-for-humanities-computing.github.io/odyCy/
MIT License

training data report #7

Status: Closed (jankounchained closed this issue 1 year ago)

jankounchained commented 1 year ago

Output of python3 -m spacy debug data configs/transformer.cfg:

============================ Data file validation ============================
✔ Pipeline can be initialized with data
✔ Corpus is loadable

=============================== Training stats ===============================
Language: grc
Training pipeline: transformer, tagger, morphologizer, lemmatizer, parser,
senter, entity_ruler
2649 training docs
216 evaluation docs
✔ No overlap between training and evaluation data

============================== Vocab & Vectors ==============================
ℹ 346928 total word(s) in the data (54644 unique)
⚠ 445 misaligned tokens in the training data
⚠ 28 misaligned tokens in the dev data
ℹ No word vectors present in the package

=========================== Part-of-speech Tagging ===========================
ℹ 831 label(s) in train data
⚠ Some model labels are not present in the train data. The model
performance may be degraded for these labels after training: 'v2sfi----',
'v--amm---'.

========================= Morphologizer (POS+Morph) =========================
ℹ 1431 label(s) in train data
⚠ Some model labels are not present in the train data. The model
performance may be degraded for these labels after training:
'Mood=Ind|Number=Sing|POS=AUX|Person=2|Tense=Fut|VerbForm=Fin',
'Mood=Imp|POS=VERB|Tense=Past|VerbForm=Fin|Voice=Mid'.

============================= Dependency Parsing =============================
ℹ Found 26432 sentence(s) with an average length of 13.1 words.
ℹ Found 2534 nonprojective train sentence(s)
ℹ 35 label(s) in train data
ℹ 415 label(s) in projectivized train data
⚠ Low number of examples for label 'dep' (4)
⚠ Low number of examples for 252 label(s) in the projectivized
dependency trees used for training. You may want to projectivize labels such as
punct before training in order to improve parser performance.

================================== Summary ==================================
✔ 3 checks passed
⚠ 9 warnings

jankounchained commented 1 year ago

marton checked this
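
For anyone rechecking the Dependency Parsing numbers by hand: the report counts 2534 nonprojective sentences out of 26432. A minimal sketch of what "nonprojective" means is below, assuming each sentence is given as a list of head indices in which the root points to itself. The function names `is_nonprojective` and `nonprojective_share` are hypothetical helpers for illustration, not part of spaCy or odyCy, and spaCy's own check may differ in detail.

```python
from typing import List


def is_nonprojective(heads: List[int]) -> bool:
    """Return True if the dependency tree has at least one nonprojective arc.

    heads[i] is the index of token i's head; the root token points to itself.
    An arc head -> dep is nonprojective when some token lying strictly
    between them is not a descendant of the head.
    """

    def is_ancestor(ancestor: int, token: int) -> bool:
        # Walk up the tree from `token`; the step limit guards against
        # malformed input that contains a cycle.
        steps = 0
        while heads[token] != token and steps <= len(heads):
            token = heads[token]
            if token == ancestor:
                return True
            steps += 1
        return False

    for dep, head in enumerate(heads):
        if head == dep:  # root token, no incoming arc to check
            continue
        left, right = sorted((head, dep))
        for between in range(left + 1, right):
            if not is_ancestor(head, between):
                return True
    return False


def nonprojective_share(n_nonprojective: int, n_sentences: int) -> float:
    """Percentage of sentences that are nonprojective, rounded to 0.1."""
    return round(100 * n_nonprojective / n_sentences, 1)


# "A hearing is scheduled on the issue today": the arc hearing -> on
# spans the root "is", which is not a descendant of "hearing".
crossing = [1, 2, 2, 2, 1, 6, 4, 3]
# Simple projective tree: every token attaches to the middle token.
flat = [1, 1, 1]

print(is_nonprojective(crossing), is_nonprojective(flat))
print(nonprojective_share(2534, 26432))
```

With the counts from the report, the share works out to roughly 9.6% of training sentences, which is why the parser trains on projectivized trees with the larger decorated label set (415 labels instead of 35).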