Output of python3 -m spacy debug data configs/transformer.cfg:
============================ Data file validation ============================
✔ Pipeline can be initialized with data
✔ Corpus is loadable
=============================== Training stats ===============================
Language: grc
Training pipeline: transformer, tagger, morphologizer, lemmatizer, parser,
senter, entity_ruler
2649 training docs
216 evaluation docs
✔ No overlap between training and evaluation data
============================== Vocab & Vectors ==============================
ℹ 346928 total word(s) in the data (54644 unique)
⚠ 445 misaligned tokens in the training data
⚠ 28 misaligned tokens in the dev data
ℹ No word vectors present in the package
=========================== Part-of-speech Tagging ===========================
ℹ 831 label(s) in train data
⚠ Some model labels are not present in the train data. The model
performance may be degraded for these labels after training: 'v2sfi----',
'v--amm---'.
========================= Morphologizer (POS+Morph) =========================
ℹ 1431 label(s) in train data
⚠ Some model labels are not present in the train data. The model
performance may be degraded for these labels after training:
'Mood=Ind|Number=Sing|POS=AUX|Person=2|Tense=Fut|VerbForm=Fin',
'Mood=Imp|POS=VERB|Tense=Past|VerbForm=Fin|Voice=Mid'.
============================= Dependency Parsing =============================
ℹ Found 26432 sentence(s) with an average length of 13.1 words.
ℹ Found 2534 nonprojective train sentence(s)
ℹ 35 label(s) in train data
ℹ 415 label(s) in projectivized train data
⚠ Low number of examples for label 'dep' (4)
⚠ Low number of examples for 252 label(s) in the projectivized
dependency trees used for training. You may want to projectivize labels such as
punct before training in order to improve parser performance.
================================== Summary ==================================
✔ 3 checks passed
⚠ 9 warnings
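
The Dependency Parsing section above reports 2534 nonprojective training sentences. To make that concrete, here is a minimal sketch of a projectivity check, assuming the common definition that a tree is nonprojective when two of its arcs cross. The function name `is_nonprojective` and the head encoding (a list of head indices, with `None` for the root) are illustrative assumptions, not spaCy's own API or implementation.

```python
from typing import List, Optional


def is_nonprojective(heads: List[Optional[int]]) -> bool:
    """Return True if any two dependency arcs cross.

    heads[i] is the index of token i's head; the root token has head None.
    (Hypothetical helper for illustration, not part of spaCy.)
    """
    # Store each arc as (left endpoint, right endpoint), ignoring direction.
    arcs = [(min(i, h), max(i, h)) for i, h in enumerate(heads) if h is not None]
    for a_start, a_end in arcs:
        for b_start, b_end in arcs:
            # Arc b starts strictly inside arc a and ends strictly outside it.
            if a_start < b_start < a_end < b_end:
                return True
    return False


# Token 1 attaches to token 3 across the arc between tokens 0 and 2.
crossing = [2, 3, None, 2]
# A simple chain around a root at index 1 has no crossing arcs.
flat = [1, None, 1]
```

The pairwise comparison is quadratic in sentence length, which is fine for sentences averaging about 13 words. It only illustrates what the count refers to; it is not meant to reproduce the exact number in the log.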