Closed xssChauhan closed 5 years ago
Could you run the (experimental) debug-data command and see if it produces any hints? I suspect what might be happening is that the corpus includes labels or tags that aren't in the tag map and spaCy doesn't fail very gracefully here – or something similar to that.
debug-data currently isn't officially documented, but you can type python -m spacy debug-data --help for docs. Here's an example command:
python -m spacy debug-data en /path/to/train.json /path/to/dev.json --pipeline tagger,parser
You can also add the --verbose flag to make it show more details.
For the example in the docs:
>> python -m spacy debug-data es es_ancora-ud-train.jsonl es_ancora-ud-dev.jsonl --verbose
=========================== Data format validation ===========================
✔ Loaded es_ancora-ud-train.jsonl
✔ Loaded es_ancora-ud-dev.jsonl
✔ Training data JSON format is valid
✔ Development data JSON format is valid
✔ Corpus is loadable
=============================== Training stats ===============================
Training pipeline: tagger, parser, ner
Starting with blank model 'es'
14305 training docs
1654 evaluation docs
✔ No overlap between training and evaluation data
============================== Vocab & Vectors ==============================
ℹ 444617 total words in the data (37523 unique)
10 most common words: 'de' (26711), ',' (24417), 'la' (15223), '.' (14179),
'que' (13184), 'el' (11901), 'en' (10519), 'y' (9242), 'a' (7943), '"' (7385)
ℹ No word vectors present in the model
========================== Named Entity Recognition ==========================
ℹ 0 new labels, 0 existing labels
444617 missing values (tokens with '-' label)
✔ Good amount of examples for all labels
✔ Examples without occurences available for all labels
✔ No entities consisting of or starting/ending with whitespace
=========================== Part-of-speech Tagging ===========================
ℹ 17 labels in data (300 labels in tag map)
'NOUN' (81523), 'ADP' (71192), 'DET' (60656), 'PUNCT' (52915), 'VERB' (36311),
'PROPN' (34458), 'ADJ' (29439), 'PRON' (19947), 'ADV' (14496), 'AUX' (13779),
'CCONJ' (12225), 'SCONJ' (10129), 'NUM' (6929), 'SYM' (406), 'PART' (122),
'INTJ' (88), 'X' (2)
✘ Label 'VERB' not found in tag map for language 'es'
✘ Label 'PUNCT' not found in tag map for language 'es'
✘ Label 'NOUN' not found in tag map for language 'es'
✘ Label 'PROPN' not found in tag map for language 'es'
✘ Label 'ADV' not found in tag map for language 'es'
✘ Label 'ADJ' not found in tag map for language 'es'
✘ Label 'CCONJ' not found in tag map for language 'es'
✘ Label 'PRON' not found in tag map for language 'es'
✘ Label 'AUX' not found in tag map for language 'es'
✘ Label 'SCONJ' not found in tag map for language 'es'
✘ Label 'NUM' not found in tag map for language 'es'
✘ Label 'PART' not found in tag map for language 'es'
✘ Label 'SYM' not found in tag map for language 'es'
✘ Label 'INTJ' not found in tag map for language 'es'
✘ Label 'X' not found in tag map for language 'es'
============================= Dependency Parsing =============================
ℹ 139 labels in data
'case' (61501), 'det' (60246), 'punct' (51801), 'nmod' (31254), 'obj' (30273),
'nsubj' (23886), 'amod' (23811), 'obl' (19278), 'advmod' (16127), 'mark'
(15599), 'ROOT' (14305), 'conj' (12990), 'cc' (12190), 'flat' (11626), 'acl'
(8528), 'aux' (7857), 'advcl' (6951), 'fixed' (6417), 'appos' (6093), 'cop'
(4717), 'ccomp' (4671), 'nummod' (4519), 'xcomp' (2064), 'compound' (2010),
'iobj' (1437), 'csubj' (998), 'punct||conj' (993), 'parataxis' (534),
'expl:pass' (411), 'dep' (242), 'nsubj||conj' (120), 'mark||conj' (88),
'advmod||conj' (88), 'flat||det' (78), 'obj||conj' (62), 'obj||cc' (55),
'obl||conj' (51), 'punct||det' (44), 'cc||cc' (40), 'appos||det' (39),
'advcl||conj' (33), 'nsubj:pass' (30), 'aux||conj' (29), 'nmod||cc' (29),
'nsubj||cc' (29), 'nmod||det' (24), 'det||conj' (22), 'orphan' (22), 'obl||cc'
(20), 'ccomp||cc' (19), 'advcl||cc' (18), 'cc||conj' (17), 'mark||ccomp' (16),
'mark||advcl' (15), 'case||cc' (15), 'acl||det' (12), 'punct||xcomp' (10),
'xcomp||cc' (10), 'punct||ccomp' (10), 'punct||case' (9), 'acl||cc' (9),
'advmod||ccomp' (9), 'mark||acl' (9), 'punct||advcl' (8), 'obl||xcomp' (8),
'punct||acl' (8), 'amod||cc' (7), 'nsubj||advcl' (7), 'fixed||case' (7),
'obl||advcl' (7), 'obl||ccomp' (7), 'iobj||conj' (6), 'mark||xcomp' (6),
'advmod||cc' (6), 'nsubj||acl' (6), 'case||cop' (6), 'appos||cc' (6),
'compound||det' (6), 'csubj||cc' (5), 'nsubj||ccomp' (5), 'mark||cc' (5),
'mark||csubj' (4), 'cop||cc' (4), 'advcl||parataxis' (3), 'advmod||xcomp' (3),
'appos||obj' (3), 'det||cc' (3), 'case||amod' (3), 'advmod||acl' (3), 'obl||acl'
(3), 'advmod||advcl' (2), 'nsubj||xcomp' (2), 'flat||obj' (2), 'nummod||det'
(2), 'appos||conj' (2), 'obj||ccomp' (2), 'advcl||ccomp' (2), 'case||nmod' (2),
'obl||parataxis' (2), 'case||flat' (2), 'det||advcl' (2), 'advmod||parataxis'
(2), 'cop||conj' (2), 'det||obj' (1), 'nmod||obj' (1), 'acl||obj' (1),
'ccomp||conj' (1), 'obl||csubj' (1), 'advmod||csubj' (1), 'punct||csubj' (1),
'cop||ccomp' (1), 'case||appos' (1), 'fixed||mark' (1), 'aux||xcomp' (1),
'obj||xcomp' (1), 'aux||cc' (1), 'csubj:pass' (1), 'punct||flat' (1), 'obj||acl'
(1), 'flat||appos' (1), 'punct||parataxis' (1), 'amod||det' (1), 'punct||aux'
(1), 'appos||advmod' (1), 'advcl||advcl' (1), 'csubj||conj' (1), 'mark||cop'
(1), 'compound||aux' (1), 'det||ccomp' (1), 'nmod||conj' (1), 'advcl||acl' (1),
'amod||nummod' (1), 'amod||conj' (1), 'case||conj' (1), 'case||det' (1),
'obj||nmod' (1), 'nsubj||csubj' (1), 'obj||csubj' (1), 'obj||parataxis' (1)
================================== Summary ==================================
✔ 9 checks passed
✘ 15 errors
For the dataset that I am trying to work on:
>> python -m spacy debug-data en en-ud-tweet-train.jsonl en-ud-tweet-dev.jsonl
=========================== Data format validation ===========================
✔ Loaded en-ud-tweet-train.jsonl
✔ Loaded en-ud-tweet-dev.jsonl
✔ Training data JSON format is valid
✔ Development data JSON format is valid
✔ Corpus is loadable
=============================== Training stats ===============================
Training pipeline: tagger, parser, ner
Starting with blank model 'en'
1639 training docs
710 evaluation docs
✔ No overlap between training and evaluation data
⚠ Low number of examples to train from a blank model (1639)
============================== Vocab & Vectors ==============================
ℹ 24753 total words in the data (8564 unique)
ℹ No word vectors present in the model
========================== Named Entity Recognition ==========================
ℹ 0 new labels, 0 existing labels
24753 missing values (tokens with '-' label)
✔ Good amount of examples for all labels
✔ Examples without occurences available for all labels
✔ No entities consisting of or starting/ending with whitespace
=========================== Part-of-speech Tagging ===========================
ℹ 42 labels in data (57 labels in tag map)
✘ Label 'V_V' not found in tag map for language 'en'
✘ Label 'N_N' not found in tag map for language 'en'
✘ Label 'P_P' not found in tag map for language 'en'
✘ Label 'D_D' not found in tag map for language 'en'
✘ Label 'R_R' not found in tag map for language 'en'
✘ Label 'A_A' not found in tag map for language 'en'
✘ Label ',_,' not found in tag map for language 'en'
✘ Label 'X' not found in tag map for language 'en'
✘ Label 'PUNCT' not found in tag map for language 'en'
✘ Label 'NOUN' not found in tag map for language 'en'
✘ Label 'PRON' not found in tag map for language 'en'
✘ Label 'VERB' not found in tag map for language 'en'
✘ Label 'PART' not found in tag map for language 'en'
✘ Label 'ADP' not found in tag map for language 'en'
✘ Label 'CCONJ' not found in tag map for language 'en'
✘ Label 'ADJ' not found in tag map for language 'en'
✘ Label '~_~' not found in tag map for language 'en'
✘ Label '@_@' not found in tag map for language 'en'
✘ Label 'O_O' not found in tag map for language 'en'
✘ Label 'L_L' not found in tag map for language 'en'
✘ Label '&_&' not found in tag map for language 'en'
✘ Label '#_#' not found in tag map for language 'en'
✘ Label 'U_U' not found in tag map for language 'en'
✘ Label 'E_E' not found in tag map for language 'en'
✘ Label '!_!' not found in tag map for language 'en'
✘ Label 'PROPN' not found in tag map for language 'en'
✘ Label 'NUM' not found in tag map for language 'en'
✘ Label 'ADV' not found in tag map for language 'en'
✘ Label 'DET' not found in tag map for language 'en'
✘ Label 'AUX' not found in tag map for language 'en'
✘ Label 'INTJ' not found in tag map for language 'en'
✘ Label 'SCONJ' not found in tag map for language 'en'
✘ Label '^_^' not found in tag map for language 'en'
✘ Label '$_$' not found in tag map for language 'en'
✘ Label 'G_G' not found in tag map for language 'en'
✘ Label 'T_T' not found in tag map for language 'en'
✘ Label 'X_X' not found in tag map for language 'en'
✘ Label 'Z_Z' not found in tag map for language 'en'
✘ Label 'S_S' not found in tag map for language 'en'
✘ Label 'Y_Y' not found in tag map for language 'en'
✘ Label 'M_M' not found in tag map for language 'en'
============================= Dependency Parsing =============================
ℹ 57 labels in data
================================== Summary ==================================
✔ 9 checks passed
⚠ 1 warning
✘ 41 errors
(factmata) ➜ Tweebank git:(dev) ✗ python -m spacy debug-data en en-ud-tweet-train.jsonl en-ud-tweet-dev.jsonl --verbose
=========================== Data format validation ===========================
✔ Loaded en-ud-tweet-train.jsonl
✔ Loaded en-ud-tweet-dev.jsonl
✔ Training data JSON format is valid
✔ Development data JSON format is valid
✔ Corpus is loadable
=============================== Training stats ===============================
Training pipeline: tagger, parser, ner
Starting with blank model 'en'
1639 training docs
710 evaluation docs
✔ No overlap between training and evaluation data
⚠ Low number of examples to train from a blank model (1639)
It's recommended to use at least 2000 examples (minimum 100)
============================== Vocab & Vectors ==============================
ℹ 24753 total words in the data (8564 unique)
10 most common words: ':' (773), 'RT' (638), '.' (593), 'I' (398), 'the' (394),
'to' (367), ',' (345), 'a' (276), '!' (271), 'you' (269)
ℹ No word vectors present in the model
========================== Named Entity Recognition ==========================
ℹ 0 new labels, 0 existing labels
24753 missing values (tokens with '-' label)
✔ Good amount of examples for all labels
✔ Examples without occurences available for all labels
✔ No entities consisting of or starting/ending with whitespace
=========================== Part-of-speech Tagging ===========================
ℹ 42 labels in data (57 labels in tag map)
'NOUN' (2251), 'PUNCT' (2116), 'X' (1998), 'VERB' (1699), 'PRON' (1452), 'PROPN'
(1356), 'V_V' (1288), 'N_N' (1110), 'ADP' (1061), ',_,' (944), 'ADJ' (839),
'AUX' (789), 'DET' (728), 'ADV' (686), 'P_P' (671), 'O_O' (600), 'D_D' (484),
'^_^' (463), 'A_A' (439), '@_@' (412), 'R_R' (383), 'PART' (368), 'NUM' (285),
'~_~' (284), 'CCONJ' (272), 'SYM' (265), 'L_L' (239), '!_!' (225), 'INTJ' (160),
'SCONJ' (150), '$_$' (131), '&_&' (120), 'U_U' (118), 'E_E' (90), '#_#' (85),
'G_G' (69), 'T_T' (54), 'Z_Z' (43), 'X_X' (15), 'S_S' (8), 'M_M' (2), 'Y_Y' (1)
✘ Label 'V_V' not found in tag map for language 'en'
✘ Label 'R_R' not found in tag map for language 'en'
✘ Label 'A_A' not found in tag map for language 'en'
✘ Label 'P_P' not found in tag map for language 'en'
✘ Label 'D_D' not found in tag map for language 'en'
✘ Label ',_,' not found in tag map for language 'en'
✘ Label '#_#' not found in tag map for language 'en'
✘ Label 'ADJ' not found in tag map for language 'en'
✘ Label 'NOUN' not found in tag map for language 'en'
✘ Label 'PUNCT' not found in tag map for language 'en'
✘ Label 'X' not found in tag map for language 'en'
✘ Label 'PROPN' not found in tag map for language 'en'
✘ Label 'NUM' not found in tag map for language 'en'
✘ Label '@_@' not found in tag map for language 'en'
✘ Label '!_!' not found in tag map for language 'en'
✘ Label 'N_N' not found in tag map for language 'en'
✘ Label 'O_O' not found in tag map for language 'en'
✘ Label 'DET' not found in tag map for language 'en'
✘ Label 'PRON' not found in tag map for language 'en'
✘ Label 'AUX' not found in tag map for language 'en'
✘ Label 'CCONJ' not found in tag map for language 'en'
✘ Label 'ADP' not found in tag map for language 'en'
✘ Label '~_~' not found in tag map for language 'en'
✘ Label 'L_L' not found in tag map for language 'en'
✘ Label 'INTJ' not found in tag map for language 'en'
✘ Label 'VERB' not found in tag map for language 'en'
✘ Label 'SCONJ' not found in tag map for language 'en'
✘ Label '&_&' not found in tag map for language 'en'
✘ Label 'ADV' not found in tag map for language 'en'
✘ Label '$_$' not found in tag map for language 'en'
✘ Label 'G_G' not found in tag map for language 'en'
✘ Label 'E_E' not found in tag map for language 'en'
✘ Label '^_^' not found in tag map for language 'en'
✘ Label 'Z_Z' not found in tag map for language 'en'
✘ Label 'PART' not found in tag map for language 'en'
✘ Label 'U_U' not found in tag map for language 'en'
✘ Label 'T_T' not found in tag map for language 'en'
✘ Label 'X_X' not found in tag map for language 'en'
✘ Label 'M_M' not found in tag map for language 'en'
✘ Label 'S_S' not found in tag map for language 'en'
✘ Label 'Y_Y' not found in tag map for language 'en'
============================= Dependency Parsing =============================
ℹ 57 labels in data
'punct' (3250), 'ROOT' (2470), 'discourse' (2139), 'nsubj' (1940), 'case'
(1484), 'obj' (1188), 'advmod' (1132), 'det' (1046), 'obl' (855), 'amod' (796),
'compound' (785), 'list' (694), 'vocative' (615), 'cop' (582), 'mark' (577),
'aux' (557), 'parataxis' (522), 'nmod' (500), 'conj' (467), 'nmod:poss' (432),
'cc' (411), 'xcomp' (405), 'nummod' (298), 'advcl' (273), 'ccomp' (232), 'flat'
(190), 'compound:prt' (123), 'acl:relcl' (116), 'appos' (109), 'acl' (105),
'obl:tmod' (80), 'aux:pass' (48), 'nsubj:pass' (46), 'obl:npmod' (40), 'iobj'
(36), 'goeswith' (28), 'nmod:tmod' (27), 'det:predet' (25), 'csubj' (24),
'nmod:npmod' (23), 'expl' (23), 'flat:foreign' (18), 'fixed' (16), 'reparandum'
(8), 'appos||obj' (4), 'case||obl' (2), 'acl||obj' (2), 'cc:preconj' (1),
'case||nsubj:pass' (1), 'obl||amod' (1), 'conj||obl' (1), 'obj||xcomp' (1),
'dislocated||xcomp' (1), 'orphan' (1), 'dislocated' (1), 'cc||conj' (1),
'conj||aux' (1)
================================== Summary ==================================
✔ 9 checks passed
⚠ 1 warning
✘ 41 errors
How come the tags VERB, PRON, PROPN are labelled as not present in tag map for en?
I am getting the exact same error message. Working in Python 3.7.1. Mac OSX 10.14.4. Spacy 2.1.3
I think I have the same error trying to evaluate model (with spacy 2.1.3 and 2.1.1 also):
python -m spacy convert /home/julm/work/data/ud/UD_English-EWT-master/en_ewt-ud-test.conllu ud_en_ewt -n 10 -l en
python -m spacy evaluate en /home/julm/work/data/ud/ud_en_ewt/en_ewt-ud-test.jsonl
Traceback (most recent call last):
  File "/home/julm/anaconda3/envs/python3_spacy/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/julm/anaconda3/envs/python3_spacy/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/julm/anaconda3/envs/python3_spacy/lib/python3.6/site-packages/spacy/__main__.py", line 35, in <module>
    plac.call(commands[command], sys.argv[1:])
  File "/home/julm/anaconda3/envs/python3_spacy/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/julm/anaconda3/envs/python3_spacy/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/julm/anaconda3/envs/python3_spacy/lib/python3.6/site-packages/spacy/cli/evaluate.py", line 44, in evaluate
    corpus = GoldCorpus(data_path, data_path)
  File "gold.pyx", line 112, in spacy.gold.GoldCorpus.__init__
  File "gold.pyx", line 125, in spacy.gold.GoldCorpus.write_msgpack
KeyError: 1
By the way, evaluate works with the UD Ukrainian JSON I converted with spaCy 2.0 (but with a code change in convert).
@ines are you able to reproduce this error? It seems like we are all getting this error from running the code in the official documentation (https://spacy.io/usage/training#spacy-train-cli). The repository UniversalDependencies/UD_Spanish-AnCora has not been updated during the last 5 months, so maybe the error is related to a spaCy version upgrade?
How come the tags VERB, PRON, PROPN are labelled as not present in tag map for en?
The tag map maps fine-grained tags from the treebank to coarse-grained universal tags – for example, in English, it'd map something like VBZ to VERB. In your treebank, the universal tags plus a bunch of custom ones are used. So essentially, you need to provide spaCy with a custom tag map that matches the tag set in the data.
There are a few todos on this topic for us:
spacy train
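To make the tag map explanation above concrete, here is a minimal sketch in plain Python of what a custom tag map for this kind of data could look like. It does not call spaCy at all. The helper name build_tag_map, the {"pos": ...} value shape, and the choice of universal tag for each custom label (V_V, N_N, and so on) are assumptions made for illustration only; check them against the treebank's own documentation and against the tag map format your spaCy version expects.

```python
# The 17 universal POS tags, as listed in the debug-data output for the
# Spanish example.
UNIVERSAL_TAGS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}

# Hypothetical mapping for a few of the custom labels reported by
# debug-data. The right-hand sides are illustrative guesses, not taken
# from the thread.
CUSTOM_TO_UNIVERSAL = {
    "V_V": "VERB",
    "N_N": "NOUN",
    "A_A": "ADJ",
    "R_R": "ADV",
    "D_D": "DET",
    "P_P": "ADP",
    ",_,": "PUNCT",
}


def build_tag_map(labels):
    """Return (tag_map, unmapped) for the labels found in the data."""
    tag_map, unmapped = {}, []
    for label in sorted(labels):
        if label in UNIVERSAL_TAGS:
            # A universal tag used as a fine-grained tag maps to itself.
            tag_map[label] = {"pos": label}
        elif label in CUSTOM_TO_UNIVERSAL:
            tag_map[label] = {"pos": CUSTOM_TO_UNIVERSAL[label]}
        else:
            # No mapping known yet: this one needs a decision by hand.
            unmapped.append(label)
    return tag_map, unmapped


tag_map, unmapped = build_tag_map(["VERB", "V_V", "N_N", "PRON", "~_~"])
print(tag_map)
print(unmapped)
```

Every label that ends up in unmapped would still show up as "not found in tag map" until it is given an entry.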
I get the same error with UD_Norwegian-Bokmaal using spacy 2.1.3.
I have json-files generated with "spacy convert" from spacy 2.0.18 and I'm able to use these to train without error.
I think the issue is that spacy convert changed behaviour. We should change it back for the next release.
spacy convert is currently defaulting to jsonl-formatted output, which is causing problems. If you set -t json it should work I think.
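For files that were already converted to .jsonl, re-running spacy convert with -t json is the route suggested above. As a rough alternative sketch, and assuming that each line of the .jsonl file holds one document object and that the .json format is simply a list of those same objects (an assumption worth verifying on your own data), the file could be rewritten with the standard library alone. The function name jsonl_to_json is hypothetical.

```python
import json
from pathlib import Path


def jsonl_to_json(jsonl_path, json_path):
    """Collect one-document-per-line JSONL into a single JSON list.

    Returns the number of documents written.
    """
    docs = []
    with Path(jsonl_path).open(encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                docs.append(json.loads(line))
    with Path(json_path).open("w", encoding="utf8") as f:
        json.dump(docs, f, ensure_ascii=False, indent=2)
    return len(docs)
```

A call such as jsonl_to_json("en-ud-tweet-train.jsonl", "en-ud-tweet-train.json") would then produce a file to pass to the training or evaluation command in place of the .jsonl one.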
@honnibal : Thanks for looking into this so quickly! Problem solved!
@honnibal Thank you for pointing it out. I would like to help out with this issue. Any references that can help me get started?
@xssChauhan I think it could be as straightforward as changing the default value from "jsonl" to "json" here:
If you'd like to test this and submit a PR, that'd be great!
@ines Opened #3583 to fix this.
I am trying to train a new spacy model based on the Tweebank annotated data. For that I first tried using the training example given in the docs to familiarize myself with the procedure. Example and training on the Tweebank throw the same error.
How to reproduce the behaviour
Follow the example here. For the sake of completeness:
Your Environment
Info about spaCy
The Error