explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Training a new model using cli throws error `KeyError` #3523

Closed xssChauhan closed 5 years ago

xssChauhan commented 5 years ago

I am trying to train a new spaCy model on the Tweebank annotated data. To familiarize myself with the procedure, I first tried the training example given in the docs. Both the docs example and training on Tweebank throw the same error.

How to reproduce the behaviour

Follow the example here. For the sake of completeness:

git clone https://github.com/UniversalDependencies/UD_Spanish-AnCora
mkdir ancora-json
python -m spacy convert UD_Spanish-AnCora/es_ancora-ud-train.conllu ancora-json
python -m spacy convert UD_Spanish-AnCora/es_ancora-ud-dev.conllu ancora-json
mkdir models
python -m spacy train es models ancora-json/es_ancora-ud-train.jsonl ancora-json/es_ancora-ud-dev.jsonl

Your Environment

Info about spaCy

The Error

>>> python -m spacy train es models es_ancora-ud-train.jsonl es_ancora-ud-dev.jsonl
Training pipeline: ['tagger', 'parser', 'ner']
Starting with blank model 'es'
Counting training words (limit=0)
Traceback (most recent call last):
  File "/home/shikhar/.conda/envs/factmata/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/shikhar/.conda/envs/factmata/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/shikhar/.conda/envs/factmata/lib/python3.6/site-packages/spacy/__main__.py", line 35, in <module>
    plac.call(commands[command], sys.argv[1:])
  File "/home/shikhar/.conda/envs/factmata/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/shikhar/.conda/envs/factmata/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/shikhar/.conda/envs/factmata/lib/python3.6/site-packages/spacy/cli/train.py", line 196, in train
    corpus = GoldCorpus(train_path, dev_path, limit=n_examples)
  File "gold.pyx", line 112, in spacy.gold.GoldCorpus.__init__
  File "gold.pyx", line 125, in spacy.gold.GoldCorpus.write_msgpack
KeyError: 1

ines commented 5 years ago

Could you run the (experimental) debug-data command and see if it produces any hints? I suspect what might be happening is that the corpus includes labels or tags that aren't in the tag map and spaCy doesn't fail very gracefully here – or something similar to that.

debug-data currently isn't officially documented, but you can type python -m spacy debug-data --help for docs. Here's an example command:

python -m spacy debug-data en /path/to/train.json /path/to/dev.json --pipeline tagger,parser

You can also add the --verbose flag to make it show more details.
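
For anyone who wants to run a similar check by hand, here is a minimal sketch of the idea behind the tag map check: collect the tags used in the training docs and list the ones that have no entry in the tag map. It assumes the nested paragraphs / sentences / tokens layout of spaCy v2 training JSON; the helper names, the sample docs, and the sample tag map are made up for illustration and are not debug-data's actual implementation.

```python
def collect_tags(docs):
    """Return the set of tags used in spaCy v2-style training docs.

    Assumes each doc looks like:
    {"paragraphs": [{"sentences": [{"tokens": [{"tag": ...}, ...]}]}]}
    """
    tags = set()
    for doc in docs:
        for paragraph in doc.get("paragraphs", []):
            for sentence in paragraph.get("sentences", []):
                for token in sentence.get("tokens", []):
                    tag = token.get("tag")
                    if tag:
                        tags.add(tag)
    return tags


def missing_from_tag_map(docs, tag_map):
    """Return the sorted tags that occur in the data but not in the tag map."""
    return sorted(collect_tags(docs) - set(tag_map))


# Hypothetical in-memory data, only to show the shape of the check.
sample_docs = [
    {"id": 0, "paragraphs": [{"sentences": [{"tokens": [
        {"id": 0, "orth": "She", "tag": "PRON", "head": 1, "dep": "nsubj"},
        {"id": 1, "orth": "runs", "tag": "VBZ", "head": 0, "dep": "ROOT"},
    ]}]}]}
]
# Hypothetical tag map keyed by fine-grained tags.
sample_tag_map = {"VBZ": {"pos": "VERB"}, "PRP": {"pos": "PRON"}}

missing = missing_from_tag_map(sample_docs, sample_tag_map)
print(missing)
```

Here the data uses the coarse tag PRON where the tag map only knows the fine-grained PRP, so PRON is reported as missing.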

xssChauhan commented 5 years ago

For the example in the docs:

 >> python -m spacy debug-data es es_ancora-ud-train.jsonl es_ancora-ud-dev.jsonl --verbose

=========================== Data format validation ===========================
✔ Loaded es_ancora-ud-train.jsonl
✔ Loaded es_ancora-ud-dev.jsonl
✔ Training data JSON format is valid
✔ Development data JSON format is valid
✔ Corpus is loadable

=============================== Training stats ===============================
Training pipeline: tagger, parser, ner
Starting with blank model 'es'
14305 training docs
1654 evaluation docs
✔ No overlap between training and evaluation data

============================== Vocab & Vectors ==============================
ℹ 444617 total words in the data (37523 unique)
10 most common words: 'de' (26711), ',' (24417), 'la' (15223), '.' (14179),
'que' (13184), 'el' (11901), 'en' (10519), 'y' (9242), 'a' (7943), '"' (7385)
ℹ No word vectors present in the model

========================== Named Entity Recognition ==========================
ℹ 0 new labels, 0 existing labels
444617 missing values (tokens with '-' label)
✔ Good amount of examples for all labels
✔ Examples without occurences available for all labels
✔ No entities consisting of or starting/ending with whitespace

=========================== Part-of-speech Tagging ===========================
ℹ 17 labels in data (300 labels in tag map)
'NOUN' (81523), 'ADP' (71192), 'DET' (60656), 'PUNCT' (52915), 'VERB' (36311),
'PROPN' (34458), 'ADJ' (29439), 'PRON' (19947), 'ADV' (14496), 'AUX' (13779),
'CCONJ' (12225), 'SCONJ' (10129), 'NUM' (6929), 'SYM' (406), 'PART' (122),
'INTJ' (88), 'X' (2)
✘ Label 'VERB' not found in tag map for language 'es'
✘ Label 'PUNCT' not found in tag map for language 'es'
✘ Label 'NOUN' not found in tag map for language 'es'
✘ Label 'PROPN' not found in tag map for language 'es'
✘ Label 'ADV' not found in tag map for language 'es'
✘ Label 'ADJ' not found in tag map for language 'es'
✘ Label 'CCONJ' not found in tag map for language 'es'
✘ Label 'PRON' not found in tag map for language 'es'
✘ Label 'AUX' not found in tag map for language 'es'
✘ Label 'SCONJ' not found in tag map for language 'es'
✘ Label 'NUM' not found in tag map for language 'es'
✘ Label 'PART' not found in tag map for language 'es'
✘ Label 'SYM' not found in tag map for language 'es'
✘ Label 'INTJ' not found in tag map for language 'es'
✘ Label 'X' not found in tag map for language 'es'

============================= Dependency Parsing =============================
ℹ 139 labels in data
'case' (61501), 'det' (60246), 'punct' (51801), 'nmod' (31254), 'obj' (30273),
'nsubj' (23886), 'amod' (23811), 'obl' (19278), 'advmod' (16127), 'mark'
(15599), 'ROOT' (14305), 'conj' (12990), 'cc' (12190), 'flat' (11626), 'acl'
(8528), 'aux' (7857), 'advcl' (6951), 'fixed' (6417), 'appos' (6093), 'cop'
(4717), 'ccomp' (4671), 'nummod' (4519), 'xcomp' (2064), 'compound' (2010),
'iobj' (1437), 'csubj' (998), 'punct||conj' (993), 'parataxis' (534),
'expl:pass' (411), 'dep' (242), 'nsubj||conj' (120), 'mark||conj' (88),
'advmod||conj' (88), 'flat||det' (78), 'obj||conj' (62), 'obj||cc' (55),
'obl||conj' (51), 'punct||det' (44), 'cc||cc' (40), 'appos||det' (39),
'advcl||conj' (33), 'nsubj:pass' (30), 'aux||conj' (29), 'nmod||cc' (29),
'nsubj||cc' (29), 'nmod||det' (24), 'det||conj' (22), 'orphan' (22), 'obl||cc'
(20), 'ccomp||cc' (19), 'advcl||cc' (18), 'cc||conj' (17), 'mark||ccomp' (16),
'mark||advcl' (15), 'case||cc' (15), 'acl||det' (12), 'punct||xcomp' (10),
'xcomp||cc' (10), 'punct||ccomp' (10), 'punct||case' (9), 'acl||cc' (9),
'advmod||ccomp' (9), 'mark||acl' (9), 'punct||advcl' (8), 'obl||xcomp' (8),
'punct||acl' (8), 'amod||cc' (7), 'nsubj||advcl' (7), 'fixed||case' (7),
'obl||advcl' (7), 'obl||ccomp' (7), 'iobj||conj' (6), 'mark||xcomp' (6),
'advmod||cc' (6), 'nsubj||acl' (6), 'case||cop' (6), 'appos||cc' (6),
'compound||det' (6), 'csubj||cc' (5), 'nsubj||ccomp' (5), 'mark||cc' (5),
'mark||csubj' (4), 'cop||cc' (4), 'advcl||parataxis' (3), 'advmod||xcomp' (3),
'appos||obj' (3), 'det||cc' (3), 'case||amod' (3), 'advmod||acl' (3), 'obl||acl'
(3), 'advmod||advcl' (2), 'nsubj||xcomp' (2), 'flat||obj' (2), 'nummod||det'
(2), 'appos||conj' (2), 'obj||ccomp' (2), 'advcl||ccomp' (2), 'case||nmod' (2),
'obl||parataxis' (2), 'case||flat' (2), 'det||advcl' (2), 'advmod||parataxis'
(2), 'cop||conj' (2), 'det||obj' (1), 'nmod||obj' (1), 'acl||obj' (1),
'ccomp||conj' (1), 'obl||csubj' (1), 'advmod||csubj' (1), 'punct||csubj' (1),
'cop||ccomp' (1), 'case||appos' (1), 'fixed||mark' (1), 'aux||xcomp' (1),
'obj||xcomp' (1), 'aux||cc' (1), 'csubj:pass' (1), 'punct||flat' (1), 'obj||acl'
(1), 'flat||appos' (1), 'punct||parataxis' (1), 'amod||det' (1), 'punct||aux'
(1), 'appos||advmod' (1), 'advcl||advcl' (1), 'csubj||conj' (1), 'mark||cop'
(1), 'compound||aux' (1), 'det||ccomp' (1), 'nmod||conj' (1), 'advcl||acl' (1),
'amod||nummod' (1), 'amod||conj' (1), 'case||conj' (1), 'case||det' (1),
'obj||nmod' (1), 'nsubj||csubj' (1), 'obj||csubj' (1), 'obj||parataxis' (1)

================================== Summary ==================================
✔ 9 checks passed
✘ 15 errors

For the dataset that I am trying to work on:

 >> python -m spacy debug-data en en-ud-tweet-train.jsonl en-ud-tweet-dev.jsonl 

=========================== Data format validation ===========================
✔ Loaded en-ud-tweet-train.jsonl
✔ Loaded en-ud-tweet-dev.jsonl
✔ Training data JSON format is valid
✔ Development data JSON format is valid
✔ Corpus is loadable

=============================== Training stats ===============================
Training pipeline: tagger, parser, ner
Starting with blank model 'en'
1639 training docs
710 evaluation docs
✔ No overlap between training and evaluation data
⚠ Low number of examples to train from a blank model (1639)

============================== Vocab & Vectors ==============================
ℹ 24753 total words in the data (8564 unique)
ℹ No word vectors present in the model

========================== Named Entity Recognition ==========================
ℹ 0 new labels, 0 existing labels
24753 missing values (tokens with '-' label)
✔ Good amount of examples for all labels
✔ Examples without occurences available for all labels
✔ No entities consisting of or starting/ending with whitespace

=========================== Part-of-speech Tagging ===========================
ℹ 42 labels in data (57 labels in tag map)
✘ Label 'V_V' not found in tag map for language 'en'
✘ Label 'N_N' not found in tag map for language 'en'
✘ Label 'P_P' not found in tag map for language 'en'
✘ Label 'D_D' not found in tag map for language 'en'
✘ Label 'R_R' not found in tag map for language 'en'
✘ Label 'A_A' not found in tag map for language 'en'
✘ Label ',_,' not found in tag map for language 'en'
✘ Label 'X' not found in tag map for language 'en'
✘ Label 'PUNCT' not found in tag map for language 'en'
✘ Label 'NOUN' not found in tag map for language 'en'
✘ Label 'PRON' not found in tag map for language 'en'
✘ Label 'VERB' not found in tag map for language 'en'
✘ Label 'PART' not found in tag map for language 'en'
✘ Label 'ADP' not found in tag map for language 'en'
✘ Label 'CCONJ' not found in tag map for language 'en'
✘ Label 'ADJ' not found in tag map for language 'en'
✘ Label '~_~' not found in tag map for language 'en'
✘ Label '@_@' not found in tag map for language 'en'
✘ Label 'O_O' not found in tag map for language 'en'
✘ Label 'L_L' not found in tag map for language 'en'
✘ Label '&_&' not found in tag map for language 'en'
✘ Label '#_#' not found in tag map for language 'en'
✘ Label 'U_U' not found in tag map for language 'en'
✘ Label 'E_E' not found in tag map for language 'en'
✘ Label '!_!' not found in tag map for language 'en'
✘ Label 'PROPN' not found in tag map for language 'en'
✘ Label 'NUM' not found in tag map for language 'en'
✘ Label 'ADV' not found in tag map for language 'en'
✘ Label 'DET' not found in tag map for language 'en'
✘ Label 'AUX' not found in tag map for language 'en'
✘ Label 'INTJ' not found in tag map for language 'en'
✘ Label 'SCONJ' not found in tag map for language 'en'
✘ Label '^_^' not found in tag map for language 'en'
✘ Label '$_$' not found in tag map for language 'en'
✘ Label 'G_G' not found in tag map for language 'en'
✘ Label 'T_T' not found in tag map for language 'en'
✘ Label 'X_X' not found in tag map for language 'en'
✘ Label 'Z_Z' not found in tag map for language 'en'
✘ Label 'S_S' not found in tag map for language 'en'
✘ Label 'Y_Y' not found in tag map for language 'en'
✘ Label 'M_M' not found in tag map for language 'en'

============================= Dependency Parsing =============================
ℹ 57 labels in data

================================== Summary ==================================
✔ 9 checks passed
⚠ 1 warning
✘ 41 errors

(factmata) ➜  Tweebank git:(dev) ✗ python -m spacy debug-data en en-ud-tweet-train.jsonl en-ud-tweet-dev.jsonl --verbose

=========================== Data format validation ===========================
✔ Loaded en-ud-tweet-train.jsonl
✔ Loaded en-ud-tweet-dev.jsonl
✔ Training data JSON format is valid
✔ Development data JSON format is valid
✔ Corpus is loadable

=============================== Training stats ===============================
Training pipeline: tagger, parser, ner
Starting with blank model 'en'
1639 training docs
710 evaluation docs
✔ No overlap between training and evaluation data
⚠ Low number of examples to train from a blank model (1639)
It's recommended to use at least 2000 examples (minimum 100)

============================== Vocab & Vectors ==============================
ℹ 24753 total words in the data (8564 unique)
10 most common words: ':' (773), 'RT' (638), '.' (593), 'I' (398), 'the' (394),
'to' (367), ',' (345), 'a' (276), '!' (271), 'you' (269)
ℹ No word vectors present in the model

========================== Named Entity Recognition ==========================
ℹ 0 new labels, 0 existing labels
24753 missing values (tokens with '-' label)
✔ Good amount of examples for all labels
✔ Examples without occurences available for all labels
✔ No entities consisting of or starting/ending with whitespace

=========================== Part-of-speech Tagging ===========================
ℹ 42 labels in data (57 labels in tag map)
'NOUN' (2251), 'PUNCT' (2116), 'X' (1998), 'VERB' (1699), 'PRON' (1452), 'PROPN'
(1356), 'V_V' (1288), 'N_N' (1110), 'ADP' (1061), ',_,' (944), 'ADJ' (839),
'AUX' (789), 'DET' (728), 'ADV' (686), 'P_P' (671), 'O_O' (600), 'D_D' (484),
'^_^' (463), 'A_A' (439), '@_@' (412), 'R_R' (383), 'PART' (368), 'NUM' (285),
'~_~' (284), 'CCONJ' (272), 'SYM' (265), 'L_L' (239), '!_!' (225), 'INTJ' (160),
'SCONJ' (150), '$_$' (131), '&_&' (120), 'U_U' (118), 'E_E' (90), '#_#' (85),
'G_G' (69), 'T_T' (54), 'Z_Z' (43), 'X_X' (15), 'S_S' (8), 'M_M' (2), 'Y_Y' (1)
✘ Label 'V_V' not found in tag map for language 'en'
✘ Label 'R_R' not found in tag map for language 'en'
✘ Label 'A_A' not found in tag map for language 'en'
✘ Label 'P_P' not found in tag map for language 'en'
✘ Label 'D_D' not found in tag map for language 'en'
✘ Label ',_,' not found in tag map for language 'en'
✘ Label '#_#' not found in tag map for language 'en'
✘ Label 'ADJ' not found in tag map for language 'en'
✘ Label 'NOUN' not found in tag map for language 'en'
✘ Label 'PUNCT' not found in tag map for language 'en'
✘ Label 'X' not found in tag map for language 'en'
✘ Label 'PROPN' not found in tag map for language 'en'
✘ Label 'NUM' not found in tag map for language 'en'
✘ Label '@_@' not found in tag map for language 'en'
✘ Label '!_!' not found in tag map for language 'en'
✘ Label 'N_N' not found in tag map for language 'en'
✘ Label 'O_O' not found in tag map for language 'en'
✘ Label 'DET' not found in tag map for language 'en'
✘ Label 'PRON' not found in tag map for language 'en'
✘ Label 'AUX' not found in tag map for language 'en'
✘ Label 'CCONJ' not found in tag map for language 'en'
✘ Label 'ADP' not found in tag map for language 'en'
✘ Label '~_~' not found in tag map for language 'en'
✘ Label 'L_L' not found in tag map for language 'en'
✘ Label 'INTJ' not found in tag map for language 'en'
✘ Label 'VERB' not found in tag map for language 'en'
✘ Label 'SCONJ' not found in tag map for language 'en'
✘ Label '&_&' not found in tag map for language 'en'
✘ Label 'ADV' not found in tag map for language 'en'
✘ Label '$_$' not found in tag map for language 'en'
✘ Label 'G_G' not found in tag map for language 'en'
✘ Label 'E_E' not found in tag map for language 'en'
✘ Label '^_^' not found in tag map for language 'en'
✘ Label 'Z_Z' not found in tag map for language 'en'
✘ Label 'PART' not found in tag map for language 'en'
✘ Label 'U_U' not found in tag map for language 'en'
✘ Label 'T_T' not found in tag map for language 'en'
✘ Label 'X_X' not found in tag map for language 'en'
✘ Label 'M_M' not found in tag map for language 'en'
✘ Label 'S_S' not found in tag map for language 'en'
✘ Label 'Y_Y' not found in tag map for language 'en'

============================= Dependency Parsing =============================
ℹ 57 labels in data
'punct' (3250), 'ROOT' (2470), 'discourse' (2139), 'nsubj' (1940), 'case'
(1484), 'obj' (1188), 'advmod' (1132), 'det' (1046), 'obl' (855), 'amod' (796),
'compound' (785), 'list' (694), 'vocative' (615), 'cop' (582), 'mark' (577),
'aux' (557), 'parataxis' (522), 'nmod' (500), 'conj' (467), 'nmod:poss' (432),
'cc' (411), 'xcomp' (405), 'nummod' (298), 'advcl' (273), 'ccomp' (232), 'flat'
(190), 'compound:prt' (123), 'acl:relcl' (116), 'appos' (109), 'acl' (105),
'obl:tmod' (80), 'aux:pass' (48), 'nsubj:pass' (46), 'obl:npmod' (40), 'iobj'
(36), 'goeswith' (28), 'nmod:tmod' (27), 'det:predet' (25), 'csubj' (24),
'nmod:npmod' (23), 'expl' (23), 'flat:foreign' (18), 'fixed' (16), 'reparandum'
(8), 'appos||obj' (4), 'case||obl' (2), 'acl||obj' (2), 'cc:preconj' (1),
'case||nsubj:pass' (1), 'obl||amod' (1), 'conj||obl' (1), 'obj||xcomp' (1),
'dislocated||xcomp' (1), 'orphan' (1), 'dislocated' (1), 'cc||conj' (1),
'conj||aux' (1)

================================== Summary ==================================
✔ 9 checks passed
⚠ 1 warning
✘ 41 errors

How come the tags VERB, PRON, PROPN are labelled as not present in tag map for en?

gustavengstrom commented 5 years ago

I am getting the exact same error message. I'm on Python 3.7.1, macOS 10.14.4, spaCy 2.1.3.

juliamakogon commented 5 years ago

I think I'm hitting the same error when trying to evaluate a model (with both spaCy 2.1.3 and 2.1.1):

python -m spacy convert /home/julm/work/data/ud/UD_English-EWT-master/en_ewt-ud-test.conllu ud_en_ewt -n 10 -l en
python -m spacy evaluate en /home/julm/work/data/ud/ud_en_ewt/en_ewt-ud-test.jsonl

Traceback (most recent call last):
  File "/home/julm/anaconda3/envs/python3_spacy/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/julm/anaconda3/envs/python3_spacy/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/julm/anaconda3/envs/python3_spacy/lib/python3.6/site-packages/spacy/__main__.py", line 35, in <module>
    plac.call(commands[command], sys.argv[1:])
  File "/home/julm/anaconda3/envs/python3_spacy/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/julm/anaconda3/envs/python3_spacy/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/julm/anaconda3/envs/python3_spacy/lib/python3.6/site-packages/spacy/cli/evaluate.py", line 44, in evaluate
    corpus = GoldCorpus(data_path, data_path)
  File "gold.pyx", line 112, in spacy.gold.GoldCorpus.__init__
  File "gold.pyx", line 125, in spacy.gold.GoldCorpus.write_msgpack
KeyError: 1

By the way, evaluate works with UD Ukrainian json I've converted with spacy 2.0 (but with code change in convert)

gustavengstrom commented 5 years ago

@ines are you able to reproduce this error? It seems like we are all getting it from running the code in the official documentation (https://spacy.io/usage/training#spacy-train-cli). The repository UniversalDependencies/UD_Spanish-AnCora has not been updated in the last 5 months, so maybe the error is related to a spaCy version upgrade?

ines commented 5 years ago

> How come the tags VERB, PRON, PROPN are labelled as not present in tag map for en?

The tag map maps fine-grained tags from the treebank to coarse-grained universal tags – for example, in English, it'd map something like VBZ to VERB. In your treebank, the universal tags plus a bunch of custom ones are used. So essentially, you need to provide spaCy with a custom tag map that matches the tag set in the data.
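
To make the idea of a custom tag map concrete, here is a rough sketch in plain Python. It builds an identity mapping for tags in the data that are already universal POS tags and lists the remaining custom tags, which would need a hand-written mapping. The helper name and the {"pos": ...} value shape are assumptions for illustration; check the spaCy tag map documentation for the exact structure your version expects.

```python
# The 17 universal POS tags from Universal Dependencies.
UD_POS_TAGS = [
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
]


def build_identity_tag_map(tags_in_data):
    """Map universal tags to themselves and report the tags left over.

    Returns (tag_map, unmapped): tag_map covers tags that are already
    universal POS tags; unmapped lists custom tags that still need a
    manually chosen coarse-grained tag.
    """
    universal = set(UD_POS_TAGS)
    tag_map = {tag: {"pos": tag} for tag in tags_in_data if tag in universal}
    unmapped = sorted(set(tags_in_data) - universal)
    return tag_map, unmapped


# A few of the tags reported by debug-data for the Tweebank data above.
tags_in_data = ["NOUN", "VERB", "PRON", "V_V", "N_N"]
tag_map, unmapped = build_identity_tag_map(tags_in_data)
print(tag_map)
print(unmapped)
```

The custom tags such as V_V and N_N are deliberately left unmapped here, since the right coarse-grained tag for each one depends on the treebank's own tag set documentation.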

There are a few todos on this topic for us:

eivindbergem commented 5 years ago

I get the same error with UD_Norwegian-Bokmaal using spacy 2.1.3.

I have json-files generated with "spacy convert" from spacy 2.0.18 and I'm able to use these to train without error.

honnibal commented 5 years ago

I think the issue is that spacy convert changed behaviour. We should change it back for the next release.

spacy convert is currently defaulting to jsonl-formatted output, which is causing problems. If you set -t json it should work I think.
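
For context on the difference between the two formats: a .jsonl file holds one JSON document per line, while a .json file holds a single JSON value (here, a list of documents). The sketch below only illustrates that difference with the standard library; the function name and sample data are made up, and it has not been confirmed in this thread that a file rewritten this way is exactly what spacy train expects. Re-running spacy convert with -t json remains the suggested fix.

```python
import json


def jsonl_to_json_text(jsonl_text):
    """Turn JSONL text (one JSON object per line) into a single JSON list."""
    docs = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    return json.dumps(docs, indent=2)


# Hypothetical two-document JSONL input.
jsonl_text = '{"id": 0, "paragraphs": []}\n{"id": 1, "paragraphs": []}\n'

json_text = jsonl_to_json_text(jsonl_text)
docs = json.loads(json_text)
print(type(docs).__name__, len(docs))
```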

gustavengstrom commented 5 years ago

@honnibal : Thanks for looking into this so quickly! Problem solved!

xssChauhan commented 5 years ago

@honnibal Thank you for pointing it out. I would like to help out with this issue. Any references that can help me get started?

ines commented 5 years ago

@xssChauhan I think it could be as straightforward as changing the default value from "jsonl" to "json" here:

https://github.com/explosion/spaCy/blob/4d198a7e92f813fb9df2ade72fbeaf847284a7a0/spacy/cli/convert.py#L42

If you'd like to test this and submit a PR, that'd be great!

xssChauhan commented 5 years ago

@ines Opened #3583 to fix this.
