explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

ValueError: [E084] Error assigning label ID 4317129024397789502 to span: not in StringStore. #6868

Closed by baiziyuandyufei 3 years ago

baiziyuandyufei commented 3 years ago

How to reproduce the behaviour

  1. generate the base_config.cfg file (reference: https://nightly.spacy.io/usage/training#quickstart)
[paths]
train = null
dev = null

[system]
gpu_allocator = "pytorch"

[nlp]
lang = "zh"
pipeline = ["transformer","ner"]
tokenizer = {"@tokenizers": "spacy.Tokenizer.v1"}
batch_size = 128

[components]

[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "bert-base-chinese"
tokenizer_config = {"use_fast": true}

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0

[components.ner.model.tok2vec.pooling]
@layers = "reduce_mean.v1"

[corpora]

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 500

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0

[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"

[training.optimizer]
@optimizers = "Adam.v1"

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 5e-5

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256

[initialize]
vectors = null
  2. fill the base_config.cfg into config.cfg

python -m spacy init fill-config base_config.cfg config.cfg

  3. prepare the ner-token-perline.iob corpus
海 O
钓 O
比 O
赛 O
地 O
点 O
在 O
厦 B-LOC
门 I-LOC
与 O
金 B-LOC
门 I-LOC
之 O
间 O
的 O
海 O
域 O
。 O

这 O
座 O
依 O
山 O
傍 O
水 O
的 O
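
For reference, the token-per-line IOB format above maps each character to an `O`, `B-LABEL`, or `I-LABEL` tag. A small, self-contained sketch (plain Python, no spaCy required; the function name is illustrative) shows how those tags group into entities:

```python
def iob_to_entities(lines):
    """Collect (entity_text, label) pairs from token-per-line IOB lines."""
    entities = []
    current_chars, current_label = [], None
    for line in lines:
        if not line.strip():
            continue  # blank lines separate sentences
        token, tag = line.rsplit(" ", 1)
        if tag.startswith("B-"):
            if current_label:
                entities.append(("".join(current_chars), current_label))
            current_chars, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_chars.append(token)
        else:  # "O" (or an inconsistent tag) closes any open entity
            if current_label:
                entities.append(("".join(current_chars), current_label))
            current_chars, current_label = [], None
    if current_label:
        entities.append(("".join(current_chars), current_label))
    return entities

sample = ["厦 B-LOC", "门 I-LOC", "与 O", "金 B-LOC", "门 I-LOC"]
print(iob_to_entities(sample))  # [('厦门', 'LOC'), ('金门', 'LOC')]
```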
  4. convert the corpus to the .spacy format

python -m spacy convert input_file_path output_file_path -c iob -n 10
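
A .spacy file produced by the convert command is a serialized DocBin. As a sanity check that the gold entities survived conversion, here is a self-contained sketch of the same roundtrip (in memory here; reading your actual ./train.spacy path instead is the obvious variation, assuming spaCy v3):

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("zh")  # the character-based tokenizer is the zh default
doc = nlp("厦门与金门")
doc.ents = [doc.char_span(0, 2, label="LOC"), doc.char_span(3, 5, label="LOC")]

doc_bin = DocBin(docs=[doc])
data = doc_bin.to_bytes()  # same serialization a .spacy file uses on disk
restored = list(DocBin().from_bytes(data).get_docs(nlp.vocab))
print([(e.text, e.label_) for e in restored[0].ents])  # [('厦门', 'LOC'), ('金门', 'LOC')]
```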

  5. train the model

python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy

  6. test the model

The test code:

import spacy
from spacy.lang.zh import Chinese

nlp = spacy.load("output/model-best")
nlp.tokenizer = Chinese().tokenizer
doc = nlp("红楼中的贾宝玉")

The error is:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/git_code/2021-?ʷ?????/spacy/spaCy-3.0.0rc5/spacy/language.py", line 985, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))
  File "spacy/pipeline/transition_parser.pyx", line 168, in spacy.pipeline.transition_parser.Parser.__call__
  File "spacy/pipeline/transition_parser.pyx", line 278, in spacy.pipeline.transition_parser.Parser.set_annotations
  File "spacy/pipeline/_parser_internals/ner.pyx", line 243, in spacy.pipeline._parser_internals.ner.BiluoPushDown.set_annotations
  File "spacy/tokens/span.pyx", line 105, in spacy.tokens.span.Span.__cinit__
ValueError: [E084] Error assigning label ID 4317129024397789502 to span: not in StringStore.
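
For context on what E084 means: spans store their label as a 64-bit hash ID in the vocab's StringStore, and the error fires when that ID cannot be resolved back to a string. A minimal sketch of the mapping, using spaCy's StringStore directly:

```python
from spacy.strings import StringStore

store = StringStore(["LOC"])       # register a label
label_id = store["LOC"]            # labels are stored as 64-bit hash IDs
print(store[label_id])             # a known ID resolves back to "LOC"
print("UNSEEN" in store)           # False: an unregistered label has no entry
```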

Your Environment

adrianeboyd commented 3 years ago

Try leaving out this line? The default tokenizer should be the character-based one, and replacing it could be leading to problems with the vocab:

nlp.tokenizer = Chinese().tokenizer
baiziyuandyufei commented 3 years ago

@adrianeboyd If I don't set nlp.tokenizer, then for doc = nlp("红楼中的贾宝玉"), print([token for token in doc]) outputs ['红楼中的贾宝玉']

adrianeboyd commented 3 years ago

Hmm, the tokenizer should be configured correctly in the config file or mismatches between training vs. runtime tokenization will cause problems with the models anyway. What does the evaluation look like while training?

adrianeboyd commented 3 years ago

Ah, I think the tokenizer config is incorrect, so I doubt the model trained correctly, either. The config should look like this:

[nlp]
lang = "zh"

...

[nlp.tokenizer]
@tokenizers = "spacy.zh.ChineseTokenizer"
segmenter = "char"

...

[initialize.tokenizer]
pkuseg_model = null
pkuseg_user_dict = "default"

We need to fix the quickstart here, it looks like it doesn't handle the language-specific tokenizers correctly.

If you want to see the basics for the tokenizer config, you can also try this:

from spacy.lang.zh import Chinese
nlp = Chinese()
print(nlp.config.to_str())
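
With the character segmenter (the zh default), each character becomes its own token, which is what a model trained on the per-character IOB corpus expects at runtime. A quick sketch:

```python
from spacy.lang.zh import Chinese

nlp = Chinese()  # defaults to segmenter = "char", no pkuseg model needed
doc = nlp("红楼中的贾宝玉")
print([token.text for token in doc])  # ['红', '楼', '中', '的', '贾', '宝', '玉']
```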
adrianeboyd commented 3 years ago

Okay, #6870 should fix this in the quickstart template. The changes should be on the website later today, thanks for the bug report!

baiziyuandyufei commented 3 years ago

@adrianeboyd Thank you! After modifying the base_config.cfg generated by https://spacy.io/usage/training, it works well. The modified base_config.cfg is below:

[paths]
train = null
dev = null

[system]
gpu_allocator = "pytorch"

[nlp]
lang = "zh"
pipeline = ["transformer","ner"]
# tokenizer = {"@tokenizers": "spacy.Tokenizer.v1"}
batch_size = 128

[nlp.tokenizer]
@tokenizers = "spacy.zh.ChineseTokenizer"
segmenter = "char"

[components]

[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "bert-base-chinese"
tokenizer_config = {"use_fast": true}

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0

[components.ner.model.tok2vec.pooling]
@layers = "reduce_mean.v1"

[corpora]

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 500

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0

[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"

[training.optimizer]
@optimizers = "Adam.v1"

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 5e-5

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256

[initialize]
vectors = null

[initialize.tokenizer]
pkuseg_model = null
pkuseg_user_dict = "default"
github-actions[bot] commented 2 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.