explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

ValueError in "spacy/pipeline/_parser_internals/ner.pyx", line 310, in spacy.pipeline._parser_internals.ner.BiluoPushDown.set_costs #6984

Closed · snthibaud closed this issue 3 years ago

snthibaud commented 3 years ago

How to reproduce the behaviour

I was trying to train an NER model with the following config:

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "ja"
pipeline = ["tok2vec","ner"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null

[nlp.tokenizer]
@tokenizers = "spacy.ja.JapaneseTokenizer"
split_mode = null

[components]

[components.ner]
factory = "ner"
moves = null
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tok2vec.model.encode.width}
attrs = ["ORTH","SHAPE"]
rows = [5000,2500]
include_static_vectors = false

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 2000
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
ents_per_type = null
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0

[pretraining]

[initialize]
vectors = null
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]
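
The [corpora] blocks above read binary .spacy files supplied via paths.train and paths.dev. For reference, a minimal sketch of how such a corpus can be built with DocBin (the text, offsets and file name here are placeholder assumptions, not my actual data):

import spacy
from spacy.tokens import DocBin

# Build the binary corpus that [corpora.train] / [corpora.dev] read.
# Texts, offsets and the output file name are placeholders.
nlp = spacy.blank("ja")
examples = [
    ("東京でスペイシーを使う", [(0, 2, "GPE")]),
]

db = DocBin()
for text, annotations in examples:
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in annotations:
        span = doc.char_span(start, end, label=label)
        # char_span returns None when the offsets don't match the token
        # boundaries produced by the Japanese (sudachipy) tokenizer
        if span is not None:
            ents.append(span)
    doc.ents = ents
    db.add(doc)
db.to_disk("./train.spacy")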

Then I encountered the following stacktrace:

Traceback (most recent call last):
  File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/python_38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/python_38/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/python_38/lib/python3.8/site-packages/spacy/__main__.py", line 4, in <module>
    setup_cli()
  File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/python_38/lib/python3.8/site-packages/spacy/cli/_util.py", line 68, in setup_cli
    command(prog_name=COMMAND)
  File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/python_38/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/python_38/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/python_38/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/python_38/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/python_38/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/python_38/lib/python3.8/site-packages/typer/main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/python_38/lib/python3.8/site-packages/spacy/cli/train.py", line 59, in train_cli
    train(nlp, output_path, use_gpu=use_gpu, stdout=sys.stdout, stderr=sys.stderr)
  File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/python_38/lib/python3.8/site-packages/spacy/training/loop.py", line 114, in train
    raise e
  File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/python_38/lib/python3.8/site-packages/spacy/training/loop.py", line 98, in train
    for batch, info, is_best_checkpoint in training_step_iterator:
  File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/python_38/lib/python3.8/site-packages/spacy/training/loop.py", line 194, in train_while_improving
    nlp.update(
  File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/python_38/lib/python3.8/site-packages/spacy/language.py", line 1106, in update
    proc.update(examples, sgd=None, losses=losses, **component_cfg[name])
  File "spacy/pipeline/transition_parser.pyx", line 366, in spacy.pipeline.transition_parser.Parser.update
  File "spacy/pipeline/transition_parser.pyx", line 478, in spacy.pipeline.transition_parser.Parser.get_batch_loss
  File "spacy/pipeline/_parser_internals/ner.pyx", line 310, in spacy.pipeline._parser_internals.ner.BiluoPushDown.set_costs
ValueError

The number of documents could be a bit high (~500,000).

Info about spaCy

adrianeboyd commented 3 years ago

Hmm, I'm not entirely sure at this point, but this kind of error can indicate that there's not enough (usable) training data. My first guess would be that your NER annotation might not align well with the tokenization from the JapaneseTokenizer, which uses sudachipy.

What is the output of the NER section of spacy debug data -V config.cfg (with the corpus path options set as necessary)?
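
If you want to check the alignment directly, a rough sketch (the text and offsets here are made-up placeholders) is to convert your character offsets to BILUO tags and look for "-", which marks entities that don't align with the tokenization:

import spacy
from spacy.training import offsets_to_biluo_tags

nlp = spacy.blank("ja")
text = "東京でスペイシーを使う"      # placeholder: use one of your training texts
entities = [(0, 2, "GPE")]          # placeholder: (start_char, end_char, label)

doc = nlp.make_doc(text)
tags = offsets_to_biluo_tags(doc, entities)
print(tags)  # a "-" tag means that entity doesn't align with the token boundaries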

no-response[bot] commented 3 years ago

This issue has been automatically closed because there has been no response to a request for more information from the original author. With only the information that is currently in the issue, there's not enough information to take action. If you're the original author, feel free to reopen the issue if you have or find the answers needed to investigate further.

imohitmayank commented 3 years ago

I was facing a similar issue; here is what I did. As suggested by @adrianeboyd, a quick run of spacy debug data -V /content/config.cfg --paths.train /content/train.spacy --paths.dev /content/eval.spacy surfaced several warnings and one error (see the attached screenshot of the debug output). In my case, the error was caused by one entity having trailing spaces; a simple .strip() on the entity text resolved the issue. A rough sketch of that cleanup follows below.
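
For anyone hitting the same thing, this is the kind of cleanup I mean (the helper name and annotation format are just illustrative assumptions, not spaCy API):

def strip_entity_spans(text, entities):
    # Trim leading/trailing whitespace from (start_char, end_char, label) spans
    # and shift the offsets so they still point at the stripped entity text.
    cleaned = []
    for start, end, label in entities:
        span_text = text[start:end]
        new_start = start + (len(span_text) - len(span_text.lstrip()))
        new_end = end - (len(span_text) - len(span_text.rstrip()))
        if new_start < new_end:  # drop spans that were only whitespace
            cleaned.append((new_start, new_end, label))
    return cleaned

text = "Apple Inc.  was founded in 1976."
print(strip_entity_spans(text, [(0, 11, "ORG")]))  # -> [(0, 10, 'ORG')]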

github-actions[bot] commented 2 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.