I have written code to package the training annotations as Example objects and then added those Examples to an ner component using the EntityRecognizer.initialize() method. A snippet of the code used to create the ner follows:
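# make_examples() is my own helper that builds the Example objects and the nlp pipeline
examples, nlp = make_examples()
ner = nlp.get_pipe("ner")
ner.add_label("mylabel")
# initialize() expects a callable that returns an iterable of Example objects
get_examples = lambda: iter(list(examples.values())[0])
ner.initialize(get_examples, nlp=nlp)
ner.to_disk("path to disk")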
The code in the transition-based Parser class, the parent class of EntityRecognizer, shows the Examples being passed to the model in the initialize() method, so I expected the serialized ner to contain those Examples.
However, when I run spacy train, specifying the directory of the serialized ner, I receive error 923:
[E923] It looks like there is no proper sample data to initialize the Model of component 'ner'. This is likely a bug in spaCy, so feel free to open an issue: https://github.com/explosion/spaCy/issues
It appears to want to initialize the Examples all over again, when they should already be there.
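One way to check this expectation is to look at what to_disk() actually writes for an initialized ner. The following is a minimal sketch, assuming a blank English pipeline and a single made-up sentence; the helper name serialized_ner_files and the sample text are hypothetical and not part of my real code:

```python
import tempfile
from pathlib import Path

import spacy
from spacy.training import Example


def serialized_ner_files(text, entities):
    """Initialize a fresh ner on one Example, save it, and list what was written."""
    nlp = spacy.blank("en")
    ner = nlp.add_pipe("ner")
    example = Example.from_dict(nlp.make_doc(text), {"entities": entities})
    # initialize() receives a callable that returns an iterable of Examples
    ner.initialize(lambda: [example], nlp=nlp)
    with tempfile.TemporaryDirectory() as tmp_dir:
        out_dir = Path(tmp_dir) / "ner"
        ner.to_disk(out_dir)
        # Return the names of the files/directories created by to_disk()
        return sorted(p.name for p in out_dir.iterdir())


if __name__ == "__main__":
    # Hypothetical sample sentence and label, for illustration only
    print(serialized_ner_files("Acme Corp hired Jane Doe.", [(0, 9, "mylabel")]))
```

Comparing the printed names with what spacy train expects at --paths.train should show whether the serialized component can stand in for training data.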
The spacy train log and error, and the contents of the config, follow below. The config was generated using ifill init config, and I commented out the lines that were not related to the ner pipeline component.
There is also this message in the log: UserWarning: [W090] Could not locate any .spacy files in path 'ner_.spacy'. What exactly is it looking for?
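For context, .spacy is the extension spaCy v3 uses for files written by DocBin, and the spacy.Corpus.v1 reader named in the config reads that format. Below is a minimal sketch of producing such a file from character-offset annotations. TRAIN_DATA, build_docbin and the output path are hypothetical names chosen for illustration, and the annotation layout is an assumption rather than my actual data:

```python
import spacy
from spacy.tokens import DocBin

# Hypothetical annotations: (text, [(start_char, end_char, label), ...])
TRAIN_DATA = [
    ("Acme Corp hired Jane Doe.", [(0, 9, "mylabel")]),
]


def build_docbin(nlp, data):
    """Convert character-offset annotations into a DocBin of annotated Docs."""
    doc_bin = DocBin()
    for text, spans in data:
        doc = nlp.make_doc(text)
        ents = []
        for start, end, label in spans:
            # char_span returns None when the offsets do not align with tokens
            span = doc.char_span(start, end, label=label)
            if span is not None:
                ents.append(span)
        doc.ents = ents
        doc_bin.add(doc)
    return doc_bin


if __name__ == "__main__":
    nlp = spacy.blank("en")
    # The file name is illustrative; it would be passed as --paths.train
    build_docbin(nlp, TRAIN_DATA).to_disk("./train.spacy")
```

A file written this way could then be given to spacy train through --paths.train and --paths.dev in place of the serialized component directory.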
The text of the error leaves it unclear where the problem actually lies: is there truly a bug in spacy, or is there something I missed? Right now, this is a hard stop. Can you please help?
Environment
$ python -m spacy info --markdown
Info about spaCy
spaCy version: 3.0.1
Platform: Windows-2012ServerR2-6.3.9600-SP0
Python version: 3.7.6
Pipelines: en_core_web_sm (3.0.0)
The log and error generated by spacy train:
$ python -m spacy train config.cfg --paths.train ./ner.spacy --paths.dev ./ner.spacy
Set up nlp object from config
Pipeline: ['ner']
Created vocabulary
Added vectors: en_core_web_sm
Finished initializing nlp object
D:\ProgramData...\lib\site-packages\spacy\training\corpus.py:76: UserWarning: [W090] Could not locate any .spacy files in path 'ner.spacy'.
warnings.warn(Warnings.W090.format(path=orig_path, format=file_type))
Traceback (most recent call last):
File "D:\ProgramData\Anaconda3\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "D:\ProgramData\Anaconda3\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "D:...\lib\site-packages\spacy\__main__.py", line 4, in <module>
setup_cli()
File "D:...\lib\site-packages\spacy\cli\_util.py", line 68, in setup_cli
command(prog_name=COMMAND)
File "D:...\lib\site-packages\click\core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "D:...\lib\site-packages\click\core.py", line 782, in main
rv = self.invoke(ctx)
File "D:....\lib\site-packages\click\core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "D:...\lib\site-packages\click\core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "D:...\lib\site-packages\click\core.py", line 610, in invoke
return callback(*args, **kwargs)
File "D:...\lib\site-packages\typer\main.py", line 497, in wrapper
return callback(**use_params) # type: ignore
File "D:...\lib\site-packages\spacy\cli\train.py", line 56, in train_cli
nlp = init_nlp(config, use_gpu=use_gpu)
File "D:...\lib\site-packages\spacy\training\initialize.py", line 70, in init_nlp
nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer)
File "D:...\lib\site-packages\spacy\language.py", line 1246, in initialize
proc.initialize(get_examples, nlp=self, **p_settings)
File "spacy\pipeline\transition_parser.pyx", line 530, in spacy.pipeline.transition_parser.Parser.initialize
AssertionError: [E923] It looks like there is no proper sample data to initialize the Model of component 'ner'. This is likely a bug in spaCy, so feel free to open an issue: https://github.com/explosion/spaCy/issues
[i] Using CPU
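Since W090 appears just before the traceback, a quick pre-flight check of the path passed to --paths.train may help narrow things down. This is a minimal sketch using only the standard library. It assumes the corpus reader accepts either a single .spacy file or a directory that is searched for .spacy files; that is my reading of the warning text, not a copy of spaCy's own logic, and find_spacy_files is a hypothetical helper name:

```python
from pathlib import Path


def find_spacy_files(path):
    """Return the .spacy files found at `path` (a single file or a directory)."""
    path = Path(path)
    if path.is_file():
        return [path] if path.name.endswith(".spacy") else []
    if path.is_dir():
        # Search the directory recursively for files ending in .spacy
        return sorted(p for p in path.rglob("*.spacy") if p.is_file())
    # The path does not exist
    return []


if __name__ == "__main__":
    found = find_spacy_files("./ner.spacy")
    print(f"{len(found)} .spacy file(s) found")
```

An empty result for the training path would be consistent with the W090 warning in the log above.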
Steps I used that produced this error:
I am using spacy v3 to train a model.
=========================== Initializing pipeline ===========================
The spacy train config file contents:
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null
[system]
gpu_allocator = null
seed = 0
[nlp]
lang = "en"
pipeline = ["ner"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
[components]
[components.ner]
factory = "ner"
moves = null
update_with_oracle_cut_size = 100
[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null
[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"
[components.tok2vec]
factory = "tok2vec"
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,2500,2500,2500]
include_static_vectors = true
[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 256
depth = 8
window_size = 1
maxout_pieces = 3
[corpora]
[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 2000
gold_preproc = false
limit = 0
augmenter = null
[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
before_to_disk = null
[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null
[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0
[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false
[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001
[training.score_weights]
ents_per_type = null
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0
[pretraining]
[initialize]
vectors = "en_core_web_sm"
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null
[initialize.components]
[initialize.tokenizer]
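To illustrate how the --paths.train override on the command line reaches the corpus reader, here is a minimal sketch using thinc's Config class, which spaCy's config system builds on. CONFIG_TEXT is a trimmed-down excerpt of the config above, and resolve_train_path and the example path are hypothetical names used only for illustration:

```python
from thinc.api import Config

# Trimmed-down excerpt of the config above, for illustration only
CONFIG_TEXT = """
[paths]
train = null
dev = null

[corpora]

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
"""


def resolve_train_path(config_text, train_path):
    """Apply a paths.train override and return the interpolated corpus path."""
    # Overrides use dot notation, mirroring --paths.train on the command line
    config = Config().from_str(config_text, overrides={"paths.train": train_path})
    return config["corpora"]["train"]["path"]


if __name__ == "__main__":
    print(resolve_train_path(CONFIG_TEXT, "./train.spacy"))
```

The point of the sketch is that whatever is passed as --paths.train becomes the path the spacy.Corpus.v1 reader searches for training data.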