explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

spacy train attempts to reload Examples that were already supplied to ner. #7163

Closed source19069 closed 3 years ago

source19069 commented 3 years ago

Steps I used that produced this error:

I am using spacy v3 to train a model.

I have written code to package the training annotations as Example objects and then add those Examples to a ner using the EntityRecognizer.initialize() method. The snippet of code used to create the ner follows:

```python
examples, nlp = make_examples()  # my helper: returns a dict of Examples plus the nlp object
ner = nlp.get_pipe("ner")
ner.add_label("mylabel")
get_examples = lambda: iter(list(examples.values())[0])
ner.initialize(get_examples, nlp=nlp)
ner.to_disk("path to disk")
```
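make_examples() above is my own helper rather than a spaCy API; roughly, it builds the Example objects along these lines (a simplified sketch with made-up annotation data):

```python
import spacy
from spacy.training import Example

def make_examples():
    # simplified sketch: the real helper reads my annotation files
    nlp = spacy.blank("en")
    nlp.add_pipe("ner")
    raw = [("Acme Corp hired Jane.", [(0, 9, "mylabel")])]  # made-up sample
    examples = {
        "train": [
            Example.from_dict(nlp.make_doc(text), {"entities": ents})
            for text, ents in raw
        ]
    }
    return examples, nlp
```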

The code in the transition-based parser parent class of EntityRecognizer shows the Examples being added to the model in the initialize() method, so the serialized ner should contain those Examples.
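One way to sanity-check that assumption would be to inspect what EntityRecognizer.to_disk() actually wrote (a quick sketch, reusing the placeholder path from the snippet above):

```python
import os

# list whatever ner.to_disk() produced, to see whether the Examples passed
# to initialize() show up anywhere in the serialized component
print(sorted(os.listdir("path to disk")))
```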

However, when I run spacy train, specifying the directory of the serialized ner, I receive the following error: [E923] It looks like there is no proper sample data to initialize the Model of component 'ner'. This is likely a bug in spaCy, so feel free to open an issue: https://github.com/explosion/spaCy/issues

It appears to want to initialize the Examples all over again, when they should already be there.

The spacy train log, the error, and the contents of the config follow below. The config was generated using init config, and I commented out the lines that were not related to the ner pipeline component.

There is also this message in the log: UserWarning: [W090] Could not locate any .spacy files in path 'ner.spacy'. What exactly is it looking for?
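For context, the .spacy files that warning refers to are serialized DocBin collections, which is what the [corpora] readers in the config load from the paths given as --paths.train and --paths.dev. A minimal sketch of writing one from the Example objects created above (the output file name here just mirrors the path passed on the command line):

```python
from spacy.tokens import DocBin

# a .spacy file is a serialized DocBin of annotated Docs; spacy train's
# corpus reader loads these from the train/dev paths
doc_bin = DocBin()
for example in list(examples.values())[0]:   # the Examples built earlier
    doc_bin.add(example.reference)           # example.reference is the gold-annotated Doc
doc_bin.to_disk("./ner.spacy")
```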

The text of the error is ambiguous about where the problem actually lies: is there truly a bug in spaCy, or is there something I missed? Right now this is a hard stop. Can you please help?

Environment

$ python -m spacy info --markdown

Info about spaCy

The log and error generated by spacy train:

```
$ python -m spacy train config.cfg --paths.train ./ner.spacy --paths.dev ./ner.spacy
[i] Using CPU

=========================== Initializing pipeline ===========================
Set up nlp object from config
Pipeline: ['ner']
Created vocabulary
Added vectors: en_core_web_sm
Finished initializing nlp object
D:\ProgramData\...\lib\site-packages\spacy\training\corpus.py:76: UserWarning: [W090] Could not locate any .spacy files in path 'ner.spacy'.
  warnings.warn(Warnings.W090.format(path=orig_path, format=file_type))
Traceback (most recent call last):
  File "D:\ProgramData\Anaconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "D:\ProgramData\Anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "D:\...\lib\site-packages\spacy\__main__.py", line 4, in <module>
    setup_cli()
  File "D:\...\lib\site-packages\spacy\cli\_util.py", line 68, in setup_cli
    command(prog_name=COMMAND)
  File "D:\...\lib\site-packages\click\core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "D:\...\lib\site-packages\click\core.py", line 782, in main
    rv = self.invoke(ctx)
  File "D:\...\lib\site-packages\click\core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "D:\...\lib\site-packages\click\core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "D:\...\lib\site-packages\click\core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "D:\...\lib\site-packages\typer\main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "D:\...\lib\site-packages\spacy\cli\train.py", line 56, in train_cli
    nlp = init_nlp(config, use_gpu=use_gpu)
  File "D:\...\lib\site-packages\spacy\training\initialize.py", line 70, in init_nlp
    nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer)
  File "D:\...\lib\site-packages\spacy\language.py", line 1246, in initialize
    proc.initialize(get_examples, nlp=self, **p_settings)
  File "spacy\pipeline\transition_parser.pyx", line 530, in spacy.pipeline.transition_parser.Parser.initialize
AssertionError: [E923] It looks like there is no proper sample data to initialize the Model of component 'ner'. This is likely a bug in spaCy, so feel free to open an issue: https://github.com/explosion/spaCy/issues
```

The spacy train config file contents:

```ini
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["ner"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.ner]
factory = "ner"
moves = null
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,2500,2500,2500]
include_static_vectors = true

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 256
depth = 8
window_size = 1
maxout_pieces = 3

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 2000
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
ents_per_type = null
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0

[pretraining]

[initialize]
vectors = "en_core_web_sm"
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]
```
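As an extra check, the [corpora] blocks above use the spacy.Corpus.v1 reader; the same reader can be exercised directly from Python to confirm that the path passed as paths.train actually yields Examples (a small sketch, assuming ./ner.spacy has been written as a DocBin file as described earlier):

```python
import spacy
from spacy.training import Corpus

# mirror what the [corpora.train] reader does at training time: stream
# annotated Docs from the .spacy file and pair them with an nlp object
nlp = spacy.blank("en")
corpus = Corpus("./ner.spacy")
examples = list(corpus(nlp))
print(f"loaded {len(examples)} examples")
```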

source19069 commented 3 years ago

I determined why this is happening; closing. Thank you.

github-actions[bot] commented 2 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.