It appears that tok2vec is only loaded from an initialization if there is a pretraining
section in the config:
https://github.com/explosion/spaCy/blob/6ed423c16c99206ff2b81176d9565d0e1c1b7071/spacy/language.py#L1226
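In config terms, the observation is that the weights referenced below are only applied when a [pretraining] block is present (a minimal sketch of the relevant pieces; the path is illustrative):

```cfg
[paths]
init_tok2vec = "pretrain/model999.bin"

[initialize]
# weights produced by `spacy pretrain`
init_tok2vec = ${paths.init_tok2vec}

# without this section, the weights above appeared to be silently ignored
[pretraining]
```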
Update: after some debugging and reading, it seems that this is expected behavior. First, initialize a standard config with a tok2vec layer plus the downstream components (e.g. ner, parser, textcat). Then fill/update the config with the --pretraining CLI argument as described here. You should use the generated pretraining config file for both pretraining and further training. This is done to ensure the base tok2vec encoder is the same between pretraining and fine-tuning.
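For reference, the full workflow might look roughly like this (file and path names are illustrative, not from the original report):

```
# 1. create a base config with a shared tok2vec layer and an ner component
python -m spacy init config base.cfg --lang en --pipeline ner

# 2. fill the config, adding a [pretraining] block
python -m spacy init fill-config base.cfg config_pretrain.cfg --pretraining

# 3. pretrain the tok2vec weights on raw text
python -m spacy pretrain config_pretrain.cfg ./pretrain --paths.raw_text raw_text.jsonl

# 4. train with the same config, pointing init_tok2vec at a pretrained checkpoint
python -m spacy train config_pretrain.cfg --output ./trained \
    --paths.train train.spacy --paths.dev dev.spacy \
    --paths.init_tok2vec ./pretrain/model999.bin
```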
Unfortunately, I still get a config validation error when using the pretrain-filled config with the `spacy train` CLI command:

```
✘ Config validation error
pretraining -> optimizer instance of Optimizer expected
pretraining -> objective Promise(registry='architectures', name='spacy.PretrainCharacters.v1', args=[], kwargs={'maxout_pieces': 3, 'hidden_size': 300, 'n_characters': 4}) is not callable
pretraining -> batcher extra fields not permitted
pretraining -> component extra fields not permitted
pretraining -> corpus extra fields not permitted
pretraining -> dropout extra fields not permitted
pretraining -> layer extra fields not permitted
pretraining -> max_epochs extra fields not permitted
pretraining -> n_save_every extra fields not permitted
pretraining -> objective extra fields not permitted
pretraining -> optimizer extra fields not permitted
```
Thanks @AndriyMulyar, that looks like a potential bug, if you've just used a default config file. I'll have a look.
Hi @cverluise and @AndriyMulyar: it turns out that the spaCy v3 pretraining functionality was a bit broken. There was a bug in the schema validation, and an incorrect ordering of initializing weights also meant that the pretrained tok2vec weights weren't actually used.
Apologies for any inconvenience caused! I do hope you'll be able to give the pretraining functionality another spin when we've released the bug fix (hopefully soon)!
Hello, I have pretrained the "token to vector" (tok2vec) layer of the pipeline components on raw text using:

```
python -m spacy pretrain config_pretrain.cfg ./output --paths.raw_text text.jsonl
```

It successfully produces the .bin weight files.
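(For reference, those weights are passed on to `spacy train` via the `paths.init_tok2vec` setting; the checkpoint and path names in this sketch are illustrative:)

```
python -m spacy train config.cfg --output ./trained \
    --paths.train train.spacy --paths.dev dev.spacy \
    --paths.init_tok2vec ./output/model999.bin
```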
Now, when I compare my model output with and without the pre-trained vectors, the results seem to be exactly identical.
[screenshots of the training logs: one run without reference to the .bin file, one with reference to it]
Am I missing something here? @svlandeg, I'd be glad if you could have a look. Thanks.
@GladiatorX: please also share your `spacy info` output. The main thing to check is whether you have `include_static_vectors = True` in your config. However, based on your scores, it looks like you may have a tiny or simple dataset and the model is just overfitting immediately anyway.
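For reference, a sketch of the relevant config pieces for that setting (this assumes static vectors are actually provided, e.g. via `--paths.vectors`; the other values mirror the configs posted later in this thread):

```cfg
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,2500,2500,2500]
# switch this on so the embedding layer also mixes in the static vectors
include_static_vectors = true

[initialize]
# must point at actual vectors (the posted configs have vectors = null)
vectors = ${paths.vectors}
```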
I had just missed adding `include_static_vectors = True`. It does work now, thanks!
Hello,
first of all, a huge thanks for all your work and the new features supported by the v3. I'm sure it'll make a 💥 for the community in the coming months!
How to reproduce the behaviour
I'm trying to train a `ner` component with pre-trained vectors. Here is what I do (config files below): first I run `spacy pretrain`, then `spacy train` with and without the pretrained weights. As expected, the pretraining produces the model*.bin files (and runs for a very long time). Next, I compare the training output with and without the pre-trained vectors. I get the exact same sequence of losses and everything, suggesting that `init_tok2vec` does not change anything. Am I doing something wrong, or is there a bug?

Thanks in advance,
Cyril
en_t2vner.cfg
```cfg
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["tok2vec","ner"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.ner]
factory = "ner"
moves = null
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,2500,2500,2500]
include_static_vectors = false

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 2000
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
ents_per_type = null
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0

[pretraining]

[initialize]
vectors = null
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]
```

en_t2vner_pretraining.cfg
```cfg
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null
raw_text = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["tok2vec","ner"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.ner]
factory = "ner"
moves = null
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,2500,2500,2500]
include_static_vectors = false

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.pretrain]
@readers = "spacy.JsonlCorpus.v1"
path = ${paths.raw_text}
min_length = 5
max_length = 500
limit = 0

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 2000
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
ents_per_type = null
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0

[pretraining]
max_epochs = 1000
dropout = 0.2
n_save_every = null
component = "tok2vec"
layer = ""
corpus = "corpora.pretrain"

[pretraining.batcher]
@batchers = "spacy.batch_by_words.v1"
size = 3000
discard_oversize = false
tolerance = 0.2
get_length = null

[pretraining.objective]
@architectures = "spacy.PretrainCharacters.v1"
maxout_pieces = 3
hidden_size = 300
n_characters = 4

[pretraining.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = true
eps = 0.00000001
learn_rate = 0.001

[initialize]
vectors = null
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]
```

Your Environment