explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io

init_tok2vec does not seem to have an impact on model's training #6952

Closed · cverluise closed this issue 3 years ago

cverluise commented 3 years ago

Hello,

First of all, a huge thanks for all your work and the new features supported by v3. I'm sure it'll make a 💥 for the community in the coming months!

How to reproduce the behaviour

I'm trying to train an `ner` component with pretrained tok2vec weights. Here is what I do (config files below):

```bash
spacy pretrain en_t2vner_pretraining.cfg en_gbpatentxx --paths.raw_text raw_gbpatentxx.jsonl
```

As expected, this produces the `model*.bin` files (and runs for a very long time).

Next, I compare the training output with and without the pretrained weights.

```bash
# without pretrained weights
spacy train en_t2vner.cfg --paths.train train_gpatent01.spacy --paths.dev test_gpatent01.spacy

# with pretrained weights
spacy train en_t2vner.cfg --paths.train train_gpatent01.spacy --paths.dev test_gpatent01.spacy --paths.init_tok2vec en_gbpatentxx/model999.bin

# NB: dry run
```

I get exactly the same sequence of losses in both cases, suggesting that `init_tok2vec` does not change anything. Am I doing something wrong, or is this a bug?

Thanks in advance,

Cyril

en_t2vner.cfg:

```cfg
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["tok2vec","ner"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.ner]
factory = "ner"
moves = null
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,2500,2500,2500]
include_static_vectors = false

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 2000
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
ents_per_type = null
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0

[pretraining]

[initialize]
vectors = null
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]
```
en_t2vner_pretraining.cfg:

```cfg
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null
raw_text = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["tok2vec","ner"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.ner]
factory = "ner"
moves = null
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,2500,2500,2500]
include_static_vectors = false

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.pretrain]
@readers = "spacy.JsonlCorpus.v1"
path = ${paths.raw_text}
min_length = 5
max_length = 500
limit = 0

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 2000
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
ents_per_type = null
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0

[pretraining]
max_epochs = 1000
dropout = 0.2
n_save_every = null
component = "tok2vec"
layer = ""
corpus = "corpora.pretrain"

[pretraining.batcher]
@batchers = "spacy.batch_by_words.v1"
size = 3000
discard_oversize = false
tolerance = 0.2
get_length = null

[pretraining.objective]
@architectures = "spacy.PretrainCharacters.v1"
maxout_pieces = 3
hidden_size = 300
n_characters = 4

[pretraining.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = true
eps = 0.00000001
learn_rate = 0.001

[initialize]
vectors = null
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]
```
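
One way to verify whether `init_tok2vec` actually changed anything is to train the two runs into separate output directories and compare the serialized tok2vec weights directly. A minimal sketch, assuming `--output` was passed to `spacy train` and using placeholder directory names (`out_plain`, `out_init`):

```bash
# Compare the tok2vec weights of the two trained pipelines:
# identical bytes mean init_tok2vec had no effect.
python - <<'EOF'
import spacy

nlp_plain = spacy.load("out_plain/model-best")  # trained without init_tok2vec
nlp_init = spacy.load("out_init/model-best")    # trained with init_tok2vec

same = (nlp_plain.get_pipe("tok2vec").model.to_bytes()
        == nlp_init.get_pipe("tok2vec").model.to_bytes())
print("tok2vec weights identical:", same)
EOF
```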


AndriyMulyar commented 3 years ago

It appears that the tok2vec weights are only loaded at initialization if there is a [pretraining] section in the config: https://github.com/explosion/spaCy/blob/6ed423c16c99206ff2b81176d9565d0e1c1b7071/spacy/language.py#L1226

Update: after some debugging and reading, it seems that this is expected behavior. First, initialize a standard config with a tok2vec layer plus the task layers (ner, parser, classification). Then fill/update the config with the --pretraining CLI argument as described here. You should use the generated pretraining config file for both pretraining and further training. This ensures the base tok2vec encoder is the same between pretraining and fine-tuning.
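
Concretely, that workflow looks roughly like the following; this is a sketch with placeholder file and directory names (`base.cfg`, `raw.jsonl`, `pretrain_out`, etc.):

```bash
# 1. Create a base config with a tok2vec + ner pipeline
python -m spacy init config base.cfg --lang en --pipeline tok2vec,ner

# 2. Fill in defaults and add the [pretraining] block
python -m spacy init fill-config base.cfg config_pretrain.cfg --pretraining

# 3. Pretrain the tok2vec layer on raw text
python -m spacy pretrain config_pretrain.cfg ./pretrain_out --paths.raw_text raw.jsonl

# 4. Train with the same config, initializing from the pretrained weights
python -m spacy train config_pretrain.cfg --output ./train_out \
    --paths.train train.spacy --paths.dev dev.spacy \
    --paths.init_tok2vec ./pretrain_out/model999.bin
```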

Unfortunately, I still get a config validation error when using the pretraining-filled config with the `spacy train` CLI command:

```
✘ Config validation error
pretraining -> optimizer      instance of Optimizer expected
pretraining -> objective      Promise(registry='architectures', name='spacy.PretrainCharacters.v1', args=[], kwargs={'maxout_pieces': 3, 'hidden_size': 300, 'n_characters': 4}) is not callable
pretraining -> batcher        extra fields not permitted
pretraining -> component      extra fields not permitted
pretraining -> corpus         extra fields not permitted
pretraining -> dropout        extra fields not permitted
pretraining -> layer          extra fields not permitted
pretraining -> max_epochs     extra fields not permitted
pretraining -> n_save_every   extra fields not permitted
pretraining -> objective      extra fields not permitted
pretraining -> optimizer      extra fields not permitted
```
svlandeg commented 3 years ago

Thanks @AndriyMulyar, that looks like a potential bug if you've just used a default config file. I'll have a look.

svlandeg commented 3 years ago

Hi @cverluise and @AndriyMulyar: it turns out that the spaCy v3 pretraining functionality was a bit broken. There was a bug in the schema validation, and an incorrect ordering of weight initialization also meant that the pretrained tok2vec weights weren't actually used.

Apologies for any inconvenience caused! I do hope you'll be able to give the pretraining functionality another spin when we've released the bug fix (hopefully soon)!

GladiatorX commented 3 years ago

Hello, I have pretrained the "token to vector" (tok2vec) layer of the pipeline components on raw text using:

```bash
python -m spacy pretrain config_pretrain.cfg ./output --paths.raw_text text.jsonl
```

It successfully produces a `.bin` file.

Now, when I compare my model's output with and without the pretrained weights, the results are exactly identical.

[screenshot: training log without reference to the .bin file]

[screenshot: training log with reference to the .bin file after pretraining]

Am I missing something here? @svlandeg, I'd be glad if you could have a look. Thanks.

polm commented 3 years ago

@GladiatorX

  1. Don't @ maintainers to get their attention
  2. Don't post screenshots of text, copy/paste the text
  3. Include your config and the output of `spacy info`

The main thing to check is whether you have `include_static_vectors = true` in your config. However, based on your scores, it looks like you may have a tiny or simple dataset, and the model is just overfitting immediately anyway.
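
For reference, wiring static vectors into training takes two steps; a rough sketch with placeholder names (`my_vectors.txt.gz`, `./my_vectors`, `config.cfg`), assuming `include_static_vectors = true` is set under `[components.tok2vec.model.embed]`:

```bash
# Convert word vectors (e.g. a .txt/.txt.gz file) into a spaCy vectors directory
python -m spacy init vectors en my_vectors.txt.gz ./my_vectors

# Train with the static vectors; the config must also set
# include_static_vectors = true under [components.tok2vec.model.embed]
python -m spacy train config.cfg --output ./out \
    --paths.train train.spacy --paths.dev dev.spacy \
    --initialize.vectors ./my_vectors
```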

GladiatorX commented 2 years ago

I had just missed adding `include_static_vectors = true`. It does work now, thanks.

github-actions[bot] commented 2 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.