explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Sentencizer is not called on update during training #7288

Closed. ANenashev closed this issue 3 years ago.

ANenashev commented 3 years ago

How to reproduce the behaviour

I'm trying to train a custom text classifier on top of BERT embeddings. I use spacy-transformers.sent_spans.v1, which requires sentence boundaries to be set, so I added the sentencizer to the beginning of the pipeline. I ran python -m spacy train training/config.cfg --output en_clf -c ./bert_clauses_classifier/clf_pipe.py -V with the following config:

[paths]
train = train.spacy
dev = eval.spacy
vectors = null
init_tok2vec = null

[system]
seed = 0
gpu_allocator = "pytorch"

[nlp]
lang = "en"
pipeline = ["sentencizer","transformer","dev_bert_clauses_classifier_ref"]
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 1000

[nlp.tokenizer]
@tokenizers = "spacy.Tokenizer.v1"

[components]

[components.sentencizer]
factory = "sentencizer"
punct_chars = null

[components.transformer]
factory = "transformer"
max_batch_items = 4096
set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "nlpaueb/legal-bert-base-uncased"
tokenizer_config = {"use_fast": true}

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.sent_spans.v1"

[components.dev_bert_clauses_classifier_ref]
factory = "dev_bert_clauses_classifier_ref"
labels_limit = 4

[components.dev_bert_clauses_classifier_ref.model]
@architectures = "dev_clauses_classifier_model.v1"

[components.dev_bert_clauses_classifier_ref.model.create_clauses_tensors]
@architectures = "dev_clause_tensor.v1"

[components.dev_bert_clauses_classifier_ref.model.create_clauses_tensors.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 0.0

[components.dev_bert_clauses_classifier_ref.model.create_clauses_tensors.tok2vec.pooling]
@layers = "reduce_mean.v1"

[components.dev_bert_clauses_classifier_ref.model.create_clauses_tensors.get_clauses]
@span_getters = "spacy-transformers.doc_spans.v1"

[components.dev_bert_clauses_classifier_ref.model.classifier_model]
@architectures = "dev_lstm_classifier_model.v1"
embeddings_dim = 768
rnn_hidden_dim = 100
nO = 24
bidirectional = True

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
gold_preproc = false
max_length = 0
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
gold_preproc = false
max_length = 0
limit = 0
augmenter = null

[training]
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1000
max_epochs = 0
max_steps = 4000
eval_frequency = 200
frozen_components = []
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = true
get_length = null
size = 1500

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
cats_score = 1.0
sent_p = 0.0
sent_r = 0.0
sent_f = 0.0

[pretraining]

[initialize]

[initialize_components]

I'm getting the following error:

ℹ Using CPU

=========================== Initializing pipeline ===========================
Set up nlp object from config
Loading corpus from path: data/clauses_classification_common_eval.spacy
Loading corpus from path: data/clauses_classification_common_train.spacy
Pipeline: ['sentencizer', 'transformer', 'dev_bert_clauses_classifier_ref']
Created vocabulary
Finished initializing nlp object
Initialized pipeline components: ['sentencizer', 'transformer', 'dev_bert_clauses_classifier_ref']
Loading corpus from path: data/clauses_classification_common_eval.spacy
Loading corpus from path: data/clauses_classification_common_train.spacy
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['sentencizer', 'transformer',
'dev_bert_clauses_classifier_ref']
ℹ Initial learn rate: 0.001
E    #       LOSS TRANS...  LOSS DEV_B...  SENTS_F  SENTS_P  SENTS_R  CATS_SCORE  SCORE 
---  ------  -------------  -------------  -------  -------  -------  ----------  ------
⚠ Aborting and saving the final best model. Encountered exception:
ValueError("[E030] Sentence boundaries unset. You can add the 'sentencizer'
component to the pipeline with: `nlp.add_pipe('sentencizer')`. Alternatively,
add the dependency parser or sentence recognizer, or set sentence boundaries by
setting `doc[i].is_sent_start`.")
Traceback (most recent call last):
  File "/home/nenashevas/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/nenashevas/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/nenashevas/.local/share/virtualenvs/bert-clauses-classifier-t7CyBQ2n/lib/python3.7/site-packages/spacy/__main__.py", line 4, in <module>
    setup_cli()
  File "/home/nenashevas/.local/share/virtualenvs/bert-clauses-classifier-t7CyBQ2n/lib/python3.7/site-packages/spacy/cli/_util.py", line 68, in setup_cli
    command(prog_name=COMMAND)
  File "/home/nenashevas/.local/share/virtualenvs/bert-clauses-classifier-t7CyBQ2n/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/nenashevas/.local/share/virtualenvs/bert-clauses-classifier-t7CyBQ2n/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/nenashevas/.local/share/virtualenvs/bert-clauses-classifier-t7CyBQ2n/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/nenashevas/.local/share/virtualenvs/bert-clauses-classifier-t7CyBQ2n/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/nenashevas/.local/share/virtualenvs/bert-clauses-classifier-t7CyBQ2n/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/nenashevas/.local/share/virtualenvs/bert-clauses-classifier-t7CyBQ2n/lib/python3.7/site-packages/typer/main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "/home/nenashevas/.local/share/virtualenvs/bert-clauses-classifier-t7CyBQ2n/lib/python3.7/site-packages/spacy/cli/train.py", line 59, in train_cli
    train(nlp, output_path, use_gpu=use_gpu, stdout=sys.stdout, stderr=sys.stderr)
  File "/home/nenashevas/.local/share/virtualenvs/bert-clauses-classifier-t7CyBQ2n/lib/python3.7/site-packages/spacy/training/loop.py", line 114, in train
    raise e
  File "/home/nenashevas/.local/share/virtualenvs/bert-clauses-classifier-t7CyBQ2n/lib/python3.7/site-packages/spacy/training/loop.py", line 98, in train
    for batch, info, is_best_checkpoint in training_step_iterator:
  File "/home/nenashevas/.local/share/virtualenvs/bert-clauses-classifier-t7CyBQ2n/lib/python3.7/site-packages/spacy/training/loop.py", line 195, in train_while_improving
    subbatch, drop=dropout, losses=losses, sgd=False, exclude=exclude
  File "/home/nenashevas/.local/share/virtualenvs/bert-clauses-classifier-t7CyBQ2n/lib/python3.7/site-packages/spacy/language.py", line 1109, in update
    proc.update(examples, sgd=None, losses=losses, **component_cfg[name])
  File "/home/nenashevas/.local/share/virtualenvs/bert-clauses-classifier-t7CyBQ2n/lib/python3.7/site-packages/spacy_transformers/pipeline_component.py", line 286, in update
    trf_full, bp_trf_full = self.model.begin_update(docs)
  File "/home/nenashevas/.local/share/virtualenvs/bert-clauses-classifier-t7CyBQ2n/lib/python3.7/site-packages/thinc/model.py", line 306, in begin_update
    return self._func(self, X, is_train=True)
  File "/home/nenashevas/.local/share/virtualenvs/bert-clauses-classifier-t7CyBQ2n/lib/python3.7/site-packages/spacy_transformers/layers/transformer_model.py", line 123, in forward
    nested_spans = get_spans(docs)
  File "/home/nenashevas/.local/share/virtualenvs/bert-clauses-classifier-t7CyBQ2n/lib/python3.7/site-packages/spacy_transformers/span_getters.py", line 48, in get_sent_spans
    return [list(doc.sents) for doc in docs]
  File "/home/nenashevas/.local/share/virtualenvs/bert-clauses-classifier-t7CyBQ2n/lib/python3.7/site-packages/spacy_transformers/span_getters.py", line 48, in <listcomp>
    return [list(doc.sents) for doc in docs]
  File "spacy/tokens/doc.pyx", line 856, in sents
ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: `nlp.add_pipe('sentencizer')`. Alternatively, add the dependency parser or sentence recognizer, or set sentence boundaries by setting `doc[i].is_sent_start`.

Process finished with exit code 1

I noticed that the sentencizer is not called here on the example's predicted instance because it has no update method.

Please advise a workaround for this issue.


adrianeboyd commented 3 years ago

The workaround is to annotate your training corpus with the sentencizer in advance:

import spacy
from spacy.tokens import DocBin

# Blank pipeline with only the sentencizer.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Load the serialized training docs, set sentence boundaries on them,
# and write them back out to a new file.
docs = DocBin().from_disk("train.spacy").get_docs(nlp.vocab)
docs = nlp.get_pipe("sentencizer").pipe(docs)
new_db = DocBin(docs=docs)
new_db.to_disk("train_with_sents.spacy")
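
To use it, point paths.train in the config at train_with_sents.spacy (and repeat the same for the dev corpus).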

It is a known issue that pipelines where one component depends on the annotation from an earlier component aren't supported at all in the current training setup. We're planning to add a [training] option to support this in v3.1.
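
For later readers: the setting that eventually shipped in spaCy v3.1 for this is annotating_components; a sketch of how it looks in a v3.1+ config (not available in the version used in this thread):

[training]
annotating_components = ["sentencizer"]

Components listed there run and set their annotation on the predicted docs during training, so downstream components like the transformer can rely on it.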

In the provided pipelines we use strided_spans instead of sent_spans, so sentence annotation isn't required during training; that could also be an alternative for you.
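
A sketch of that change in this config (window and stride here are the values used in the provided pipelines, assumed rather than tuned for this task):

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96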

ANenashev commented 3 years ago

Thank you for the quick reply!

Unfortunately, this method is not working for me. I see that sentence boundaries are set in the reference doc of the Example instance but are missing in the predicted doc. I'll try strided spans for now.
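
A quick way to see the mismatch (a minimal sketch; train_with_sents.spacy is the file produced by the workaround above):

import spacy
from spacy.training import Corpus

nlp = spacy.blank("en")
corpus = Corpus("train_with_sents.spacy")
example = next(iter(corpus(nlp)))
# The reference doc carries the boundaries stored in the data ...
print(example.reference.has_annotation("SENT_START"))  # True
# ... but the predicted doc is a bare re-tokenization without them.
print(example.predicted.has_annotation("SENT_START"))  # False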

adrianeboyd commented 3 years ago

Ah, sorry, you're right. You'd have to customize the corpus reader instead so that it adds the sentence boundaries to example.predicted in the examples.

(Edited: I should say more clearly I hope that would work, but I haven't tested it.)
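
A minimal sketch of what such a reader could look like (untested, per the note above; the registry name sentencized_corpus.v1 and the wrapper function are illustrative):

import spacy
from spacy.training import Corpus

@spacy.registry.readers("sentencized_corpus.v1")
def create_sentencized_corpus(path: str):
    # Wrap the standard corpus reader ...
    corpus = Corpus(path)
    # ... plus a standalone sentencizer for the predicted docs.
    sentencizer = spacy.blank("en").add_pipe("sentencizer")

    def read_examples(nlp):
        for example in corpus(nlp):
            # Set boundaries on example.predicted in place so that
            # sent_spans.v1 can split it during training.
            sentencizer(example.predicted)
            yield example

    return read_examples

Put this in the file passed with -c and point the corpus config at it:

[corpora.train]
@readers = "sentencized_corpus.v1"
path = ${paths.train}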

ANenashev commented 3 years ago

The customized corpus reader works for me. Thank you, @adrianeboyd!

github-actions[bot] commented 3 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.