Closed · Varun-Epi closed this issue 2 years ago
Hm. I'm not sure yet what's going on. Just as a sanity check, could you try running the exact same thing but replacing
name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"
with
name = "roberta-base"
to see whether it runs or crashes in the same way?
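For anyone following along, here is a minimal sketch of how the model name can be swapped for that kind of sanity check. It assumes a blank English pipeline with only the transformer component, with every unspecified setting falling back to the spacy-transformers factory defaults; build_pipeline is just a hypothetical helper, and the original report presumably used a fuller pipeline or training config.
import spacy

def build_pipeline(name: str):
    # Blank pipeline with a single transformer component pointing at the given
    # Hugging Face model name; unspecified settings come from the factory defaults.
    nlp = spacy.blank("en")
    nlp.add_pipe("transformer", config={"model": {"name": name}})
    nlp.initialize()  # loads the Hugging Face tokenizer and weights
    return nlp

# Run the same input through both checkpoints to compare behaviour.
for name in ["microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract", "roberta-base"]:
    nlp = build_pipeline(name)
    doc = nlp("A short sanity-check sentence.")
    print(name, doc._.trf_data.tensors[-1].shape)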
Nope, it doesn't seem to have any issue with the "roberta-base" model. Do you think the model architecture could be the problem here?
Yes, it certainly looks like it. I can't yet say where the actual problem is located though - perhaps spacy-transformers
is making some assumptions that aren't valid for all different models in the HF repo.
It could also be an issue with this specific PubMedBERT model, but then I'd expect others (also outside of spaCy) to have seen this error before, and a quick Google search didn't identify any related issues.
I'll label this as a bug for now - we'll have to investigate further.
I have the same problem with the allenai/scibert_scivocab_uncased model. Any new bug fixes?
This hasn't been resolved yet, no. Progress will be reported in this thread.
Maintenance note: potentially related issue: https://github.com/explosion/spaCy/discussions/8846
I am experiencing the same issue. This time, the model is not a third-party one: it is spaCy's own ja_core_news_trf :thinking:
Environment information
Traceback:
--------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Input In [1], in <cell line: 130>()
126 # ===============================
127 # Tokenization and entity classif
128 # ===============================
129 logger.info("Tokenization")
--> 130 spacified_texts = [batch for batch in tqdm.tqdm(nlp.pipe(texts,
131 batch_size=batch_size),
132 total=len(texts))
133 ]
138 redacted_texts = [
139 replace_entity(
140 doc=text,
(...)
146 }) for text in spacified_texts
147 ]
File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/tqdm/std.py:1195, in tqdm.__iter__(self)
1192 time = self._time
1194 try:
-> 1195 for obj in iterable:
1196 yield obj
1197 # Update and possibly print the progressbar.
1198 # Note: does not call self.update(1) for speed optimisation.
File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/spacy/language.py:1583, in Language.pipe(self, texts, as_tuples, batch_size, disable, component_cfg, n_process)
1581 for pipe in pipes:
1582 docs = pipe(docs)
-> 1583 for doc in docs:
1584 yield doc
File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/spacy/util.py:1649, in _pipe(docs, proc, name, default_error_handler, kwargs)
1647 if arg in kwargs:
1648 kwargs.pop(arg)
-> 1649 for doc in docs:
1650 try:
1651 doc = proc(doc, **kwargs) # type: ignore[call-arg]
File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/spacy/util.py:1649, in _pipe(docs, proc, name, default_error_handler, kwargs)
1647 if arg in kwargs:
1648 kwargs.pop(arg)
-> 1649 for doc in docs:
1650 try:
1651 doc = proc(doc, **kwargs) # type: ignore[call-arg]
File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/spacy/util.py:1639, in _pipe(docs, proc, name, default_error_handler, kwargs)
1631 def _pipe(
1632 docs: Iterable["Doc"],
1633 proc: "Pipe",
(...)
1636 kwargs: Mapping[str, Any],
1637 ) -> Iterator["Doc"]:
1638 if hasattr(proc, "pipe"):
-> 1639 yield from proc.pipe(docs, **kwargs)
1640 else:
1641 # We added some args for pipe that __call__ doesn't expect.
1642 kwargs = dict(kwargs)
File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/spacy/pipeline/transition_parser.pyx:233, in pipe()
File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/spacy/util.py:1588, in minibatch(items, size)
1586 while True:
1587 batch_size = next(size_)
-> 1588 batch = list(itertools.islice(items, int(batch_size)))
1589 if len(batch) == 0:
1590 break
File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/spacy/util.py:1639, in _pipe(docs, proc, name, default_error_handler, kwargs)
1631 def _pipe(
1632 docs: Iterable["Doc"],
1633 proc: "Pipe",
(...)
1636 kwargs: Mapping[str, Any],
1637 ) -> Iterator["Doc"]:
1638 if hasattr(proc, "pipe"):
-> 1639 yield from proc.pipe(docs, **kwargs)
1640 else:
1641 # We added some args for pipe that __call__ doesn't expect.
1642 kwargs = dict(kwargs)
File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/spacy/pipeline/pipe.pyx:53, in pipe()
File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/spacy/util.py:1639, in _pipe(docs, proc, name, default_error_handler, kwargs)
1631 def _pipe(
1632 docs: Iterable["Doc"],
1633 proc: "Pipe",
(...)
1636 kwargs: Mapping[str, Any],
1637 ) -> Iterator["Doc"]:
1638 if hasattr(proc, "pipe"):
-> 1639 yield from proc.pipe(docs, **kwargs)
1640 else:
1641 # We added some args for pipe that __call__ doesn't expect.
1642 kwargs = dict(kwargs)
File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/spacy/pipeline/transition_parser.pyx:233, in pipe()
File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/spacy/util.py:1588, in minibatch(items, size)
1586 while True:
1587 batch_size = next(size_)
-> 1588 batch = list(itertools.islice(items, int(batch_size)))
1589 if len(batch) == 0:
1590 break
File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/spacy/util.py:1639, in _pipe(docs, proc, name, default_error_handler, kwargs)
1631 def _pipe(
1632 docs: Iterable["Doc"],
1633 proc: "Pipe",
(...)
1636 kwargs: Mapping[str, Any],
1637 ) -> Iterator["Doc"]:
1638 if hasattr(proc, "pipe"):
-> 1639 yield from proc.pipe(docs, **kwargs)
1640 else:
1641 # We added some args for pipe that __call__ doesn't expect.
1642 kwargs = dict(kwargs)
File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/spacy/pipeline/trainable_pipe.pyx:73, in pipe()
File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/spacy/util.py:1588, in minibatch(items, size)
1586 while True:
1587 batch_size = next(size_)
-> 1588 batch = list(itertools.islice(items, int(batch_size)))
1589 if len(batch) == 0:
1590 break
File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/spacy/util.py:1639, in _pipe(docs, proc, name, default_error_handler, kwargs)
1631 def _pipe(
1632 docs: Iterable["Doc"],
1633 proc: "Pipe",
(...)
1636 kwargs: Mapping[str, Any],
1637 ) -> Iterator["Doc"]:
1638 if hasattr(proc, "pipe"):
-> 1639 yield from proc.pipe(docs, **kwargs)
1640 else:
1641 # We added some args for pipe that __call__ doesn't expect.
1642 kwargs = dict(kwargs)
File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/spacy_transformers/pipeline_component.py:212, in Transformer.pipe(self, stream, batch_size)
210 for indices in batch_by_length(outer_batch, self.cfg["max_batch_items"]):
211 subbatch = [outer_batch[i] for i in indices]
--> 212 self.set_annotations(subbatch, self.predict(subbatch))
213 yield from outer_batch
File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/spacy_transformers/pipeline_component.py:228, in Transformer.predict(self, docs)
226 activations = FullTransformerBatch.empty(len(docs))
227 else:
--> 228 activations = self.model.predict(docs)
229 batch_id = TransformerListener.get_batch_id(docs)
230 for listener in self.listeners:
File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/thinc/model.py:315, in Model.predict(self, X)
311 def predict(self, X: InT) -> OutT:
312 """Call the model's `forward` function with `is_train=False`, and return
313 only the output, instead of the `(output, callback)` tuple.
314 """
--> 315 return self._func(self, X, is_train=False)[0]
File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/spacy_transformers/layers/transformer_model.py:185, in forward(model, docs, is_train)
181 align = get_alignment(flat_spans, wordpieces.strings, tokenizer.all_special_tokens)
182 wordpieces, align = truncate_oversize_splits(
183 wordpieces, align, tokenizer.model_max_length
184 )
--> 185 model_output, bp_tensors = transformer(wordpieces, is_train)
186 if "logger" in model.attrs:
187 log_gpu_memory(model.attrs["logger"], "after forward")
File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/thinc/model.py:291, in Model.__call__(self, X, is_train)
288 def __call__(self, X: InT, is_train: bool) -> Tuple[OutT, Callable]:
289 """Call the model's `forward` function, returning the output and a
290 callback to compute the gradients via backpropagation."""
--> 291 return self._func(self, X, is_train=is_train)
File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/thinc/layers/pytorchwrapper.py:143, in forward(model, X, is_train)
140 convert_outputs = model.attrs["convert_outputs"]
142 Xtorch, get_dX = convert_inputs(model, X, is_train)
--> 143 Ytorch, torch_backprop = model.shims[0](Xtorch, is_train)
144 Y, get_dYtorch = convert_outputs(model, (X, Ytorch), is_train)
146 def backprop(dY: Any) -> Any:
File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/thinc/shims/pytorch.py:72, in PyTorchShim.__call__(self, inputs, is_train)
70 return self.begin_update(inputs)
71 else:
---> 72 return self.predict(inputs), lambda a: ...
File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/thinc/shims/pytorch.py:90, in PyTorchShim.predict(self, inputs)
88 with torch.no_grad():
89 with torch.cuda.amp.autocast(self._mixed_precision):
---> 90 outputs = self._model(*inputs.args, **inputs.kwargs)
91 self._model.train()
92 return outputs
File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/torch/nn/modules/module.py:1110, in Module._call_impl(self, *input, **kwargs)
1106 # If we don't have any hooks, we want to skip the rest of the logic in
1107 # this function, and just call forward.
1108 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1109 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1110 return forward_call(*input, **kwargs)
1111 # Do not call functions when jit is used
1112 full_backward_hooks, non_full_backward_hooks = [], []
File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/transformers/models/bert/modeling_bert.py:1011, in BertModel.forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
1004 # Prepare head mask if needed
1005 # 1.0 in head_mask indicate we keep the head
1006 # attention_probs has shape bsz x n_heads x N x N
1007 # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
1008 # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
1009 head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
-> 1011 embedding_output = self.embeddings(
1012 input_ids=input_ids,
1013 position_ids=position_ids,
1014 token_type_ids=token_type_ids,
1015 inputs_embeds=inputs_embeds,
1016 past_key_values_length=past_key_values_length,
1017 )
1018 encoder_outputs = self.encoder(
1019 embedding_output,
1020 attention_mask=extended_attention_mask,
(...)
1028 return_dict=return_dict,
1029 )
1030 sequence_output = encoder_outputs[0]
File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/torch/nn/modules/module.py:1110, in Module._call_impl(self, *input, **kwargs)
1106 # If we don't have any hooks, we want to skip the rest of the logic in
1107 # this function, and just call forward.
1108 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1109 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1110 return forward_call(*input, **kwargs)
1111 # Do not call functions when jit is used
1112 full_backward_hooks, non_full_backward_hooks = [], []
File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/transformers/models/bert/modeling_bert.py:241, in BertEmbeddings.forward(self, input_ids, token_type_ids, position_ids, inputs_embeds, past_key_values_length)
239 if self.position_embedding_type == "absolute":
240 position_embeddings = self.position_embeddings(position_ids)
--> 241 embeddings += position_embeddings
242 embeddings = self.LayerNorm(embeddings)
243 embeddings = self.dropout(embeddings)
RuntimeError: The size of tensor a (617) must match the size of tensor b (512) at non-singleton dimension 1
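The numbers in the error itself point at sequence length rather than memory: a BERT-style encoder has exactly 512 learned position embeddings, and this batch produced 617 wordpieces. In the frames above, spacy-transformers truncates oversize splits to tokenizer.model_max_length, so one quick check (a sketch, not a confirmed diagnosis for these particular checkpoints) is to see what value the loaded tokenizer actually reports:
from transformers import AutoTokenizer

# If a checkpoint does not declare a maximum length, transformers falls back to
# a very large sentinel value for model_max_length, in which case nothing caps
# the wordpiece sequences at the 512 positions the model can embed.
for name in [
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract",
    "roberta-base",
]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, tok.model_max_length)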
Can you potentially provide the example text/texts and batch_size that led to this exception?
@adrianeboyd Many thanks for your reply.
The batch size was 2048.
I'd rule out a memory issue. Every time I checked, 2/3 of my GPU memory was empty. Also, I just ran the pipe on a dummy batch containing 2048 copies of my largest document. This went smoothly, and here again, 2/3 of my GPU memory remained empty.
Do you have pointers regarding other possible causes for the error?
I will try to identify the faulty batch and (hoping there is one) the faulty example itself.
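One way to narrow this down is to push each document through the pipeline on its own and record which ones raise the size mismatch. A rough sketch, reusing the texts list and nlp object from the script in the traceback:
# Process texts individually so one failing document doesn't take down a whole
# batch; collect the indices that trigger the size-mismatch RuntimeError.
failing = []
for i, text in enumerate(texts):
    try:
        nlp(text)
    except RuntimeError as err:
        if "must match the size of tensor" in str(err):
            failing.append(i)
        else:
            raise
print(f"{len(failing)} failing document(s): {failing[:10]}")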
EDIT: here is a minimal reproducible example
import spacy
text = """โ โ โ โ โ ใฎใใใใโ โ โ โ โ
-------------------------------------------------------------------------
ใปใฎใใใใ๏ผใฎใใใใ
ใปใฎใใใใ๏ผใฎใใใใ
-------------------------------------------------------------------------
ใปใฎใใใใ๏ผใฎใใใใ
ใปใฎใใใใ๏ผใฎใใใใ
-------------------------------------------------------------------------
ใปใฎใใใใ๏ผใฎใใใใ
ใปใฎใใใใ๏ผใฎใใใใ
-------------------------------------------------------------------------
ใปใฎใใใใ๏ผใฎใใใใ
ใปใฎใใใใ๏ผใฎใใใใ
ใปใฎใใใใ๏ผใฎใใใใ
-------------------------------------------------------------------------
ใปใฎใใใใ๏ผ
-------------------------------------------------------------------------
ใปใฎใใใใ๏ผใฎใใใใ
-------------------------------------------------------------------------
"""
spacy.require_gpu()
nlp = spacy.load("ja_core_news_trf")
nlp(text)
Thanks, the example is very helpful! We will look into it...
@xavierfontaine It looks like this may be an issue with the tokenizer config. In the directory where the pipeline is saved locally, at the top, there should be a config.cfg file with a tokenizer settings block. Can you add the model_max_length parameter to it like below and see if that fixes things?
[components.transformer.model.tokenizer_config]
use_fast = false
word_tokenizer_type = "basic"
subword_tokenizer_type = "character"
model_max_length = 512
Just a note that editing this setting in config.cfg for a trained pipeline won't change anything, because these settings are only used on initialization. It will work if you're training a new model from a config, though, and we'll add it for the next release of ja_core_news_trf.
If you want to fix the current v3.4 or earlier ja_core_news_trf in a script, you can do this:
nlp = spacy.load("ja_core_news_trf")
nlp.get_pipe("transformer").model.tokenizer.model_max_length = 512
nlp(text)
If you want to save this pipeline to a local path so that you can reload it without changing the settings again, modify this and save with nlp.to_disk:
nlp.get_pipe("transformer").model.tokenizer.init_kwargs["model_max_length"] = 512
After you reload it from this path, it should have the right tokenizer settings. You can use spacy package if you want to package this directory back into an installable pip package.
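Putting those two steps together, a small end-to-end sketch (the local path ja_core_news_trf_fixed is just an example name):
import spacy

nlp = spacy.load("ja_core_news_trf")
tokenizer = nlp.get_pipe("transformer").model.tokenizer

# Fix the limit for the current session and record it in the tokenizer's init
# kwargs so it survives serialization, as described above.
tokenizer.model_max_length = 512
tokenizer.init_kwargs["model_max_length"] = 512

nlp.to_disk("ja_core_news_trf_fixed")  # example path

# Reload from the local path and confirm the setting stuck.
nlp_fixed = spacy.load("ja_core_news_trf_fixed")
print(nlp_fixed.get_pipe("transformer").model.tokenizer.model_max_length)  # expect 512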
This issue has been automatically closed because it was answered and there was no follow-up discussion.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Objective
To train a custom NER model on our own dataset using the transformers pipeline. We have 15k long documents and have tried different training settings, such as max_length values of 128, 256, and 500, but we are still getting the same error.
Configs
Command Executed
python -m spacy train configs/config.cfg -o training/ --gpu-id 0 --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy
Traceback
Your Environment
CUDA 11.0
torch==1.7.1+cu110
scikit-learn==0.24.1
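Given the resolution earlier in this thread, one possibility worth checking here (an assumption, since the configs and traceback are not shown above) is that the max_length being varied is a corpus-level setting rather than the tokenizer's own limit. For a BERT-style transformer with 512 position embeddings, the tokenizer block in config.cfg would need something along these lines:
[components.transformer.model.tokenizer_config]
use_fast = true
model_max_length = 512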