explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

RuntimeError: Tensor size mismatch with certain Transformers models #7891

Closed Varun-Epi closed 2 years ago

Varun-Epi commented 3 years ago

Objective

To train a custom NER model on our own dataset using the transformer pipeline. We have 15k long documents and have tried different training settings, such as max_length values of 128, 256, and 500, but we are still getting the same error.

Configs

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = "pytorch"
seed = 0

[nlp]
lang = "en"
pipeline = ["transformer","ner"]
batch_size = 128
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.ner]
factory = "ner"
moves = null
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"

[components.transformer]
factory = "transformer"
max_batch_items = 4096
set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.transformer.model.tokenizer_config]
use_fast = true

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 256
gold_preproc = false
limit = 0
augmenter = null

[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
patience = 0
max_epochs = 7
max_steps = 0
eval_frequency = 500
frozen_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256
get_length = null

[training.logger]
@loggers = "spacy.WandbLogger.v1"
project_name = "monitor_spacy_training"
remove_config_values = ["paths.train", "paths.dev", "corpora.train.path", "corpora.dev.path"]

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 0.00005

[training.score_weights]
ents_per_type = null
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0

[pretraining]

[initialize]
vectors = null
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

Command Executed

python -m spacy train configs/config.cfg -o training/ --gpu-id 0 --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy

Traceback

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/spacy/__main__.py", line 4, in <module>
    setup_cli()
  File "/usr/local/lib/python3.6/dist-packages/spacy/cli/_util.py", line 69, in setup_cli
    command(prog_name=COMMAND)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/typer/main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "/usr/local/lib/python3.6/dist-packages/spacy/cli/train.py", line 59, in train_cli
    train(nlp, output_path, use_gpu=use_gpu, stdout=sys.stdout, stderr=sys.stderr)
  File "/usr/local/lib/python3.6/dist-packages/spacy/training/loop.py", line 114, in train
    raise e
  File "/usr/local/lib/python3.6/dist-packages/spacy/training/loop.py", line 98, in train
    for batch, info, is_best_checkpoint in training_step_iterator:
  File "/usr/local/lib/python3.6/dist-packages/spacy/training/loop.py", line 195, in train_while_improving
    subbatch, drop=dropout, losses=losses, sgd=False, exclude=exclude
  File "/usr/local/lib/python3.6/dist-packages/spacy/language.py", line 1107, in update
    proc.update(examples, sgd=None, losses=losses, **component_cfg[name])
  File "/usr/local/lib/python3.6/dist-packages/spacy_transformers/pipeline_component.py", line 286, in update
    trf_full, bp_trf_full = self.model.begin_update(docs)
  File "/usr/local/lib/python3.6/dist-packages/thinc/model.py", line 306, in begin_update
    return self._func(self, X, is_train=True)
  File "/usr/local/lib/python3.6/dist-packages/spacy_transformers/layers/transformer_model.py", line 142, in forward
    tensors, bp_tensors = transformer(wordpieces, is_train)
  File "/usr/local/lib/python3.6/dist-packages/thinc/model.py", line 288, in __call__
    return self._func(self, X, is_train=is_train)
  File "/usr/local/lib/python3.6/dist-packages/thinc/layers/pytorchwrapper.py", line 80, in forward
    Ytorch, torch_backprop = model.shims[0](Xtorch, is_train)
  File "/usr/local/lib/python3.6/dist-packages/thinc/shims/pytorch.py", line 27, in __call__
    return self.begin_update(inputs)
  File "/usr/local/lib/python3.6/dist-packages/thinc/shims/pytorch.py", line 49, in begin_update
    output = self._model(*inputs.args, **inputs.kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/models/bert/modeling_bert.py", line 956, in forward
    past_key_values_length=past_key_values_length,
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/models/bert/modeling_bert.py", line 206, in forward
    embeddings += position_embeddings
RuntimeError: The size of tensor a (693) must match the size of tensor b (512) at non-singleton dimension 1

Your Environment

svlandeg commented 3 years ago

Hm. I'm not sure yet what's going on. Just as a sanity check, could you try running the exact same thing but replacing

name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"

with

name = "roberta-base"

to see whether it runs or crashes in the same way?
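
For a quick test you shouldn't even need to edit the file: spacy train accepts dotted config overrides on the command line, so something along these lines (a sketch based on the command you ran above) should do the swap:

python -m spacy train configs/config.cfg -o training/ --gpu-id 0 --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy --components.transformer.model.name roberta-base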

Varun-Epi commented 3 years ago

Nope, it doesn't seem to have any issue with the "roberta-base" model. Do you think the model architecture could be the problem here?

svlandeg commented 3 years ago

Yes, it certainly looks like it. I can't yet say where the actual problem is located though - perhaps spacy-transformers is making some assumptions that aren't valid for all different models in the HF repo.

It could also be an issue with this specific PubMedBERT model, but then I'd expect others (also outside of spaCy) to have seen this error before, and a quick Google search didn't identify any related issues.

I'll label this as a bug for now - we'll have to investigate further.
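
A quick diagnostic (a minimal sketch, using the transformers library directly) is to compare the maximum input length the two tokenizers report. If a model's tokenizer config doesn't set model_max_length, transformers falls back to a very large placeholder value rather than the 512-token limit of the position embeddings:

from transformers import AutoTokenizer

# roberta-base declares its limit explicitly; a tokenizer whose config omits
# model_max_length reports a huge sentinel value instead of 512.
for name in ["roberta-base", "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, tok.model_max_length)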

e3oroush commented 3 years ago

I have the same problem with the allenai/scibert_scivocab_uncased model. Are there any new bug fixes?

svlandeg commented 3 years ago

This hasn't been resolved yet, no. Progress will be reported in this thread.

svlandeg commented 3 years ago

Maintenance note: potentially related issue: https://github.com/explosion/spaCy/discussions/8846

xavierfontaine commented 2 years ago

I am experiencing the same issue. This time, the model is not a third-party one: it is spaCy's own ja_core_news_trf :thinking:

Environment information

Traceback:

--------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Input In [1], in <cell line: 130>()
    126 # ===============================
    127 # Tokenization and entity classif
    128 # ===============================
    129 logger.info("Tokenization")
--> 130 spacified_texts = [batch for batch in tqdm.tqdm(nlp.pipe(texts,
    131     batch_size=batch_size),
    132     total=len(texts))
    133 ]
    138 redacted_texts = [
    139     replace_entity(
    140     doc=text,
   (...)
    146     }) for text in spacified_texts
    147 ]

File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/tqdm/std.py:1195, in tqdm.__iter__(self)
   1192 time = self._time
   1194 try:
-> 1195     for obj in iterable:
   1196         yield obj
   1197         # Update and possibly print the progressbar.
   1198         # Note: does not call self.update(1) for speed optimisation.

File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/spacy/language.py:1583, in Language.pipe(self, texts, as_tuples, batch_size, disable, component_cfg, n_process)
   1581     for pipe in pipes:
   1582         docs = pipe(docs)
-> 1583 for doc in docs:
   1584     yield doc

File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/spacy/util.py:1649, in _pipe(docs, proc, name, default_error_handler, kwargs)
   1647     if arg in kwargs:
   1648         kwargs.pop(arg)
-> 1649 for doc in docs:
   1650     try:
   1651         doc = proc(doc, **kwargs)  # type: ignore[call-arg]

File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/spacy/util.py:1649, in _pipe(docs, proc, name, default_error_handler, kwargs)
   1647     if arg in kwargs:
   1648         kwargs.pop(arg)
-> 1649 for doc in docs:
   1650     try:
   1651         doc = proc(doc, **kwargs)  # type: ignore[call-arg]

File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/spacy/util.py:1639, in _pipe(docs, proc, name, default_error_handler, kwargs)
   1631 def _pipe(
   1632     docs: Iterable["Doc"],
   1633     proc: "Pipe",
   (...)
   1636     kwargs: Mapping[str, Any],
   1637 ) -> Iterator["Doc"]:
   1638     if hasattr(proc, "pipe"):
-> 1639         yield from proc.pipe(docs, **kwargs)
   1640     else:
   1641         # We added some args for pipe that __call__ doesn't expect.
   1642         kwargs = dict(kwargs)

File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/spacy/pipeline/transition_parser.pyx:233, in pipe()

File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/spacy/util.py:1588, in minibatch(items, size)
   1586 while True:
   1587     batch_size = next(size_)
-> 1588     batch = list(itertools.islice(items, int(batch_size)))
   1589     if len(batch) == 0:
   1590         break

File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/spacy/util.py:1639, in _pipe(docs, proc, name, default_error_handler, kwargs)
   1631 def _pipe(
   1632     docs: Iterable["Doc"],
   1633     proc: "Pipe",
   (...)
   1636     kwargs: Mapping[str, Any],
   1637 ) -> Iterator["Doc"]:
   1638     if hasattr(proc, "pipe"):
-> 1639         yield from proc.pipe(docs, **kwargs)
   1640     else:
   1641         # We added some args for pipe that __call__ doesn't expect.
   1642         kwargs = dict(kwargs)

File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/spacy/pipeline/pipe.pyx:53, in pipe()

File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/spacy/util.py:1639, in _pipe(docs, proc, name, default_error_handler, kwargs)
   1631 def _pipe(
   1632     docs: Iterable["Doc"],
   1633     proc: "Pipe",
   (...)
   1636     kwargs: Mapping[str, Any],
   1637 ) -> Iterator["Doc"]:
   1638     if hasattr(proc, "pipe"):
-> 1639         yield from proc.pipe(docs, **kwargs)
   1640     else:
   1641         # We added some args for pipe that __call__ doesn't expect.
   1642         kwargs = dict(kwargs)

File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/spacy/pipeline/transition_parser.pyx:233, in pipe()

File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/spacy/util.py:1588, in minibatch(items, size)
   1586 while True:
   1587     batch_size = next(size_)
-> 1588     batch = list(itertools.islice(items, int(batch_size)))
   1589     if len(batch) == 0:
   1590         break

File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/spacy/util.py:1639, in _pipe(docs, proc, name, default_error_handler, kwargs)
   1631 def _pipe(
   1632     docs: Iterable["Doc"],
   1633     proc: "Pipe",
   (...)
   1636     kwargs: Mapping[str, Any],
   1637 ) -> Iterator["Doc"]:
   1638     if hasattr(proc, "pipe"):
-> 1639         yield from proc.pipe(docs, **kwargs)
   1640     else:
   1641         # We added some args for pipe that __call__ doesn't expect.
   1642         kwargs = dict(kwargs)

File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/spacy/pipeline/trainable_pipe.pyx:73, in pipe()

File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/spacy/util.py:1588, in minibatch(items, size)
   1586 while True:
   1587     batch_size = next(size_)
-> 1588     batch = list(itertools.islice(items, int(batch_size)))
   1589     if len(batch) == 0:
   1590         break

File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/spacy/util.py:1639, in _pipe(docs, proc, name, default_error_handler, kwargs)
   1631 def _pipe(
   1632     docs: Iterable["Doc"],
   1633     proc: "Pipe",
   (...)
   1636     kwargs: Mapping[str, Any],
   1637 ) -> Iterator["Doc"]:
   1638     if hasattr(proc, "pipe"):
-> 1639         yield from proc.pipe(docs, **kwargs)
   1640     else:
   1641         # We added some args for pipe that __call__ doesn't expect.
   1642         kwargs = dict(kwargs)

File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/spacy_transformers/pipeline_component.py:212, in Transformer.pipe(self, stream, batch_size)
    210 for indices in batch_by_length(outer_batch, self.cfg["max_batch_items"]):
    211     subbatch = [outer_batch[i] for i in indices]
--> 212     self.set_annotations(subbatch, self.predict(subbatch))
    213 yield from outer_batch

File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/spacy_transformers/pipeline_component.py:228, in Transformer.predict(self, docs)
    226     activations = FullTransformerBatch.empty(len(docs))
    227 else:
--> 228     activations = self.model.predict(docs)
    229 batch_id = TransformerListener.get_batch_id(docs)
    230 for listener in self.listeners:

File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/thinc/model.py:315, in Model.predict(self, X)
    311 def predict(self, X: InT) -> OutT:
    312     """Call the model's `forward` function with `is_train=False`, and return
    313     only the output, instead of the `(output, callback)` tuple.
    314     """
--> 315     return self._func(self, X, is_train=False)[0]

File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/spacy_transformers/layers/transformer_model.py:185, in forward(model, docs, is_train)
    181 align = get_alignment(flat_spans, wordpieces.strings, tokenizer.all_special_tokens)
    182 wordpieces, align = truncate_oversize_splits(
    183     wordpieces, align, tokenizer.model_max_length
    184 )
--> 185 model_output, bp_tensors = transformer(wordpieces, is_train)
    186 if "logger" in model.attrs:
    187     log_gpu_memory(model.attrs["logger"], "after forward")

File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/thinc/model.py:291, in Model.__call__(self, X, is_train)
    288 def __call__(self, X: InT, is_train: bool) -> Tuple[OutT, Callable]:
    289     """Call the model's `forward` function, returning the output and a
    290     callback to compute the gradients via backpropagation."""
--> 291     return self._func(self, X, is_train=is_train)

File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/thinc/layers/pytorchwrapper.py:143, in forward(model, X, is_train)
    140 convert_outputs = model.attrs["convert_outputs"]
    142 Xtorch, get_dX = convert_inputs(model, X, is_train)
--> 143 Ytorch, torch_backprop = model.shims[0](Xtorch, is_train)
    144 Y, get_dYtorch = convert_outputs(model, (X, Ytorch), is_train)
    146 def backprop(dY: Any) -> Any:

File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/thinc/shims/pytorch.py:72, in PyTorchShim.__call__(self, inputs, is_train)
     70     return self.begin_update(inputs)
     71 else:
---> 72     return self.predict(inputs), lambda a: ...

File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/thinc/shims/pytorch.py:90, in PyTorchShim.predict(self, inputs)
     88 with torch.no_grad():
     89     with torch.cuda.amp.autocast(self._mixed_precision):
---> 90         outputs = self._model(*inputs.args, **inputs.kwargs)
     91 self._model.train()
     92 return outputs

File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/torch/nn/modules/module.py:1110, in Module._call_impl(self, *input, **kwargs)
   1106 # If we don't have any hooks, we want to skip the rest of the logic in
   1107 # this function, and just call forward.
   1108 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1109         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1110     return forward_call(*input, **kwargs)
   1111 # Do not call functions when jit is used
   1112 full_backward_hooks, non_full_backward_hooks = [], []

File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/transformers/models/bert/modeling_bert.py:1011, in BertModel.forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
   1004 # Prepare head mask if needed
   1005 # 1.0 in head_mask indicate we keep the head
   1006 # attention_probs has shape bsz x n_heads x N x N
   1007 # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
   1008 # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
   1009 head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
-> 1011 embedding_output = self.embeddings(
   1012     input_ids=input_ids,
   1013     position_ids=position_ids,
   1014     token_type_ids=token_type_ids,
   1015     inputs_embeds=inputs_embeds,
   1016     past_key_values_length=past_key_values_length,
   1017 )
   1018 encoder_outputs = self.encoder(
   1019     embedding_output,
   1020     attention_mask=extended_attention_mask,
   (...)
   1028     return_dict=return_dict,
   1029 )
   1030 sequence_output = encoder_outputs[0]

File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/torch/nn/modules/module.py:1110, in Module._call_impl(self, *input, **kwargs)
   1106 # If we don't have any hooks, we want to skip the rest of the logic in
   1107 # this function, and just call forward.
   1108 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1109         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1110     return forward_call(*input, **kwargs)
   1111 # Do not call functions when jit is used
   1112 full_backward_hooks, non_full_backward_hooks = [], []

File ~/.cache/pypoetry/virtualenvs/mylib/lib/python3.9/site-packages/transformers/models/bert/modeling_bert.py:241, in BertEmbeddings.forward(self, input_ids, token_type_ids, position_ids, inputs_embeds, past_key_values_length)
    239 if self.position_embedding_type == "absolute":
    240     position_embeddings = self.position_embeddings(position_ids)
--> 241     embeddings += position_embeddings
    242 embeddings = self.LayerNorm(embeddings)
    243 embeddings = self.dropout(embeddings)

RuntimeError: The size of tensor a (617) must match the size of tensor b (512) at non-singleton dimension 1

adrianeboyd commented 2 years ago

Can you potentially provide the example text/texts and batch_size that lead to this exception?

xavierfontaine commented 2 years ago

@adrianeboyd Many thanks for your reply.

The batch-size was 2048.

I'd rule out a memory issue. Every time I checked, 2/3 of my GPU memory was empty. Also, I just ran the pipe on a dummy batch containing 2048 copies of my largest document. This went smoothly, and here again, 2/3 of my GPU memory remained empty.

Do you have pointers regarding other possible causes for the error?

I will try to identify the faulty batch and (hoping there is one) the faulty example itself.
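
One way to do that is to run the documents through the pipeline one at a time and stop at the first one that raises (a sketch; find_faulty_text is just a hypothetical helper, with nlp and texts being the pipeline and corpus from the snippet above):

def find_faulty_text(nlp, texts):
    """Return the index of the first text that makes the pipeline raise, else None."""
    for i, text in enumerate(texts):
        try:
            nlp(text)
        except RuntimeError as err:
            print(f"Document {i} fails: {err}")
            return i
    return None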

EDIT: here is a minimal reproducible example

import spacy

text = """โ– โ– โ– โ– โ– ใ‚ฎใƒƒใƒˆใƒใƒ–โ– โ– โ– โ– โ– 
-------------------------------------------------------------------------
ใƒปใ‚ฎใƒƒใƒˆใƒใƒ–๏ผšใ‚ฎใƒƒใƒˆใƒใƒ–
ใƒปใ‚ฎใƒƒใƒˆใƒใƒ–๏ผšใ‚ฎใƒƒใƒˆใƒใƒ–
-------------------------------------------------------------------------
ใƒปใ‚ฎใƒƒใƒˆใƒใƒ–๏ผšใ‚ฎใƒƒใƒˆใƒใƒ–
ใƒปใ‚ฎใƒƒใƒˆใƒใƒ–๏ผšใ‚ฎใƒƒใƒˆใƒใƒ–
-------------------------------------------------------------------------
ใƒปใ‚ฎใƒƒใƒˆใƒใƒ–๏ผšใ‚ฎใƒƒใƒˆใƒใƒ–
ใƒปใ‚ฎใƒƒใƒˆใƒใƒ–๏ผšใ‚ฎใƒƒใƒˆใƒใƒ–
-------------------------------------------------------------------------
ใƒปใ‚ฎใƒƒใƒˆใƒใƒ–๏ผšใ‚ฎใƒƒใƒˆใƒใƒ–
ใƒปใ‚ฎใƒƒใƒˆใƒใƒ–๏ผšใ‚ฎใƒƒใƒˆใƒใƒ–
ใƒปใ‚ฎใƒƒใƒˆใƒใƒ–๏ผšใ‚ฎใƒƒใƒˆใƒใƒ–
-------------------------------------------------------------------------
ใƒปใ‚ฎใƒƒใƒˆใƒใƒ–๏ผš
-------------------------------------------------------------------------
ใƒปใ‚ฎใƒƒใƒˆใƒใƒ–๏ผšใ‚ฎใƒƒใƒˆใƒใƒ–
-------------------------------------------------------------------------
"""
spacy.require_gpu()
nlp = spacy.load("ja_core_news_trf")
nlp(text)  # raises the RuntimeError shown in the traceback above
adrianeboyd commented 2 years ago

Thanks, the example is very helpful! We will look into it...

polm commented 2 years ago

@xavierfontaine It looks like this may be an issue with the tokenizer config. In the directory where the pipeline is saved locally, at the top, there should be a config.cfg file with a tokenizer settings block. Can you add the model_max_length parameter to it like below and see if that fixes things?

[components.transformer.model.tokenizer_config]
use_fast = false
word_tokenizer_type = "basic"
subword_tokenizer_type = "character"
model_max_length = 512
adrianeboyd commented 2 years ago

Just a note that editing this setting in config.cfg for a trained pipeline won't change anything because these settings are only used on initialization. It will work if you're training a new model from a config, though, and we'll add it for the next release of ja_core_news_trf.

If you want to fix the current v3.4 or earlier ja_core_news_trf in a script, you can do this:

nlp = spacy.load("ja_core_news_trf")
nlp.get_pipe("transformer").model.tokenizer.model_max_length = 512
nlp(text)

If you want to save this pipeline to a local path to be able to reload it without changing the settings again, then modify this and save with nlp.to_disk:

nlp.get_pipe("transformer").model.tokenizer.init_kwargs["model_max_length"] = 512

After you reload it from this path, it should have the right tokenizer settings. You can use spacy package if you want to package this directory back into an installable pip package.
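
Putting the two snippets together, a patch-and-save sketch could look like this (the output directory name ja_core_news_trf_fixed is only an example):

import spacy

nlp = spacy.load("ja_core_news_trf")
trf_tokenizer = nlp.get_pipe("transformer").model.tokenizer

# Patch the loaded tokenizer for the current session...
trf_tokenizer.model_max_length = 512
# ...and record the value in init_kwargs so it survives serialization.
trf_tokenizer.init_kwargs["model_max_length"] = 512

# Reload later with spacy.load("ja_core_news_trf_fixed"), or package the
# directory with `spacy package` if you need an installable pip package.
nlp.to_disk("ja_core_news_trf_fixed")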

github-actions[bot] commented 2 years ago

This issue has been automatically closed because it was answered and there was no follow-up discussion.

github-actions[bot] commented 2 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.