Actually,

max_length=self.model.config.max_position_embeddings - current_generation_args["max_new_tokens"] - 8,

fails if max_position_embeddings is not present in the model config. This likely happens when a model uses relative position embeddings. To avoid this, we could instead do:
# set the maximum length of the input text for the model
# (Dataset comes from the Hugging Face datasets library: from datasets import Dataset)
max_position_embeddings = (
    self.model.config.max_position_embeddings
    if hasattr(self.model.config, "max_position_embeddings")
    else None
)
max_length = (
    max_position_embeddings - current_generation_args["max_new_tokens"] - 8
    if max_position_embeddings
    else None
)

# process the input text
dataset = Dataset.from_dict({"text": texts})
dataset = dataset.map(
    lambda x: self.tokenizer(
        x["text"],
        truncation=True,
        max_length=max_length,
    ),
    batched=True,
    remove_columns=["text"],
    desc="Tokenizing texts",
)
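If the config lacks max_position_embeddings, the tokenizer's model_max_length could serve as a further fallback. This is not part of the original suggestion, just a sketch assuming a standard Hugging Face tokenizer; VERY_LARGE_INTEGER is the sentinel value transformers uses when no limit is configured:

from transformers.tokenization_utils_base import VERY_LARGE_INTEGER

# Prefer the config limit; fall back to the tokenizer's own limit when it is
# a real value rather than the "unset" sentinel; otherwise skip truncation.
context_limit = getattr(self.model.config, "max_position_embeddings", None)
if context_limit is None and self.tokenizer.model_max_length < VERY_LARGE_INTEGER:
    context_limit = self.tokenizer.model_max_length
max_length = (
    context_limit - current_generation_args["max_new_tokens"] - 8
    if context_limit
    else None
)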
Hello,

Truncation of the input_ids during tokenization, i.e., line 336, does not work properly: a warning is thrown at tokenization time, and then another in the generation loop.

I suggest replacing

lambda x: self.tokenizer(x["text"], truncation=True)

with a tokenizer call that passes an explicit max_length (as in the snippet above), and modifying the _prepare_generation_args method accordingly. I can do a PR if needed :)
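For context, here is a standalone sketch of why truncation=True alone can be insufficient (the model name is just an example; the exact behavior depends on the checkpoint's tokenizer config):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# With truncation=True and no max_length, the tokenizer falls back to
# tokenizer.model_max_length. For gpt2 this is 1024, so truncation works;
# for checkpoints where model_max_length is left unset (a very large
# sentinel), nothing is truncated and a warning is emitted instead.
enc = tokenizer("long input " * 5000, truncation=True)
print(len(enc["input_ids"]))

# Passing an explicit max_length makes the behavior deterministic:
enc = tokenizer("long input " * 5000, truncation=True, max_length=512)
print(len(enc["input_ids"]))  # 512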