Open eriknovak opened 2 weeks ago
Hi @eriknovak, thanks for reporting this, let me look into this in a bit.
`add_special_tokens` should get forwarded to the model's tokenizer class, which should support that parameter. We also test this on CI/CD. Can you give more details on the bug you're getting and, if possible, the version of `transformers` you are using? It may be related to an update from the HF libraries.
In the meantime, if you are sure that's the bug, hopefully you can patch it locally in your site-packages until we can push a fix!
Hi @AjayP13, not a problem! I did patch it locally so that I can continue with my work.
I am using the following version of the `transformers` package: 4.41.1
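For anyone else reporting versions on this thread, here is a minimal sketch for printing them from inside the same virtual environment. It uses only the standard library. The helper name `installed_version` is made up for illustration, and `datadreamer.dev` is assumed to be the distribution name that DataDreamer is installed under; adjust it if your install differs.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Assumption: "datadreamer.dev" is the distribution name DataDreamer was installed under.
    for name in ("transformers", "datadreamer.dev", "torch"):
        print(f"{name}: {installed_version(name)}")
```

A package that is not installed is reported as `None` rather than raising, so the loop always prints one line per package.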
The error I get is the following:
Traceback (most recent call last):
  File "./scripts/test_datadreamer.py", line 28, in <module>
    name_data = ProcessWithPrompt(
  File "./venv/lib/python3.10/site-packages/datadreamer/steps/step.py", line 337, in __init__
    self.__setup_folder_and_resume()
  File "./venv/lib/python3.10/site-packages/datadreamer/steps/step.py", line 442, in __setup_folder_and_resume
    self.__start()
  File "./venv/lib/python3.10/site-packages/datadreamer/steps/step.py", line 451, in __start
    self._set_output(self.run())
  File "./venv/lib/python3.10/site-packages/datadreamer/steps/step.py", line 894, in _set_output
    self.__output = _output_to_dataset(
  File "./venv/lib/python3.10/site-packages/datadreamer/steps/step_output.py", line 862, in _output_to_dataset
    output = __output_to_dataset(
  File "./venv/lib/python3.10/site-packages/datadreamer/steps/step_output.py", line 559, in __output_to_dataset
    first_row = next(
  File "./venv/lib/python3.10/site-packages/datadreamer/steps/prompt/_prompt_base.py", line 105, in get_generations
    for input, prompt, generation, get_extra_columns in zip(
  File "./venv/lib/python3.10/site-packages/datadreamer/_cachable/_cachable.py", line 805, in _run_over_batches
    yield from self._run_over_batches_locked(
  File "./venv/lib/python3.10/site-packages/datadreamer/_cachable/_cachable.py", line 771, in _run_over_batches_locked
    results = self._run_over_sorted_batches(
  File "./venv/lib/python3.10/site-packages/datadreamer/_cachable/_cachable.py", line 575, in _run_over_sorted_batches
    self._adaptive_run_batch(
  File "./venv/lib/python3.10/site-packages/datadreamer/_cachable/_cachable.py", line 339, in _adaptive_run_batch
    predicted_results_sub_batch = run_batch(
  File "./venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "./venv/lib/python3.10/site-packages/datadreamer/llms/hf_transformers.py", line 460, in _run_batch
    for batch in pipe(
  File "./venv/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 263, in __call__
    return super().__call__(text_inputs, **kwargs)
  File "./venv/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1224, in __call__
    outputs = list(final_iterator)
  File "./venv/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
  File "./venv/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py", line 125, in __next__
    processed = self.infer(item, **self.params)
  File "./venv/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1150, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
  File "./venv/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 350, in _forward
    generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
  File "./venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "./venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1542, in generate
    self._validate_model_kwargs(model_kwargs.copy())
  File "./venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1157, in _validate_model_kwargs
    raise ValueError(
ValueError: The following `model_kwargs` are not used by the model: ['add_special_tokens'] (note: typos in the generate arguments will also show up in this list)
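To make the last frames of the traceback easier to follow: `add_special_tokens` ends up among the keyword arguments passed to `self.model.generate(...)`, and `generate` rejects keywords the model does not use. Below is a simplified, self-contained illustration of that kind of check. It is not the actual `transformers` implementation; `toy_forward`, `toy_generate`, and `find_unused_kwargs` are hypothetical stand-ins written only for this example.

```python
import inspect


def find_unused_kwargs(func, kwargs):
    """Return the names in `kwargs` that `func` does not accept as parameters."""
    params = inspect.signature(func).parameters
    # A function that declares **kwargs accepts any keyword, so nothing is unused.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []
    return [name for name in kwargs if name not in params]


def toy_forward(input_ids, attention_mask=None):
    """Hypothetical stand-in for a model's forward pass."""
    return input_ids


def toy_generate(**model_kwargs):
    """Hypothetical stand-in for `generate`: reject keywords the model cannot use."""
    unused = find_unused_kwargs(toy_forward, model_kwargs)
    if unused:
        raise ValueError(
            f"The following `model_kwargs` are not used by the model: {unused}"
        )
    return toy_forward(**model_kwargs)
```

Calling `toy_generate(input_ids=[1, 2], add_special_tokens=False)` raises a `ValueError` that names `add_special_tokens`, which mirrors the shape of the error above.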
I am also copying an example code where the error is thrown (using Python 3.10):
from datadreamer import DataDreamer
from datadreamer.llms import HFTransformers
from datadreamer.steps import ProcessWithPrompt, DataSource
from transformers import QuantoConfig
from datasets import Dataset

names = {"name": ["George", "Martin", "Steve"]}
sample_data = Dataset.from_dict(names).select_columns(["name"])
quantization_config = QuantoConfig(weights="int8")

with DataDreamer("./output"):
    # Load HF Transformer
    hf = HFTransformers(
        model_name="meta-llama/Meta-Llama-3-8B-Instruct",
        quantization_config=quantization_config,
        device_map="cuda:0",
        device="cuda",
    )
    # Adjacent string literals are concatenated, so the first one ends with a space.
    model_instruction = (
        "Given the provided name, generate a surname that rhymes with the name. "
        "Return only the list, nothing else."
    )
    name_data = DataSource("Names", sample_data)
    name_data = ProcessWithPrompt(
        "Generate Rhyming Names",
        inputs={"inputs": name_data.output["name"]},
        args={
            "llm": hf,
            "instruction": model_instruction,
            "max_new_tokens": 1500,
        },
        outputs={"inputs": "name", "generations": "surname"},
    )
    print(name_data.output)
`datadreamer.llms.HFTransformers` uses the `pipeline` provided by Hugging Face's `transformers` package. However, the pipeline is passed the `add_special_tokens` parameter, which is not a valid parameter (please see the docs).
The bug in question is located in the src/llms/hf_transformers.py file, specifically here: https://github.com/datadreamer-dev/DataDreamer/blob/c535dc4482906e5886e2d4009edd64d306e5dd4e/src/llms/hf_transformers.py#L464
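One possible shape for a local workaround is to separate tokenizer-only keywords from the rest before anything is forwarded to generation. The sketch below is only an illustration under that assumption. It is neither the patch used locally nor an upstream fix, and `split_kwargs` and `TOKENIZER_ONLY_KWARGS` are hypothetical names that do not exist in DataDreamer or `transformers`.

```python
# Assumption: these keywords are meant for the tokenizer, not for generation.
TOKENIZER_ONLY_KWARGS = frozenset({"add_special_tokens"})


def split_kwargs(kwargs, tokenizer_only=TOKENIZER_ONLY_KWARGS):
    """Split `kwargs` into (tokenizer_kwargs, generate_kwargs) without mutating it."""
    tokenizer_kwargs = {k: v for k, v in kwargs.items() if k in tokenizer_only}
    generate_kwargs = {k: v for k, v in kwargs.items() if k not in tokenizer_only}
    return tokenizer_kwargs, generate_kwargs


if __name__ == "__main__":
    call_kwargs = {"add_special_tokens": False, "max_new_tokens": 1500}
    tok_kwargs, gen_kwargs = split_kwargs(call_kwargs)
    print(tok_kwargs)  # keywords intended for the tokenizer
    print(gen_kwargs)  # keywords that are safe to forward to generation
```

Whether the tokenizer keywords should then be applied during preprocessing or dropped entirely depends on how the installed `transformers` version handles them, so that part is left open here.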
We found this bug when we tried to run the code for creating synthetic data using Hugging Face models, specifically `meta-llama/Meta-Llama-3-8B-Instruct`.