datadreamer-dev / DataDreamer

DataDreamer: Prompt. Generate Synthetic Data. Train & Align Models.   🤖💤
https://datadreamer.dev
MIT License

The parameter `add_special_tokens` is not a valid input parameter for transformers' `pipeline` #29

Open eriknovak opened 2 weeks ago

eriknovak commented 2 weeks ago

The datadreamer.llms.HFTransformers class uses the pipeline provided by Hugging Face's transformers package. However, the pipeline is passed the add_special_tokens parameter, which is not a valid parameter (please see the docs).

The bug in question is located in the src/llms/hf_transformers.py file, specifically here: https://github.com/datadreamer-dev/DataDreamer/blob/c535dc4482906e5886e2d4009edd64d306e5dd4e/src/llms/hf_transformers.py#L464

We found this bug when we tried to run code that creates synthetic data with Hugging Face models, specifically meta-llama/Meta-Llama-3-8B-Instruct.

AjayP13 commented 2 weeks ago

Hi @eriknovak, thanks for reporting this. Let me look into it in a bit.

add_special_tokens should get forwarded to the model's tokenizer class, which should support that parameter. We also test this on CI/CD. Can you give more details on the bug you're getting and, if possible, the version of transformers you are using? It may be related to an update in the HF libraries.
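For anyone collecting the same details, here is a minimal sketch that prints installed package versions using only the standard library. The helper name `installed_version` and the package list are assumptions made for illustration (including `datadreamer.dev` as the distribution name); adjust them to your environment.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str) -> str:
    """Return the installed version of `package`, or a placeholder if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return "not installed"


if __name__ == "__main__":
    # Assumed list of distributions relevant to this report; edit as needed.
    for name in ("datadreamer.dev", "transformers", "torch"):
        print(f"{name}=={installed_version(name)}")
```

Pasting this output alongside the traceback makes it easier to tell whether the behavior changed between library releases.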

In the meantime, if you are sure that's the bug, hopefully you can patch it locally in your site-packages until we can push a fix!

eriknovak commented 2 weeks ago

Hi @AjayP13, not a problem! I did patch it locally so that I can continue with my work.

I am using the following version of the transformers package: 4.41.1

The error I get is the following:

Traceback (most recent call last):
  File "./scripts/test_datadreamer.py", line 28, in <module>
    name_data = ProcessWithPrompt(
  File "./venv/lib/python3.10/site-packages/datadreamer/steps/step.py", line 337, in __init__
    self.__setup_folder_and_resume()
  File "./venv/lib/python3.10/site-packages/datadreamer/steps/step.py", line 442, in __setup_folder_and_resume
    self.__start()
  File "./venv/lib/python3.10/site-packages/datadreamer/steps/step.py", line 451, in __start
    self._set_output(self.run())
  File "./venv/lib/python3.10/site-packages/datadreamer/steps/step.py", line 894, in _set_output
    self.__output = _output_to_dataset(
  File "./venv/lib/python3.10/site-packages/datadreamer/steps/step_output.py", line 862, in _output_to_dataset
    output = __output_to_dataset(
  File "./venv/lib/python3.10/site-packages/datadreamer/steps/step_output.py", line 559, in __output_to_dataset
    first_row = next(
  File "./venv/lib/python3.10/site-packages/datadreamer/steps/prompt/_prompt_base.py", line 105, in get_generations
    for input, prompt, generation, get_extra_columns in zip(
  File "./venv/lib/python3.10/site-packages/datadreamer/_cachable/_cachable.py", line 805, in _run_over_batches
    yield from self._run_over_batches_locked(
  File "./venv/lib/python3.10/site-packages/datadreamer/_cachable/_cachable.py", line 771, in _run_over_batches_locked
    results = self._run_over_sorted_batches(
  File "./venv/lib/python3.10/site-packages/datadreamer/_cachable/_cachable.py", line 575, in _run_over_sorted_batches
    self._adaptive_run_batch(
  File "./venv/lib/python3.10/site-packages/datadreamer/_cachable/_cachable.py", line 339, in _adaptive_run_batch
    predicted_results_sub_batch = run_batch(
  File "./venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "./venv/lib/python3.10/site-packages/datadreamer/llms/hf_transformers.py", line 460, in _run_batch
    for batch in pipe(
  File "./venv/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 263, in __call__
    return super().__call__(text_inputs, **kwargs)
  File "./venv/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1224, in __call__
    outputs = list(final_iterator)
  File "./venv/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
  File "./venv/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py", line 125, in __next__
    processed = self.infer(item, **self.params)
  File "./venv/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1150, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
  File "./venv/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 350, in _forward
    generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
  File "./venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "./venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1542, in generate
    self._validate_model_kwargs(model_kwargs.copy())
  File "./venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1157, in _validate_model_kwargs
    raise ValueError(
ValueError: The following `model_kwargs` are not used by the model: ['add_special_tokens'] (note: typos in the generate arguments will also show up in this list)
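The traceback suggests that add_special_tokens travelled all the way to generate(), where keyword validation rejected it. A minimal, library-free sketch of that failure mode and of one possible local workaround follows. Everything in it is hypothetical: `TOKENIZER_ONLY_KWARGS`, `split_kwargs`, and `fake_generate` are illustrative stand-ins, not the transformers or DataDreamer implementation.

```python
# Hypothetical set of keyword arguments intended for the tokenizer only.
TOKENIZER_ONLY_KWARGS = frozenset({"add_special_tokens"})


def split_kwargs(kwargs: dict) -> tuple[dict, dict]:
    """Split call kwargs into (tokenizer_kwargs, generate_kwargs)."""
    tokenizer_kwargs = {k: v for k, v in kwargs.items() if k in TOKENIZER_ONLY_KWARGS}
    generate_kwargs = {k: v for k, v in kwargs.items() if k not in TOKENIZER_ONLY_KWARGS}
    return tokenizer_kwargs, generate_kwargs


def fake_generate(**model_kwargs) -> str:
    """Simplified stand-in for the validation step that raised in the traceback."""
    accepted = {"max_new_tokens", "do_sample", "temperature"}  # illustrative only
    unused = [k for k in model_kwargs if k not in accepted]
    if unused:
        raise ValueError(
            f"The following `model_kwargs` are not used by the model: {unused}"
        )
    return "ok"


# Kwargs similar to those in the failing call (values are illustrative).
call_kwargs = {"max_new_tokens": 1500, "add_special_tokens": False}

# Route tokenizer-only kwargs away from the generation call.
tokenizer_kwargs, generate_kwargs = split_kwargs(call_kwargs)
```

Passing `call_kwargs` straight to `fake_generate` raises a ValueError of the same shape as the one above, while passing only `generate_kwargs` succeeds. Whether the real fix belongs in DataDreamer or in transformers is not settled by this sketch.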

I am also including example code that triggers the error (using Python 3.10):

from datadreamer import DataDreamer
from datadreamer.llms import HFTransformers
from datadreamer.steps import ProcessWithPrompt, DataSource

from transformers import QuantoConfig
from datasets import Dataset

names = {"name": ["George", "Martin", "Steve"]}
sample_data = Dataset.from_dict(names).select_columns(["name"])

quantization_config = QuantoConfig(weights="int8")

with DataDreamer("./output"):
    # Load HF Transformer
    hf = HFTransformers(
        model_name="meta-llama/Meta-Llama-3-8B-Instruct",
        quantization_config=quantization_config,
        device_map="cuda:0",
        device="cuda",
    )

    # Adjacent string literals are concatenated, so the first part needs a trailing space.
    model_instruction = (
        "Given the provided name, generate a surname that rhymes with the name. "
        "Return only the list, nothing else."
    )

    name_data = DataSource("Names", sample_data)
    name_data = ProcessWithPrompt(
        "Generate Rhyming Names",
        inputs={"inputs": name_data.output["name"]},
        args={
            "llm": hf,
            "instruction": model_instruction,
            "max_new_tokens": 1500,
        },
        outputs={"inputs": "name", "generations": "surname"},
    )

    print(name_data.output)