argilla-io / distilabel

Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
https://distilabel.argilla.io
Apache License 2.0

[BUG] GPU utilization depends on targeted dataset size #751

Open · fpreiss opened this issue 3 months ago

fpreiss commented 3 months ago

Describe the bug

Generating larger datasets with LoadDataFromDicts leads to underutilization of the GPU during the TextGeneration step.

To Reproduce

Setting N_SAMPLES to a small value in the code below utilizes the GPU as expected; increasing N_SAMPLES to something like 500000 does not. (The surrounding pipeline setup is sketched after the snippet.)

from itertools import repeat

from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

# Produce N_SAMPLES rows, each with an empty instruction.
step_empty_prompt_generation = LoadDataFromDicts(
    name="step_empty_prompt_generation",
    data=[
        {
            "instruction": "",
        }
        for _ in repeat("", N_SAMPLES)
    ],
    batch_size=1000,
)
# Free-run the LLM on the empty prompt and expose its output as the "input" column.
task_instruction_generation = TextGeneration(
    input_batch_size=1000,
    use_system_prompt=False,
    name="task_instruction_generation",
    llm=llm,
    output_mappings={"generation": "input"},
)
step_empty_prompt_generation >> task_instruction_generation
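
For completeness, the two steps above are created and connected inside a Pipeline context. A minimal sketch of the surrounding setup; the pipeline name is arbitrary and the OllamaLLM configuration is illustrative (the model is the one mentioned in the edit below):

from distilabel.llms import OllamaLLM
from distilabel.pipeline import Pipeline

N_SAMPLES = 500_000  # small values saturate the GPU, large values do not

with Pipeline(name="empty-prompt-generation") as pipeline:  # name is illustrative
    llm = OllamaLLM(model="llama3:8b-instruct-fp16")
    # ... the two step definitions and the `>>` connection from above go here ...

distiset = pipeline.run()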

Expected behaviour

I'd expect the GPU to be fully utilized no matter how I set N_SAMPLES.

EDIT: With N_SAMPLES at 500000, I'm seeing ~40s of GPU load followed by ~20s of the GPU just idling, on a 4090 using Ollama with llama3:8b-instruct-fp16.


Additional context

I'm passing empty strings as the prompt in order to reproduce parts of the Magpie paper. The dataset I have generated with this approach so far can be found at https://huggingface.co/datasets/fpreiss/Llama-3-Magpie-Air-500K-unprocessed-v0.1
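
For reference, Magpie works by sending the model only the pre-query part of its chat template, i.e. everything up to and including the opening of the user turn, so that it free-runs a new instruction instead of answering one. Roughly, assuming the standard Llama 3 chat format, the string the model should effectively receive looks like this:

# Magpie-style pre-query prompt for Llama 3 (sketch): the model continues from the
# open user turn and generates an instruction rather than a response.
PRE_QUERY_TEMPLATE = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"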

gabrielmbmb commented 3 months ago

Hi @fpreiss, I'll test what could be happening; maybe the Step adds some overhead that prevents the GPU from being saturated.

fpreiss commented 3 months ago

Thanks. Something else I noticed is the large size of the pipeline.yaml file. Apparently it includes every single data point created by LoadDataFromDicts. From the file on Hugging Face:

  - step:
      name: step_empty_prompt_generation
      input_mappings: {}
      output_mappings: {}
      batch_size: 1000
      data:
        '0':
          instruction: '-'
        '1':
          instruction: '-'
        '2':
          instruction: '-'
        '3':
          instruction: '-'
        '4':
          instruction: '-'
        '5':
          instruction: '-'

Notice that the instruction string here is '-' instead of an empty string. This is due to a separate issue I ran into while convincing Ollama not to use the model's default instruction template.