fpreiss opened this issue 3 months ago
Hi @fpreiss, I'll look into what could be happening; maybe the `Step` adds some overhead that prevents the GPU from being saturated.
Thanks. Something else I noticed is the large size of the `pipeline.yaml` file: apparently it includes every single data point created by `LoadDataFromDicts`. From the file on Hugging Face:
```yaml
- step:
    name: step_empty_prompt_generation
    input_mappings: {}
    output_mappings: {}
    batch_size: 1000
    data:
      '0':
        instruction: '-'
      '1':
        instruction: '-'
      '2':
        instruction: '-'
      '3':
        instruction: '-'
      '4':
        instruction: '-'
      '5':
        instruction: '-'
```
Notice that the instruction string here is `-` instead of an empty string. This is due to a separate issue I had when convincing Ollama not to use the model's default instruction template.
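The size blow-up is easy to illustrate with a toy serializer (purely illustrative, not distilabel's actual serialization code, and using JSON instead of YAML): if a step's config embeds every input record inline, the dump grows linearly with the dataset.

```python
import json

def dump_step_config(data: list[dict]) -> str:
    # Hypothetical sketch: a step config that inlines every record under "data",
    # mirroring the structure seen in the pipeline.yaml excerpt above.
    step = {
        "name": "step_empty_prompt_generation",
        "batch_size": 1000,
        # every record ends up embedded in the serialized pipeline
        "data": {str(i): row for i, row in enumerate(data)},
    }
    return json.dumps(step)

small = dump_step_config([{"instruction": "-"}] * 100)
large = dump_step_config([{"instruction": "-"}] * 10_000)
print(len(small), len(large))  # the dump size scales with the number of records
```

With 500 000 rows, even a one-character instruction per record puts the serialized pipeline in the tens of megabytes.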
### Describe the bug

Generating larger datasets with `LoadDataFromDicts` leads to underutilization of the GPU during the `TextGeneration` step.

### To Reproduce

Setting `N_SAMPLES` to a small value in the code below utilizes the GPU as expected; increasing `N_SAMPLES` to something like `500000` doesn't.

### Expected behaviour

I'd expect the GPU to be fully utilized no matter how I set `N_SAMPLES`.

EDIT: With `N_SAMPLES` at 500000 I'm seeing ~40 s of loading followed by ~20 s of the GPU just idling, on a 4090 using Ollama with llama3:8b-instruct-fp16.

### Desktop

### Additional context

I'm passing empty strings to the prompt in order to reproduce parts of the Magpie paper. The dataset I have generated with this approach so far can be found at https://huggingface.co/datasets/fpreiss/Llama-3-Magpie-Air-500K-unprocessed-v0.1
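For context, a minimal sketch of what the empty-prompt input rows might look like (the dict shape follows the `pipeline.yaml` excerpt above; `N_SAMPLES`, the helper name, and the `placeholder` parameter are my own illustration, not the actual repro code):

```python
# Magpie-style setup: leave the instruction empty so the model
# generates the user turn itself.
N_SAMPLES = 500_000

def build_empty_prompts(n: int, placeholder: str = "") -> list[dict]:
    # placeholder="-" reproduces the workaround mentioned above for the
    # Ollama default-template issue; "" is the intended empty prompt.
    return [{"instruction": placeholder} for _ in range(n)]

data = build_empty_prompts(N_SAMPLES)
print(len(data), data[0])
```

A list like `data` is what gets handed to `LoadDataFromDicts` — and, per the comment above, also what ends up serialized wholesale into `pipeline.yaml`.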