argilla-io / distilabel

Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
https://distilabel.argilla.io
Apache License 2.0

[FEATURE] OpenAI Batch API Support #538

Open jphme opened 5 months ago

jphme commented 5 months ago

See here for the API and here for the (Twitter) announcement.

50% discount would be huge as most large jobs running on distilabel are not time-critical for us.
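For context, the Batch API consumes a JSONL file in which each line is one request tagged with a `custom_id` (so responses can be matched back to rows later). A minimal sketch of preparing such a file — the helper name `build_batch_input` and the model name are illustrative, not part of distilabel or the OpenAI SDK:

```python
import json

def build_batch_input(prompts, model="gpt-4o-mini"):
    """Serialize prompts into the JSONL lines the Batch API expects.

    Each line carries a custom_id so that responses (which may arrive
    out of order) can be matched back to the originating row.
    """
    lines = []
    for i, prompt in enumerate(prompts):
        request = {
            "custom_id": f"row-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        lines.append(json.dumps(request))
    return "\n".join(lines)

# Submitting the file would then look roughly like this (untested sketch,
# assuming the openai v1 Python SDK):
#
#   client = openai.OpenAI()
#   batch_file = client.files.create(
#       file=io.BytesIO(build_batch_input(prompts).encode()), purpose="batch"
#   )
#   batch = client.batches.create(
#       input_file_id=batch_file.id,
#       endpoint="/v1/chat/completions",
#       completion_window="24h",
#   )
```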

However, there are some architectural decisions to make around implementing this (for example, should a pipeline just loop indefinitely while waiting for results?).

If you agree that this belongs in distilabel and we flesh out how it fits into the API, our team could probably do the implementation.

gabrielmbmb commented 5 months ago

Hi @jphme!

The current architecture of distilabel assumes that LLM.generate is a blocking method that returns the generations right away.

That said, I think we could have a GeneratorStep that uploads a file with the requests and sends the request to create the batch job for that file. If the output file gets updated as each request finishes, then this step could poll the output file and yield batches with the responses for the requests that have already finished.
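A rough sketch of what such a polling step could look like. This assumes the openai v1 SDK (`client.batches.retrieve`, `client.files.content`) and the Batch API's JSONL output format; the helper names `parse_output_lines` and `poll_batch` are hypothetical, not distilabel API:

```python
import json
import time

def parse_output_lines(jsonl_text):
    """Map each finished request's custom_id to its generated text.

    The Batch API output file is JSONL; each line holds the custom_id
    of the original request plus the chat-completions response body.
    """
    results = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        body = record["response"]["body"]
        results[record["custom_id"]] = body["choices"][0]["message"]["content"]
    return results

def poll_batch(client, batch_id, interval=60):
    """Untested sketch of a polling loop: yield only newly finished
    results, and stop once the job reaches a terminal status."""
    seen = set()
    while True:
        batch = client.batches.retrieve(batch_id)
        if batch.output_file_id:
            text = client.files.content(batch.output_file_id).text
            fresh = {
                k: v
                for k, v in parse_output_lines(text).items()
                if k not in seen
            }
            if fresh:
                seen.update(fresh)
                yield fresh  # a GeneratorStep would turn this into a batch
        if batch.status in ("completed", "failed", "expired", "cancelled"):
            return
        time.sleep(interval)
```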

If the output file doesn't get updated until the whole batch job has finished, then we could instead have a LoadOpenAIBatchResults generator step that checks whether the batch job has finished and, if it has, reads the output file and yields batches of the specified size for the rest of the pipeline.
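The re-batching part of that second variant is straightforward to sketch: once the completed output file has been read into rows, chop them into lists of the size downstream steps expect. A minimal, pure-Python illustration (the name `rebatch` is illustrative, not the actual distilabel API):

```python
def rebatch(rows, batch_size):
    """Yield successive lists of at most batch_size rows, so downstream
    steps receive data in the pipeline's usual batch shape.

    In the hypothetical LoadOpenAIBatchResults step, `rows` would be the
    parsed lines of the finished batch job's output file.
    """
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch
```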