lambolambert opened 2 weeks ago
Hey @lambolambert. Absolutely something we've planned on adding. It will probably start with the OpenAI Batch API, and then expand to Azure.

This would change the implementation quite a bit, though. It would make a single request with all the documents, and then you would get back a batch id. The workflow would go something like this: first, build a `.jsonl` file with one request per page/document:

```jsonl
{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o", "messages": [{"role": "system", "content": "Markdown pls."}, {"role": "user", "content": "https://s3.aws.com/my_file"}]}}
{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o", "messages": [{"role": "system", "content": "Markdown pls."}, {"role": "user", "content": "https://s3.aws.com/my_file"}]}}
```
The create call comes back with a batch object you can track:

```json
{
  "id": "batch_abc123",
  "object": "batch",
  "endpoint": "/v1/chat/completions",
  "errors": null,
  "input_file_id": "file-abc123",
  "completion_window": "24h",
  "status": "validating",
  "output_file_id": null,
  "error_file_id": null,
  "created_at": 1714508499,
  "in_progress_at": null,
  "expires_at": 1714536634,
  "completed_at": null,
  "failed_at": null,
  "expired_at": null,
  "request_counts": {
    "total": 0,
    "completed": 0,
    "failed": 0
  },
  "metadata": null
}
```
Then you'd probably need some sort of `pingForResults` function, since we would need to aggregate all the completion responses into the expected markdown format once the batch finishes. Something like the sketch below.
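A minimal sketch of that polling, assuming the OpenAI Python SDK (`ping_for_results` is just a placeholder name for the hypothetical `pingForResults`):

```python
import time

from openai import OpenAI

client = OpenAI()

def ping_for_results(batch_id: str, interval_s: int = 60) -> str:
    """Poll a batch until it reaches a terminal state, then return the raw output."""
    while True:
        batch = client.batches.retrieve(batch_id)
        if batch.status == "completed":
            # Output is a .jsonl file: one completion response per line
            return client.files.content(batch.output_file_id).text
        if batch.status in ("failed", "expired", "cancelled"):
            raise RuntimeError(f"Batch {batch_id} ended with status {batch.status}")
        time.sleep(interval_s)  # batches can take up to the 24h completion window
```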
Hi @tylermaran, exciting to hear that batch processing is on the roadmap.
Aggregating responses into a unified format (e.g., markdown or JSON) would be essential. Perhaps a handler that formats the results and stores them in a standardized way would streamline post-processing. That would also allow for automatic handling of individual file results, error logging, and even partial retries for any failed documents in the batch. The custom_id could encode the mapping (e.g., doc1_page1, doc1_page2, etc.), so the orchestrator can stitch everything back together and return it in the output class — see the sketch below.
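A rough sketch of that grouping, assuming the standard Batch API output format where each line carries a chat completion under `response.body` (`parse_batch_output` and the `doc1_page1` naming convention are hypothetical):

```python
import json
import re
from collections import defaultdict

# custom_id convention suggested above: "<doc>_page<number>", e.g. "doc1_page2"
CUSTOM_ID = re.compile(r"^(?P<doc>.+)_page(?P<page>\d+)$")

def parse_batch_output(jsonl_text: str) -> dict[str, str]:
    """Group per-page completions by document and join pages in order."""
    pages = defaultdict(dict)  # doc -> {page_number: markdown}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        result = json.loads(line)
        match = CUSTOM_ID.match(result["custom_id"])
        if match is None or result.get("error"):
            continue  # candidates for error logging / partial retry
        markdown = result["response"]["body"]["choices"][0]["message"]["content"]
        pages[match["doc"]][int(match["page"])] = markdown
    return {
        doc: "\n\n".join(md for _, md in sorted(by_page.items()))
        for doc, by_page in pages.items()
    }
```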
Having a pingForResults function could work well for managing the asynchronous nature of batch processing, especially where it's critical to track each batch job over extended periods (like the 24-hour completion window). It could regularly check for updates and retrieve results once the batch is complete.
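For what it's worth, tying the two sketches above together would then be a few lines (same hypothetical names as before):

```python
raw_output = ping_for_results("batch_abc123")  # blocks until the batch finishes
documents = parse_batch_output(raw_output)     # {"doc1": "...markdown...", ...}
print(documents["doc1"])                       # aggregated markdown, pages in order
```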
Thanks for considering this enhancement. Looking forward to seeing how it develops, and happy to help once it’s underway!
Hi @lambolambert, @tylermaran (fellow YC founder here),
We ran into this exact feature request while processing thousands of PDF files asynchronously. We initially used zerox, then realized there was a chance to use the new batch APIs, so we put together a library to do so: https://github.com/Summed-AI/parallex. Would love any feedback, and Tyler, perhaps a chance to collaborate if that's something you're interested in.
I'd like to propose adding batch processing capabilities to optimize costs when processing documents through Azure AI services. Currently, it seems each page/document requires an individual API call, which could become costly at scale.

**Current Challenge**

From what I understand, the system processes documents individually, which means:

**Benefits**