getomni-ai / zerox

PDF to Markdown with vision models
https://getomni.ai/ocr-demo
MIT License

[Feature Request] Consider Adding Batch Processing Support to Reduce Azure AI Costs #91

Open lambolambert opened 2 weeks ago

lambolambert commented 2 weeks ago

I'd like to propose adding batch processing capabilities to optimize costs when processing documents through Azure AI services. Currently, it seems each page/document requires individual API calls, which could become costly at scale.

Current Challenge

From what I understand, the system processes documents individually, which means each page or document requires its own API call.

Benefits

Labels: enhancement, cost-optimization

tylermaran commented 2 weeks ago

Hey @lambolambert. This is absolutely something we've planned on adding. It will probably start with the OpenAI Batch API, and then expand to Azure.

This would change the implementation quite a bit, though. It would make a single request with all the documents, and then you would get back a batch ID.

Workflow would go something like:

  1. Run zerox with a folder of files in batch mode. You will also need to pass in AWS S3 credentials.
  2. Zerox creates a .jsonl file with one request per file:
    {"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o", "messages": [{"role": "system", "content": "Markdown pls."},{"role": "user", "content": "https:s3.aws.com/my_file!"}]}}
    {"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o", "messages": [{"role": "system", "content": "Markdown pls."},{"role": "user", "content": "https:s3.aws.com/my_file!"}]}}
  3. Upload that file to OpenAI, which returns the job response:
    {
      "id": "batch_abc123",
      "object": "batch",
      "endpoint": "/v1/chat/completions",
      "errors": null,
      "input_file_id": "file-abc123",
      "completion_window": "24h",
      "status": "validating",
      "output_file_id": null,
      "error_file_id": null,
      "created_at": 1714508499,
      "in_progress_at": null,
      "expires_at": 1714536634,
      "completed_at": null,
      "failed_at": null,
      "expired_at": null,
      "request_counts": {
        "total": 0,
        "completed": 0,
        "failed": 0
      },
      "metadata": null
    }
  4. ??? You have the job ID, and results will be delivered within 24 hours. But I'm not sure what the next step would be on the zerox side. Does it make sense to have a pingForResults function, since we would need to aggregate all the completion responses into the expected markdown format? (A rough sketch of the submission side is below.)
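
For reference, steps 2–3 could look roughly like the following with the OpenAI Node SDK. This is only a sketch: the createBatchJob name, the ZeroxBatchFile shape, and the prompt text are placeholders rather than zerox's actual API, and it assumes the page images have already been uploaded to S3.

```typescript
import fs from "fs";
import OpenAI from "openai";

const openai = new OpenAI();

// Hypothetical input: one entry per page image, already uploaded to S3.
interface ZeroxBatchFile {
  customId: string; // e.g. "doc1_page1"
  imageUrl: string; // public or signed S3 URL for the page image
}

// Steps 2-3: write the .jsonl request file, upload it, and create the batch job.
async function createBatchJob(files: ZeroxBatchFile[], jsonlPath: string): Promise<string> {
  const lines = files.map((f) =>
    JSON.stringify({
      custom_id: f.customId,
      method: "POST",
      url: "/v1/chat/completions",
      body: {
        model: "gpt-4o",
        messages: [
          { role: "system", content: "Convert this page to markdown." },
          {
            role: "user",
            content: [{ type: "image_url", image_url: { url: f.imageUrl } }],
          },
        ],
      },
    })
  );
  fs.writeFileSync(jsonlPath, lines.join("\n"));

  // Upload the .jsonl, then point a batch job at it.
  const inputFile = await openai.files.create({
    file: fs.createReadStream(jsonlPath),
    purpose: "batch",
  });
  const batch = await openai.batches.create({
    input_file_id: inputFile.id,
    endpoint: "/v1/chat/completions",
    completion_window: "24h",
  });
  return batch.id; // e.g. "batch_abc123"
}
```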
lambolambert commented 2 weeks ago

Hi @tylermaran, exciting to hear that batch processing is on the roadmap!

Aggregating responses into a unified format (e.g., markdown or JSON) would be essential. Implementing a handler that formats the results and stores them in a standardized way would streamline post-processing, and would also allow for automatic handling of individual file results, error logging, and even partial retries for any failed documents in the batch. The custom_id could encode the document and page (e.g., doc1_page1, doc1_page2), which an orchestrator could then use to bring everything back together and return it in the output class.
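
As a rough sketch of that custom_id scheme (the docN_pageN id format and the aggregateBatchOutput helper are just illustrative; the output-line shape follows the OpenAI batch output format, not zerox internals):

```typescript
// Hypothetical: group batch output lines back into per-document markdown,
// keyed by a custom_id scheme like "doc1_page1", "doc1_page2", ...
interface BatchOutputLine {
  custom_id: string;
  response: { status_code: number; body: { choices: { message: { content: string } }[] } } | null;
  error: { message: string } | null;
}

function aggregateBatchOutput(outputJsonl: string): Map<string, string[]> {
  const pagesByDoc = new Map<string, { page: number; markdown: string }[]>();
  const failures: string[] = [];

  for (const raw of outputJsonl.split("\n").filter(Boolean)) {
    const line: BatchOutputLine = JSON.parse(raw);
    const match = line.custom_id.match(/^(.+)_page(\d+)$/);
    if (!match) continue;
    const [, doc, page] = match;

    if (line.error || !line.response || line.response.status_code !== 200) {
      failures.push(line.custom_id); // candidates for a partial retry
      continue;
    }

    const pages = pagesByDoc.get(doc) ?? [];
    pages.push({ page: Number(page), markdown: line.response.body.choices[0].message.content });
    pagesByDoc.set(doc, pages);
  }

  if (failures.length) console.warn("Requests to retry:", failures);

  // Re-order pages within each document: doc id -> ordered markdown pages.
  const result = new Map<string, string[]>();
  for (const [doc, pages] of pagesByDoc) {
    pages.sort((a, b) => a.page - b.page);
    result.set(doc, pages.map((p) => p.markdown));
  }
  return result;
}
```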

Having a pingForResults function could work well for managing the asynchronous nature of batch processing, especially for use cases where it's critical to track the status of each batch job over an extended period (like 24 hours). It could regularly check for updates and retrieve the results once the batch is complete.
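
A minimal sketch of what pingForResults could look like with the OpenAI Node SDK (the function name, polling interval, and error handling are placeholders, not an actual zerox API):

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Hypothetical polling helper: check the batch until it reaches a terminal
// state, then download the output .jsonl for aggregation.
async function pingForResults(batchId: string, intervalMs = 60_000): Promise<string> {
  while (true) {
    const batch = await openai.batches.retrieve(batchId);

    if (batch.status === "completed" && batch.output_file_id) {
      const output = await openai.files.content(batch.output_file_id);
      return output.text(); // raw output .jsonl, one completion per line
    }
    if (batch.status === "failed" || batch.status === "expired" || batch.status === "cancelled") {
      throw new Error(`Batch ${batchId} ended with status ${batch.status}`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

The returned .jsonl could then be handed to the aggregation step above to produce the final per-document markdown.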

Thanks for considering this enhancement. Looking forward to seeing how it develops, and happy to help once it’s underway!

kzbao commented 1 week ago

Hi @lambolambert, @tylermaran (fellow YC founder here),

We ran into this exact feature request as we're processing thousands of PDF files asynchronously. We initially used zerox and then realized there was a chance to use the new batch APIs, so we put together a library to help do so at https://github.com/Summed-AI/parallex. Would love any feedback, and Tyler, perhaps there's a chance to collaborate if that's something you're interested in.