aws-samples / Serverless-Retrieval-Augmented-Generation-RAG-on-AWS

A full-stack serverless RAG workflow, intended for running PoCs and prototypes and for bootstrapping your MVP.
MIT No Attribution
50 stars 19 forks

Enhancement: large PDF splitting #22

Open giusedroid opened 5 months ago

giusedroid commented 5 months ago

While running some tests, we found that embedding large documents causes the system to time out. The timeout for the ingestion Lambda is set to 300 seconds. Rather than just increasing it, we would like to split large PDFs into a few predictable parts and process them in parallel. We're also artificially limiting the concurrency of the processor function to 1. We'd love to remove this limit once the locking system for LanceDB is out of beta.
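
To make the "few predictable parts" idea concrete, here is a minimal sketch of how page ranges could be computed before fanning out the work. It is not taken from this repository: the function name `split_page_ranges` and the `max_pages_per_part` parameter are hypothetical, and it assumes the page count of the PDF is already known.

```python
from typing import List, Tuple


def split_page_ranges(total_pages: int, max_pages_per_part: int) -> List[Tuple[int, int]]:
    """Split a document into contiguous, near-equal page ranges.

    Hypothetical helper (not part of this repository). Ranges are
    half-open (start, end) with zero-based page indices, and no range
    holds more than max_pages_per_part pages.
    """
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if max_pages_per_part < 1:
        raise ValueError("max_pages_per_part must be at least 1")
    if total_pages == 0:
        return []

    # Smallest number of parts that respects the per-part limit (ceiling division).
    parts = -(-total_pages // max_pages_per_part)
    # Spread pages evenly; the first `extra` parts get one additional page.
    base, extra = divmod(total_pages, parts)

    ranges: List[Tuple[int, int]] = []
    start = 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges
```

Each range could then be handed to a separate invocation of the ingestion function, so that no single invocation has to embed the whole document within the timeout. Because the split depends only on the page count and the limit, the same document always yields the same parts.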