aws-samples / bedrock-claude-chat

AWS-native chatbot using Bedrock + Claude (+Mistral)
MIT No Attribution

[Feature Request] Efficient and Parallel Processing of PDF Files for Embedding Task #297

Closed statefb closed 1 month ago

statefb commented 1 month ago

Describe the solution you'd like

A solution that allows for efficient and parallel processing of PDF files during the embedding task. This solution should consider the following aspects:

  1. Differential Processing: Instead of processing all PDF files from scratch, the solution should be able to identify and process only the modified or new files, reducing redundant computations.

  2. Parallel Processing: The solution should leverage parallel processing capabilities to speed up the embedding task for multiple PDF files simultaneously.

Why the solution is needed

Currently, processing 100 PDF files with varying page counts (ranging from 10 to 100 pages) takes more than two hours, which is over 72 seconds per file on average. The current approach reprocesses every file on each run, without checking which files have changed and without using any parallelism.

By implementing differential processing and parallel processing, the overall processing time can be significantly reduced, leading to improved efficiency and faster turnaround times.

Additional context

Possible solutions

  1. Hash-based Differential Processing:

    • Store the hash values of the processed PDF files along with their embedding data.
    • During the embedding task, calculate the hash values of the PDF files and compare them with the stored hash values.
    • Process only the files whose hash values have changed, indicating a modification or a new file.

  2. Temporary Table for Differential Processing:

    • Create a temporary table to store the update information for modified or new PDF files during the embedding task.
    • After processing all files, perform a batch update on the main table using the data from the temporary table.

  3. Parallel Processing:

    • Split the PDF files into smaller batches and process them concurrently, leveraging the available computing resources.
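
The hash-based check in option 1 could look roughly like the sketch below. It is only an illustration under stated assumptions: `file_sha256`, `select_changed`, and the `stored_hashes` mapping are hypothetical names, not part of the existing code base, and the sketch assumes hashes from the previous run are available as a plain file-name-to-hash dictionary.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # Read in fixed-size chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def select_changed(paths: list[Path], stored_hashes: dict[str, str]) -> dict[str, str]:
    """Map each new or modified file name to its current hash.

    stored_hashes is a hypothetical {file name: hash} record saved
    alongside the embeddings on the previous run.
    """
    changed = {}
    for path in paths:
        current = file_sha256(path)
        # A missing entry means a new file; a different hash means a modified one.
        if stored_hashes.get(path.name) != current:
            changed[path.name] = current
    return changed
```

Only the files returned by `select_changed` would need to be embedded again; their new hashes can then be written back next to the embedding data.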

By combining these solutions, we can achieve efficient differential processing and parallel processing for the embedding task, significantly reducing the overall processing time for large numbers of PDF files.
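
As a rough sketch of how the parallel step and the batch update might fit together: the example below embeds files concurrently, stages the results in memory, and applies them to the main store in one step. `embed_pdf` is a placeholder for the real per-file work, and `embed_in_parallel`, `main_table`, and the in-memory dictionaries are hypothetical stand-ins for the actual embedding code and database tables.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def embed_pdf(name: str) -> list[float]:
    """Placeholder for the real per-file work (parse, chunk, embed)."""
    return [float(len(name))]


def embed_in_parallel(names: list[str], max_workers: int = 4):
    """Embed files concurrently; return (staged results, failures)."""
    staged: dict[str, list[float]] = {}
    failed: dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(embed_pdf, name): name for name in names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                staged[name] = future.result()
            except Exception as exc:  # record the failure and keep processing other files
                failed[name] = exc
    return staged, failed


# Hypothetical stand-in for the main embeddings table.
main_table: dict[str, list[float]] = {"old.pdf": [7.0]}
staged, failed = embed_in_parallel(["a.pdf", "report.pdf"])
if not failed:
    main_table.update(staged)  # single batch update after all files finish
```

Threads are a reasonable fit when most of the time is spent waiting on network calls such as an embedding API. If PDF parsing turns out to be CPU-bound, `ProcessPoolExecutor` from the same module may be the better choice.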