0xPlaygrounds / rig

A library for developing LLM-powered Rust applications.
https://rig.rs
MIT License

feat: Add PDF Loader to Document Loaders in Rig #24

Open Tachikoma000 opened 6 days ago

Tachikoma000 commented 6 days ago

Feature Request

Add a PDF Loader to the Document Loaders in Rig

Motivation

PDF is a widely used format for storing and sharing documents. Adding support for loading PDF files would significantly enhance Rig's capability to process and analyze a broader range of document types. This feature would allow users to easily incorporate PDF documents into their RAG (Retrieval-Augmented Generation) systems and other LLM tasks.

Use cases include:

  1. Extracting information from technical documentation stored in PDFs
  2. Analyzing academic papers and research reports
  3. Processing business documents and reports
  4. Incorporating legal documents into NLP workflows

Proposal

Implement a PdfLoader as part of the document_loaders module. The implementation should:

  1. Create a new file src/document_loaders/pdf.rs
  2. Implement a PdfLoader struct that implements the DocumentLoader trait
  3. Use the lopdf crate for parsing PDF files
  4. Extract text content from PDF documents
  5. Convert extracted content into DocumentEmbeddings

Basic structure:

use async_trait::async_trait;
use lopdf::Document;
use crate::embeddings::DocumentEmbeddings;
use super::DocumentLoader;

pub struct PdfLoader {
    path: String,
}

impl PdfLoader {
    pub fn new(path: &str) -> Self {
        Self { path: path.to_string() }
    }
}

#[async_trait]
impl DocumentLoader for PdfLoader {
    async fn load(&self) -> Result<Vec<DocumentEmbeddings>, Box<dyn std::error::Error + Send + Sync>> {
        // Implementation here: parse the PDF, extract its text,
        // and convert the result into DocumentEmbeddings.
        todo!()
    }
}
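
As a rough illustration of steps 4 and 5 of the proposal, the sketch below turns already-extracted page text into one record per page. It is std-only on purpose: PageRecord, LoadError, normalize, and pages_to_records are hypothetical names, and PageRecord is only a stand-in for DocumentEmbeddings, whose fields are not shown in this issue. In a real loader the (page, text) pairs would presumably come from lopdf rather than being passed in.

```rust
use std::fmt;

/// Hypothetical stand-in for Rig's `DocumentEmbeddings` (an assumption:
/// the real type's fields are not shown in this issue).
#[derive(Debug, Clone, PartialEq)]
pub struct PageRecord {
    pub id: String,
    pub page: u32,
    pub text: String,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum LoadError {
    /// No page produced any text (for example, a scanned, image-only PDF).
    NoText { path: String },
}

impl fmt::Display for LoadError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            LoadError::NoText { path } => write!(f, "no extractable text in {}", path),
        }
    }
}

impl std::error::Error for LoadError {}

/// Collapse runs of whitespace so that line breaks inside extracted
/// PDF text do not fragment sentences.
fn normalize(text: &str) -> String {
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Turn `(page_number, raw_text)` pairs into one record per non-empty page.
pub fn pages_to_records(
    path: &str,
    pages: &[(u32, String)],
) -> Result<Vec<PageRecord>, LoadError> {
    let records: Vec<PageRecord> = pages
        .iter()
        .map(|(page, raw)| (*page, normalize(raw)))
        .filter(|(_, text)| !text.is_empty()) // skip blank pages
        .map(|(page, text)| PageRecord {
            id: format!("{}#page={}", path, page),
            page,
            text,
        })
        .collect();

    if records.is_empty() {
        return Err(LoadError::NoText { path: path.to_string() });
    }
    Ok(records)
}
```

Returning an error when no page yields text is one possible design choice; it makes image-only PDFs visible to the caller instead of silently producing an empty result.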

Additional considerations:

Alternatives

  1. Use a different PDF parsing library: We could use another crate such as pdf-extract or pdf-rs instead of lopdf. However, lopdf appears to offer a good balance of features and performance for text extraction.

  2. Implement PDF parsing from scratch: This would give us more control but would be time-consuming and potentially error-prone.

  3. Use external tools: We could call external command-line tools such as pdftotext from Rust. This would be simpler to implement, but the tool would have to be installed on every system that runs Rig, and spawning external processes on user-supplied files introduces potential security risks.

  4. Defer PDF support to users: We could provide a trait for document loading and let users implement PDF support themselves. This would be simpler for Rig but would push complexity to the users.
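
To make alternative 4 concrete, here is a minimal sketch of the trait-only approach. The trait below is a simplified, synchronous stand-in (an assumption: the DocumentLoader in the proposal is async and returns DocumentEmbeddings), and PreExtractedPdf and load_all are hypothetical names. The PDF parsing a user would have to write is elided and replaced by pre-extracted page text.

```rust
use std::error::Error;

/// Simplified, synchronous stand-in for a loader trait (hypothetical).
pub trait DocumentLoader {
    fn load(&self) -> Result<Vec<String>, Box<dyn Error + Send + Sync>>;
}

/// What a user-side PDF loader might look like. The actual parsing is
/// elided: `pages` holds text the user has already extracted.
pub struct PreExtractedPdf {
    pub pages: Vec<String>,
}

impl DocumentLoader for PreExtractedPdf {
    fn load(&self) -> Result<Vec<String>, Box<dyn Error + Send + Sync>> {
        Ok(self.pages.clone())
    }
}

/// Library-side code depends only on the trait, so any user-defined
/// loader can be plugged in. The first failing loader aborts the run.
pub fn load_all(
    loaders: &[Box<dyn DocumentLoader>],
) -> Result<Vec<String>, Box<dyn Error + Send + Sync>> {
    let mut docs = Vec::new();
    for loader in loaders {
        docs.extend(loader.load()?);
    }
    Ok(docs)
}
```

The library side stays small here, but every user who needs PDFs has to supply the parsing that PreExtractedPdf skips, which is the complexity this alternative pushes outward.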

The proposed solution (using lopdf) was chosen because it balances functionality, ease of implementation, and integration with the Rust ecosystem. It keeps the implementation within Rig, giving users a cohesive experience that does not depend on external tools.
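
For comparison, the external-tool route from alternative 3 might look roughly like the sketch below. It assumes a pdftotext binary is available on PATH; pdftotext_command and run_pdftotext are hypothetical helper names. The path is passed as a single argument rather than interpolated into a shell string, which avoids shell injection but does not remove the dependency on the installed tool.

```rust
use std::io;
use std::path::Path;
use std::process::{Command, Stdio};

/// Build the `pdftotext` invocation without going through a shell.
pub fn pdftotext_command(pdf: &Path) -> Command {
    let mut cmd = Command::new("pdftotext");
    cmd.arg("-layout") // keep the physical layout of the page
        .arg(pdf) // input file, passed as one argument
        .arg("-") // "-" as the output file means stdout
        .stdin(Stdio::null());
    cmd
}

/// Run `pdftotext` and return the extracted text.
/// Fails if the binary is missing or exits with a non-zero status.
pub fn run_pdftotext(pdf: &Path) -> io::Result<String> {
    let output = pdftotext_command(pdf).output()?;
    if !output.status.success() {
        return Err(io::Error::new(
            io::ErrorKind::Other,
            format!(
                "pdftotext failed: {}",
                String::from_utf8_lossy(&output.stderr).trim()
            ),
        ));
    }
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}
```

Keeping command construction separate from execution lets the arguments be checked without the tool installed.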