0xPlaygrounds / rig

A library for developing LLM-powered Rust applications.
https://rig.rs
MIT License

feat(loaders): document loader pdf #26

Open Tachikoma000 opened 6 days ago

Tachikoma000 commented 6 days ago

Add PDF Loader to Document Loaders

This PR implements a PDF loader as part of the document loaders module in Rig. It allows users to easily load and process PDF documents for use in RAG systems and other document processing tasks.

Changes

Implementation Details

The PdfLoader uses the lopdf crate to parse PDF files and extract their text content. It handles errors such as a missing file or a parse failure. The extracted text is then converted into DocumentEmbeddings for further processing in Rig.
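To make the error handling described above concrete, here is a minimal, standard-library-only sketch of how a loader might separate "file could not be read" from "file could not be parsed". It is not the code in this PR: PdfLoaderError, TextExtractor, LoadedDocument, StubExtractor, and load_pdf are all hypothetical names, the parsing backend (lopdf in the PR) is hidden behind a trait so the sketch compiles without external crates, and the header check is a deliberately simplistic stand-in for real parsing.

```rust
use std::fmt;
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical error type; the PR's actual error type may differ.
#[derive(Debug)]
pub enum PdfLoaderError {
    /// The file could not be read (for example, it does not exist).
    Io(io::Error),
    /// The bytes were read but could not be parsed as a PDF.
    Parse(String),
}

impl fmt::Display for PdfLoaderError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PdfLoaderError::Io(e) => write!(f, "failed to read PDF: {}", e),
            PdfLoaderError::Parse(msg) => write!(f, "failed to parse PDF: {}", msg),
        }
    }
}

impl std::error::Error for PdfLoaderError {}

impl From<io::Error> for PdfLoaderError {
    fn from(e: io::Error) -> Self {
        PdfLoaderError::Io(e)
    }
}

/// Hypothetical seam for the parsing backend (lopdf in the PR).
pub trait TextExtractor {
    fn extract_text(&self, bytes: &[u8]) -> Result<String, String>;
}

/// Stand-in backend that returns fixed text; a real one would call a PDF parser.
pub struct StubExtractor;

impl TextExtractor for StubExtractor {
    fn extract_text(&self, _bytes: &[u8]) -> Result<String, String> {
        Ok("stub text".to_string())
    }
}

/// Hypothetical loader output: the source path as an id, plus the extracted text.
#[derive(Debug, PartialEq)]
pub struct LoadedDocument {
    pub id: String,
    pub text: String,
}

/// Reads the file, rejects input without a leading `%PDF-` header,
/// then delegates text extraction to the backend.
pub fn load_pdf<E: TextExtractor>(
    path: &Path,
    extractor: &E,
) -> Result<LoadedDocument, PdfLoaderError> {
    // A missing or unreadable file becomes PdfLoaderError::Io via `From`.
    let bytes = fs::read(path)?;

    // Simplistic sanity check; a real parser performs far more validation.
    if !bytes.starts_with(b"%PDF-") {
        return Err(PdfLoaderError::Parse("missing %PDF- header".to_string()));
    }

    // Backend failures are reported as parse errors.
    let text = extractor
        .extract_text(&bytes)
        .map_err(PdfLoaderError::Parse)?;

    Ok(LoadedDocument {
        id: path.display().to_string(),
        text,
    })
}
```

Keeping the two failure modes as separate variants lets callers retry or skip on a missing file while reporting a corrupt PDF differently. The conversion of the extracted text into DocumentEmbeddings is left out here because it depends on Rig's own types.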

Testing

Unit tests have been added to ensure the PdfLoader correctly loads PDF files and handles various edge cases. The tests cover:

Documentation

Code files are commented, and some docstrings have been added.

Related Issue

Closes #24

Checklist

Additional Notes

This implementation focuses on text extraction from PDFs. Future enhancements could include handling PDFs with complex layouts or embedded images.

0xMochan commented 5 days ago

I haven't done a formal review but I noticed a couple of things when I looked over this!

These are just some first-thought comments; I think stopher will do the main review of this. Looking forward to trying this out locally!