Open Tachikoma000 opened 6 days ago
I haven't done a formal review but I noticed a couple of things when I looked over this!
- This should sit behind a `cargo feature` on the main rig library so that not all users download these deps if they install the library.
- `tokio` likely shouldn't be one of the dependencies since the library can work with multiple async backends.
- A separate crate such as `rig-pdf` (or `rig-loaders`) instead of keeping this in the main library would be more suitable.
- This puts code in a `mod.rs` file, which I think we are trying to avoid. It could be spun out into a `document.rs` file.
- We shouldn't have `println` code in our library, to avoid erroneous output when people use our library. It should be using the `tracing` module so that logging can be turned off and on at will.
- `Error` typing could be simplified using `thiserror` types for the return values.
These are just some first-thought comments; I think stopher will do a main review of this. Looking forward to trying this out locally!
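To make the `thiserror` suggestion above a bit more concrete, here is a rough sketch of the shape such an error type could take. It is written against std only so that it compiles without extra dependencies. `PdfLoaderError`, its variants, and `read_pdf_bytes` are hypothetical names for illustration, not the types in this PR. With `thiserror`, the hand-written `Display`, `Error`, and `From` impls below would instead be generated by `#[derive(Error)]` together with `#[error("...")]` and `#[from]` attributes.

```rust
use std::fmt;
use std::io;

/// Hypothetical error type for a PDF loader (names are illustrative only).
#[derive(Debug)]
pub enum PdfLoaderError {
    /// The file could not be opened or read.
    Io(io::Error),
    /// The bytes were read but could not be parsed as a PDF.
    Parse(String),
}

// With thiserror, `#[error("...")]` on each variant would generate this impl.
impl fmt::Display for PdfLoaderError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PdfLoaderError::Io(e) => write!(f, "failed to read PDF file: {e}"),
            PdfLoaderError::Parse(msg) => write!(f, "failed to parse PDF: {msg}"),
        }
    }
}

impl std::error::Error for PdfLoaderError {
    // Expose the underlying I/O error so callers can walk the error chain.
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        match self {
            PdfLoaderError::Io(e) => Some(e),
            PdfLoaderError::Parse(_) => None,
        }
    }
}

// With thiserror, `#[from]` on the `Io` field would generate this conversion.
impl From<io::Error> for PdfLoaderError {
    fn from(e: io::Error) -> Self {
        PdfLoaderError::Io(e)
    }
}

/// Reads a file, letting `?` convert `io::Error` into `PdfLoaderError`.
pub fn read_pdf_bytes(path: &str) -> Result<Vec<u8>, PdfLoaderError> {
    Ok(std::fs::read(path)?)
}
```

The main benefit is that callers get a single `Result<_, PdfLoaderError>` to match on, and the `?` operator handles the conversion from lower-level errors.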
Add PDF Loader to Document Loaders
This PR implements a PDF loader as part of the document loaders module in Rig. It allows users to easily load and process PDF documents for use in RAG systems and other document processing tasks.
Changes
- Added `PdfLoader` struct in `src/document_loaders/pdf.rs`
- Added `PdfLoader` to the `document_loaders` module
- Implemented `DocumentLoader` trait for `PdfLoader`
- Used `lopdf` crate for PDF parsing
- Updated `Cargo.toml` with the `lopdf` dependency
- `PdfLoader`
- Added `PdfLoader` usage examples
Implementation Details
The `PdfLoader` uses the `lopdf` crate to parse PDF files and extract text content. It handles potential errors such as file not found or parsing errors. The extracted text is converted into `DocumentEmbeddings` for further processing in Rig.
Testing
Unit tests have been added to ensure the `PdfLoader` correctly loads PDF files and handles various edge cases. The tests cover:
Documentation
Code files are commented and some docstrings have been added.
Related Issue
Closes #24
Checklist
Additional Notes
This implementation focuses on text extraction from PDFs. Future enhancements could include handling PDFs with complex layouts or embedded images.
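For readers who want a feel for the control flow described under Implementation Details (read the file, distinguish "file not found" from parsing errors, hand back the extracted text), here is a minimal std-only sketch. It does not reproduce the PR's code: `load_pdf`, `LoadError`, and `LoadedDocument` are hypothetical names, `LoadedDocument` merely stands in for whatever type Rig consumes, and the `extract` closure marks the spot where a real parser such as `lopdf` would plug in. No `lopdf` call is made here.

```rust
use std::fmt;
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical error type covering the failure modes the description names.
#[derive(Debug)]
pub enum LoadError {
    /// The path does not exist.
    NotFound(String),
    /// Any other I/O failure while reading the file.
    Io(io::Error),
    /// The extractor could not make sense of the bytes.
    Parse(String),
}

impl fmt::Display for LoadError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            LoadError::NotFound(p) => write!(f, "file not found: {p}"),
            LoadError::Io(e) => write!(f, "I/O error: {e}"),
            LoadError::Parse(m) => write!(f, "parse error: {m}"),
        }
    }
}

impl std::error::Error for LoadError {}

/// Hypothetical stand-in for the document type Rig consumes.
#[derive(Debug, PartialEq)]
pub struct LoadedDocument {
    pub source: String,
    pub text: String,
}

/// Reads `path` and passes the raw bytes to `extract`.
/// ASSUMPTION: `extract` is a placeholder for the real PDF text extraction.
pub fn load_pdf<F>(path: &Path, extract: F) -> Result<LoadedDocument, LoadError>
where
    F: Fn(&[u8]) -> Result<String, String>,
{
    // Map a missing file to its own variant; keep other I/O errors intact.
    let bytes = fs::read(path).map_err(|e| match e.kind() {
        io::ErrorKind::NotFound => LoadError::NotFound(path.display().to_string()),
        _ => LoadError::Io(e),
    })?;
    // Extraction failures surface as `Parse` errors.
    let text = extract(&bytes).map_err(LoadError::Parse)?;
    Ok(LoadedDocument {
        source: path.display().to_string(),
        text,
    })
}
```

Keeping extraction behind a closure also makes the error paths easy to unit test without real PDF fixtures, since a test can pass an extractor that always succeeds or always fails.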