Future-House / paper-qa

High accuracy RAG for answering questions from scientific documents with citations
Apache License 2.0

Feature Request: Support for file ingestion via byte-streams #735

Open steenfuentes opened 1 day ago

steenfuentes commented 1 day ago

Feature Request: Add Support for Document Ingestion via Byte Streams

Current Behavior

Currently, paper-qa's document reading/processing machinery (read_doc and related parsing functions) only supports file path-based document ingestion. Documents must therefore be saved to the filesystem (or written out with the tempfile module as a workaround) before they can be processed.
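For context, the workaround mentioned above looks roughly like the sketch below. The helper name read_via_tempfile and the path_reader parameter are hypothetical and only stand in for whatever path-only reader is being called; this is not paper-qa's API.

```python
import tempfile
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def read_via_tempfile(data: bytes, suffix: str, path_reader: Callable[[Path], T]) -> T:
    """Write in-memory bytes to a temporary file so a path-only reader can use them.

    `path_reader` is a hypothetical stand-in for any function that only accepts a path.
    """
    # A temporary directory (rather than an open NamedTemporaryFile) keeps this
    # portable: the file is closed before the reader opens it by path.
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_path = Path(tmp_dir) / f"document{suffix}"
        tmp_path.write_bytes(data)
        # The reader must finish before the directory is cleaned up on exit.
        return path_reader(tmp_path)
```

Every call pays for a write, a re-read, and a cleanup, which is the overhead this request aims to remove.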

Proposed Feature

Add support for processing documents directly from byte streams (bytes/BytesIO) in addition to the existing file path functionality. This would allow documents to be processed from memory without the overhead of filesystem operations.

This would make paper-qa more amenable to integration into web applications and reduce I/O overhead in high-throughput scenarios.

Implementation Proposal

  1. Extending the input types for read_doc and parsing functions to accept Union[str, os.PathLike, bytes, BinaryIO]
  2. Utilizing PyMuPDF's existing stream support for PDF processing
  3. Adding stream handling for text-based formats with proper encoding management
  4. Implementing document type detection for stream inputs
  5. Maintaining backward compatibility with existing file path functionality
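To make the steps above concrete, here is a minimal sketch of how the input handling could look. All names (DocSource, load_bytes, sniff_doc_type, decode_text, pdf_page_texts) are hypothetical and not part of paper-qa; the type detection is a simple magic-byte heuristic, and the PDF helper assumes PyMuPDF is installed and uses its stream-based open.

```python
import io
import os
from typing import BinaryIO, Union

# Hypothetical alias for the widened input type from step 1.
DocSource = Union[str, os.PathLike, bytes, BinaryIO]


def load_bytes(source: DocSource) -> bytes:
    """Normalize a path, raw bytes, or binary stream into bytes."""
    if isinstance(source, (bytes, bytearray)):
        return bytes(source)
    if isinstance(source, (str, os.PathLike)):
        # Existing behavior: a file path is still accepted (step 5).
        with open(source, "rb") as f:
            return f.read()
    # Anything else is assumed to be a binary file-like object.
    return source.read()


def sniff_doc_type(data: bytes) -> str:
    """Guess the document type from leading bytes (heuristic, step 4)."""
    head = data[:1024].lstrip()
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head[:15].lower().startswith((b"<!doctype html", b"<html")):
        return "html"
    return "text"


def decode_text(data: bytes, encoding: str = "utf-8") -> str:
    """Decode text, replacing undecodable bytes instead of failing (step 3)."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return data.decode(encoding, errors="replace")


def pdf_page_texts(data: bytes) -> list[str]:
    """Extract per-page text from in-memory PDF bytes (step 2).

    Assumes PyMuPDF is installed; imported lazily so the rest of the
    sketch runs without it.
    """
    import fitz  # PyMuPDF

    with fitz.open(stream=data, filetype="pdf") as doc:
        return [page.get_text() for page in doc]
```

A caller would first normalize with load_bytes, then branch on sniff_doc_type. For stream inputs there is no file extension to rely on, which is why some form of content-based detection (or an explicit type argument) is needed.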

I would love to hear thoughts on this feature request and am happy to provide additional details. I would also be glad to contribute to the implementation if the community likes the idea.

jamesbraza commented 1 day ago

Yeah this sounds great, I agree with your rationale that it's quite useful to operate directly on streams. Feel free to open a PR contributing this.

https://github.com/Future-House/paper-qa/issues/599 is a related idea, talking about fsspec for general filesystems as opposed to just local ones.