Open steenfuentes opened 1 day ago
Yeah this sounds great, I agree with your rationale that it's quite useful to operate directly on streams. Feel free to open a PR contributing this.
https://github.com/Future-House/paper-qa/issues/599 is a related idea: it discusses using `fsspec` to support general filesystems, as opposed to just local ones.
Feature Request: Add Support for Document Ingestion via Byte Streams
Current Behavior
Currently, paperQA's document reading/processing machinery (`read_doc` and related parsing functions) only supports file path-based document ingestion. This requires documents to be saved to the filesystem (or requires use of the `tempfile` module as a workaround) before they can be processed.
Proposed Feature
Add support for processing documents directly from byte streams (`bytes`/`BytesIO`) in addition to the existing file path functionality. This would allow documents held in memory to be processed without the overhead of writing them to the filesystem first.
This would make paper-qa more amenable to integration into web applications and reduce I/O overhead in high-throughput scenarios.
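For context, the `tempfile` workaround mentioned under Current Behavior looks roughly like the sketch below. Note that `parse_from_path` is a hypothetical stand-in for any parser that only accepts a file path; it is not paper-qa's actual API, and the `.pdf` default suffix is just an assumption for illustration.

```python
import os
import tempfile
from typing import Callable, TypeVar

T = TypeVar("T")


def parse_bytes_via_tempfile(
    data: bytes,
    parse_from_path: Callable[[str], T],  # hypothetical path-only parser
    suffix: str = ".pdf",  # assumed default; parsers often dispatch on extension
) -> T:
    """Write in-memory bytes to a temporary file so a path-only parser can read them."""
    # delete=False so the file can be reopened by name on every platform;
    # we remove it ourselves once parsing is done.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(data)
        tmp_path = tmp.name
    try:
        return parse_from_path(tmp_path)
    finally:
        os.remove(tmp_path)
```

Every call pays for a write, a reopen, and a delete, which is the extra I/O this request aims to avoid.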
Implementation Proposal
Extend `read_doc` and the parsing functions to accept `Union[str, os.PathLike, bytes, BinaryIO]`.
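One possible shape for this, as a minimal sketch rather than a proposed final design: normalize every supported input to a binary stream up front, so the downstream parsers only ever deal with a file-like object. The names `DocSource` and `open_doc_source` are hypothetical and do not exist in paper-qa today.

```python
import io
import os
from contextlib import contextmanager
from typing import BinaryIO, Iterator, Union

# Hypothetical alias for the input types listed in this proposal.
DocSource = Union[str, os.PathLike, bytes, BinaryIO]


@contextmanager
def open_doc_source(source: DocSource) -> Iterator[BinaryIO]:
    """Yield a readable binary stream for a path, raw bytes, or an open stream."""
    if isinstance(source, (bytes, bytearray)):
        # Raw bytes: wrap them in an in-memory stream.
        yield io.BytesIO(source)
    elif isinstance(source, (str, os.PathLike)):
        # Filesystem path: open the file and close it when the block exits.
        with open(source, "rb") as f:
            yield f
    elif hasattr(source, "read"):
        # Already a stream: the caller owns it, so it is not closed here.
        yield source
    else:
        raise TypeError(f"Unsupported document source: {type(source).__name__}")
```

A parser could then be written as `with open_doc_source(src) as stream: ...` regardless of what the caller passed in. Streams supplied by the caller are left open on exit, so ownership stays with whoever created them.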
Would love to hear thoughts on this feature request and happy to provide additional details. Would love to contribute to the implementation if the community is a fan of this idea.