Feature Request: Support for file ingestion via byte-streams

Current Behavior

Currently, paper-qa's document reading and processing machinery (read_doc and related parsing functions) supports only file-path-based ingestion. Documents must therefore be saved to the filesystem (or written out with the tempfile module as a workaround) before they can be processed.
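
For illustration, the workaround today looks roughly like the sketch below. Note that `with_temp_file` and `process_path` are hypothetical names: `process_path` stands in for any path-based reader (such as read_doc), and nothing is assumed about its real signature.

```python
import os
import tempfile
from typing import Callable, TypeVar

T = TypeVar("T")


def with_temp_file(data: bytes, suffix: str, process_path: Callable[[str], T]) -> T:
    """Write in-memory bytes to a temporary file, run a path-based reader, clean up.

    `process_path` is a hypothetical placeholder for a path-only reader.
    """
    # delete=False so the file can be reopened by name on every platform.
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        tmp.write(data)
        tmp.close()  # flush to disk before the reader opens the path
        return process_path(tmp.name)
    finally:
        tmp.close()  # no-op if already closed
        os.unlink(tmp.name)  # always remove the temporary copy
```

Every document takes an extra write, read, and delete on disk, which is the overhead this request aims to remove.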

Proposed Feature

Add support for processing documents directly from byte streams (bytes/BytesIO) in addition to the existing file path functionality. This would allow documents to be processed from memory without the overhead of filesystem operations.

This would make paper-qa more amenable to integration into web applications and reduce I/O overhead in high-throughput scenarios.

Implementation Proposal

Extending the input types for read_doc and parsing functions to accept Union[str, os.PathLike, bytes, BinaryIO]
Utilizing PyMuPDF's existing stream support for PDF processing
Adding stream handling for text-based formats with proper encoding management
Implementing document type detection for stream inputs
Maintaining backward compatibility with existing file path functionality
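
A minimal sketch of how the input handling above could look, using only the standard library. All names here (`DocInput`, `load_bytes`, `detect_type`, `decode_text`) are hypothetical and not part of paper-qa. The detection heuristics (PDF magic bytes, file extension, a simple HTML sniff, a UTF-8 check) and the latin-1 fallback are assumptions made for illustration.

```python
import os
from typing import BinaryIO, Optional, Tuple, Union

# Hypothetical alias for the proposed input types.
DocInput = Union[str, os.PathLike, bytes, BinaryIO]


def load_bytes(source: DocInput) -> Tuple[bytes, Optional[str]]:
    """Return (raw bytes, file name if known) for any supported input."""
    if isinstance(source, (bytes, bytearray)):
        return bytes(source), None
    if isinstance(source, (str, os.PathLike)):
        # Existing behavior: treat str / PathLike as a filesystem path.
        path = os.fspath(source)
        with open(path, "rb") as fh:
            return fh.read(), path
    # Otherwise assume a binary file-like object (BytesIO, open(..., "rb")).
    data = source.read()
    if not isinstance(data, bytes):
        raise TypeError("stream must be opened in binary mode")
    name = getattr(source, "name", None)
    return data, name if isinstance(name, str) else None


def detect_type(data: bytes, name: Optional[str] = None) -> str:
    """Guess the document type from magic bytes, then name, then content."""
    if data.startswith(b"%PDF-"):  # PDF files begin with this header
        return "pdf"
    if name:
        ext = os.path.splitext(name)[1].lstrip(".").lower()
        if ext:
            return ext
    head = data[:256].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "html"
    try:
        data.decode("utf-8")
        return "txt"
    except UnicodeDecodeError:
        return "unknown"


def decode_text(data: bytes) -> str:
    """Decode text input, trying UTF-8 first and falling back to latin-1."""
    try:
        return data.decode("utf-8-sig")  # also strips a UTF-8 BOM if present
    except UnicodeDecodeError:
        return data.decode("latin-1")  # never fails; an assumed fallback
```

With the bytes in hand, the PDF branch could pass them to PyMuPDF's `open(stream=data, filetype="pdf")` rather than a path, while text formats would go through `decode_text`. Because str and PathLike inputs still take the path branch, existing callers would be unaffected.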

I would love to hear thoughts on this feature request and am happy to provide additional details. I would also be glad to contribute to the implementation if the community is interested in this idea.

Future-House / paper-qa