Closed by msageryd 3 years ago
I don't believe that this is the case, but I might be wrong.
PDF structure is quite complex: it uses indirection (objects are located through a cross-reference table that typically sits at the end of the file), and some parts may or may not be compressed. Seeking seems mandatory for successful parsing, so the stream must be buffered. Moreover, you must be able to seek through the whole document, so the buffer can be no smaller than the document itself.
This is equivalent to reading the whole document from the input stream and then pushing the whole document to the output stream. You can use an in-memory document for this, but don't forget to set a size threshold above which you fall back to a file-backed document, to avoid running out of memory.
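To make the buffering idea concrete, here is a minimal sketch in Python (chosen only for illustration; it does not use Poppler or any binding's API). It relies on the standard library's `tempfile.SpooledTemporaryFile`, which keeps data in memory up to `max_size` and then rolls over to a temporary file. The function name `buffer_stream` and the 8 MiB threshold are assumptions made up for this example.

```python
import io
import tempfile

# Hypothetical threshold: documents up to 8 MiB stay in memory.
MAX_IN_MEMORY_BYTES = 8 * 1024 * 1024


def buffer_stream(source, max_in_memory=MAX_IN_MEMORY_BYTES, chunk_size=64 * 1024):
    """Copy a non-seekable binary stream into a seekable buffer.

    The buffer lives in memory until it grows past max_in_memory,
    after which the standard library moves it to a temporary file.
    """
    buffered = tempfile.SpooledTemporaryFile(max_size=max_in_memory)
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # empty read means end of stream
            break
        buffered.write(chunk)
    buffered.seek(0)  # rewind so a parser can start from the beginning
    return buffered


if __name__ == "__main__":
    # Stand-in for an incoming upload; a real PDF would come from the network.
    incoming = io.BytesIO(b"%PDF-1.7 ... %%EOF")
    doc = buffer_stream(incoming)
    # A PDF parser typically jumps to the end of the file first.
    doc.seek(-5, io.SEEK_END)
    print(doc.read())
    doc.close()
```

Once the whole document is buffered, the same buffer can be handed to the parser and then re-read from offset 0 for the upload, so the input is consumed only once.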
Thanks, I suspected something like that.
I would like to get some basic PDF information on the fly while streaming my files to S3.
Would it be possible to use PopplerDocument as a through stream so I can harvest pageCount and the width and height of each page on the fly?