cantaloupe-project / cantaloupe

High-performance dynamic image server in Java
https://cantaloupe-project.github.io/

Feature request: Linearized PDFs #695

Open skairunner opened 3 months ago

skairunner commented 3 months ago

Cantaloupe currently uses PDFBox, which does not support linearized PDFs. Whenever Cantaloupe needs to render tiles for a PDF file, it has to download the entire file before it can render any individual page. As seen in https://github.com/cantaloupe-project/cantaloupe/issues/198 and https://github.com/cantaloupe-project/cantaloupe/discussions/557, this is slow and impractical once PDFs get large. In our own testing with the S3 backend, the cutoff point seems to be around 100 MB.

It would be nice if Cantaloupe provided a new processor that supports linearized PDFs and, given a storage backend that supports random access, downloads only the page(s) it needs to render tiles. Unfortunately, PDFBox does not support this, so an alternative PDF library would have to be used. From some quick research, there do not seem to be many pure Java libraries with this functionality. The C++ library qpdf does support linearized reading, but using it would complicate the build process, and the processor might have to be distributed as an optional add-on.
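As a rough illustration of why random access matters here: the PDF specification requires the linearization parameter dictionary to sit within the first 1024 bytes of the file, so a processor could decide whether a source is linearized after reading only that prefix. The sketch below is a heuristic under that assumption, not Cantaloupe or PDFBox code; the class name `LinearizedCheck` and its methods are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Cantaloupe or PDFBox.
final class LinearizedCheck {
    // The linearization parameter dictionary must lie entirely within
    // the first 1024 bytes of a linearized PDF.
    static final int HEAD_BYTES = 1024;

    /** Heuristic: true if the file prefix looks like a linearized PDF. */
    static boolean isLikelyLinearized(byte[] head) {
        int n = Math.min(head.length, HEAD_BYTES);
        // ISO-8859-1 maps each byte to exactly one char, so stray
        // binary bytes in the prefix cannot break the decoding.
        String text = new String(head, 0, n, StandardCharsets.ISO_8859_1);
        return text.startsWith("%PDF-") && text.contains("/Linearized");
    }

    /** Reads only the first HEAD_BYTES of a local file and applies the check. */
    static boolean isLikelyLinearized(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return isLikelyLinearized(in.readNBytes(HEAD_BYTES));
        }
    }
}
```

With a remote backend such as S3, the same prefix could in principle be fetched with a ranged read instead of a full download. The check only says whether the file claims to be linearized; fetching individual pages would still need a library that understands the hint tables.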

DiegoPino commented 3 months ago

@skairunner I am aware that PDFBox cannot produce linearized PDFs, but (to my knowledge) it can still read them. Are you sure your linearized PDFs cannot be delivered? Of course, PDFBox cannot stream or jump to a specific page, which is a bottleneck. We have users with PDFs of 600+ MB on S3, and Cantaloupe generates derivatives correctly, but it requires the source cache to be available. Memory consumption is an issue, but there are ways to reduce it (a pull request might come from me soon), such as enabling PDFBox's subsampling flag with PDFRenderer.setSubsamplingAllowed(true). qpdf might be a solution, but it would require building a completely new processor. More importantly, the approach to new processor architecture should be discussed with the development team, since there has been a trend of moving away from external binaries for handling derivatives (e.g. the ImageMagick processor was removed).

skairunner commented 3 months ago

Yes, to my knowledge PDFBox doesn't have any problems with consuming linearized PDFs. In our tests, if Cantaloupe doesn't run out of memory, it does deliver the tiles eventually. If the PDF file is not in the filesystem cache, Cantaloupe has to download the entire large PDF file before generating tiles, which takes a while. The IIIF viewer we are using (Universal Viewer) requests several pages at once, which seems to make Cantaloupe request the same source multiple times and also open it in memory multiple times, which can kill it.
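One general way to soften the "same source requested multiple times" problem is to coalesce concurrent loads of the same key so that only the first caller does the work and the others wait for its result. The sketch below shows that idea in plain Java. It is an assumption about how such deduplication could look, not a description of how Cantaloupe handles sources, and the `SingleFlight` class is hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Hypothetical helper: coalesces concurrent loads of the same key.
final class SingleFlight<K, V> {
    private final ConcurrentMap<K, CompletableFuture<V>> inFlight =
            new ConcurrentHashMap<>();

    /** Runs loader once per key at a time; concurrent callers share the result. */
    V get(K key, Function<K, V> loader) {
        CompletableFuture<V> mine = new CompletableFuture<>();
        CompletableFuture<V> existing = inFlight.putIfAbsent(key, mine);
        if (existing != null) {
            // Another caller is already loading this key: wait for its result.
            return existing.join();
        }
        try {
            V value = loader.apply(key);
            mine.complete(value);
            return value;
        } catch (Throwable t) {
            // Propagate the failure to waiting callers, then rethrow.
            mine.completeExceptionally(t);
            throw t;
        } finally {
            // Later requests for the same key trigger a fresh load.
            inFlight.remove(key, mine);
        }
    }
}
```

A caller would wrap the expensive step, for example `singleFlight.get(identifier, id -> downloadSource(id))`, where `downloadSource` is a placeholder for whatever fetches the file. This only avoids duplicate downloads while one is in progress; it does not replace a source cache and does nothing for the memory cost of rendering.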

The ideal outcome is having a processor in Cantaloupe that can take advantage of linearized PDFs and deliver tiles for random pages with low latency and memory usage.

I definitely didn't intend this feature to be implemented in a speedy manner though. It seems like a large amount of work, after all, and for a fairly niche audience 😓 But having tracking issues is always a good thing and might help someone else down the road.

DiegoPino commented 3 months ago

Hi @skairunner, totally. IABookreader does the same as Universal Viewer, which really does not help much. Having to download the large file is also an issue for other processors that cannot stream, e.g. the OpenJPEG one for JPEG 2000, and in general it is an issue with remote storage (the same goes for a 1 GB+ non-pyramidal TIFF). I will share the use case/need in our call next week. Thanks!