Closed grdw closed 6 years ago
My vote would be on the following:
format_parser_pdf
or something like that. The gem would register itself via FormaParser.register_parser ...
Use pdf-reader gem, packaged in a separate "adapter" gem so that an explicit optional dependency can be made - with a "shim" adapter gem called format_parser_pdf [..]
So if I understand correctly:
pdf-reader
as a dependencyformat_parser.gemspec
What's the underlying idea behind this? To one day swap out the pdf-reader gem with something custom?
Not quite. The idea is to make a format_parser_pdf
gem that depends on both format_parser
and pdf_reader
. This way, if you want to have PDF support, you need to install that gem and pdf-reader
. If you don't need PDF detailed PDF support then you won't need pdf-reader
as a (rather large) dependency.
Makes sense. It does feel a bit like an in-between solution for:
Find a way to teach pdf-reader to skip over streams, because otherwise we are looking at a straight-ahead read of everything into memory, including potentially exploit-y bits such as inflating compressed streams etc.
Would you like met to pick up the format_parser_pdf
gem creation? Let me know.
Yes, please do 👍 Maybe also worthy to try to do the necessary overrides in pdf-reader first.
Maybe also worthy to try to do the necessary overrides in pdf-reader first
@julik What would be a good approach to do these overrides, in your opinion? Fork it to my github account or fork it to wetransfer's account and make the changes? Or just include the PDF reader gem and override the modules and classes within the gem itself?
nvm;
Or just include the PDF reader gem and override the modules and classes within the gem itself?
I went with this approach.
You can follow the progress here: https://github.com/WeTransfer/format_parser_pdf . I'm following the pdf-reader's code to see where the Stream
stuff is read and used. Thusfar I found these lines, but there might be more for all I know. I'll take a more thorough look later today. 🔎
The old (byte-by-byte)
page_count
implemenation has been removed from this gem and rightfully so. There are however a few alternatives to consider on how to move forward and reimplement this method.I did a very simple
Benchmark.ips
test* with 36 random PDF files, with 4 different libs/parsers and see which performed better, including the PDF parser as it is on the pdf-parser-rework branch.Results: In bulk (iterations = 100, so 100x36 = 3600 PDF's parsed per binary/lib)
All-in-all:
I'm a bit amazed by the fact that the PDFReader - although it's size as a dependency - is still almost 2 times as fast as the 'custom' PDFReader.
pdf_info
andimagemagick
both tried to 'automatically repair' the Xref tables for broken PDF's which took an insane amount of time forimagemagick
, less so forpdfinfo
but it still bubbles up.How to move forward?
There are a few options to consider:
page_count
I'm not sure which route is the best route to take in this case, but my preference now leans towards using the PDFReader. I don't know how you feel about it @julik, as you put most work in the custom PDF parser.
*Code I used for this benchmark: https://gist.github.com/grdw/dae430ffb3efe291c0be0c423a5b3597