WeTransfer / format_parser

file metadata parsing, done cheap
https://rubygems.org/gems/format_parser
Other
62 stars 18 forks source link

Add page count back into the PDF parser #130

Closed grdw closed 6 years ago

grdw commented 6 years ago

The old (byte-by-byte) page_count implemenation has been removed from this gem and rightfully so. There are however a few alternatives to consider on how to move forward and reimplement this method.

I did a very simple Benchmark.ips test* with 36 random PDF files, with 4 different libs/parsers and see which performed better, including the PDF parser as it is on the pdf-parser-rework branch.

Results: In bulk (iterations = 100, so 100x36 = 3600 PDF's parsed per binary/lib)

                                                   user     system      total        real
a batch of PDF files - with custom parser       41.214933  14.356696  55.571629 ( 55.793098)
a batch of PDF files - with `pdfinfo` binary     0.335411   1.538438  83.209592 ( 91.608172)
a batch of PDF files - with pdf-reader gem      23.677051   5.400772  29.077823 ( 29.128269)
a batch of PDF files - with `imagemagick`      --- I aborted - this took to long ---

All-in-all:

  1. pdf-reader.
  2. Custom parser at pdf-parser-rework branch.
  3. pdfinfo
  4. imagemagick

I'm a bit amazed by the fact that the PDFReader - although it's size as a dependency - is still almost 2 times as fast as the 'custom' PDFReader.

pdf_info and imagemagick both tried to 'automatically repair' the Xref tables for broken PDF's which took an insane amount of time for imagemagick, less so for pdfinfo but it still bubbles up.

How to move forward?

There are a few options to consider:

  1. Use the PDFReader to extract the page_count
  2. Rework the custom PDF parser in such a fashion that it's faster than the PDFReader gem.
  3. Fork the PDFReader and strip the parts we don't need and end up with a custom parser?
  4. Something else?

I'm not sure which route is the best route to take in this case, but my preference now leans towards using the PDFReader. I don't know how you feel about it @julik, as you put most work in the custom PDF parser.


*Code I used for this benchmark: https://gist.github.com/grdw/dae430ffb3efe291c0be0c423a5b3597

julik commented 6 years ago

My vote would be on the following:

grdw commented 6 years ago

Use pdf-reader gem, packaged in a separate "adapter" gem so that an explicit optional dependency can be made - with a "shim" adapter gem called format_parser_pdf [..]

So if I understand correctly:

  1. Create a new gem
  2. Add pdf-reader as a dependency
  3. Include this new gem in the format_parser.gemspec
  4. ...

What's the underlying idea behind this? To one day swap out the pdf-reader gem with something custom?

julik commented 6 years ago

Not quite. The idea is to make a format_parser_pdf gem that depends on both format_parser and pdf_reader. This way, if you want to have PDF support, you need to install that gem and pdf-reader. If you don't need PDF detailed PDF support then you won't need pdf-reader as a (rather large) dependency.

grdw commented 6 years ago

Makes sense. It does feel a bit like an in-between solution for:

Find a way to teach pdf-reader to skip over streams, because otherwise we are looking at a straight-ahead read of everything into memory, including potentially exploit-y bits such as inflating compressed streams etc.

Would you like met to pick up the format_parser_pdf gem creation? Let me know.

julik commented 6 years ago

Yes, please do 👍 Maybe also worthy to try to do the necessary overrides in pdf-reader first.

grdw commented 6 years ago

Maybe also worthy to try to do the necessary overrides in pdf-reader first

@julik What would be a good approach to do these overrides, in your opinion? Fork it to my github account or fork it to wetransfer's account and make the changes? Or just include the PDF reader gem and override the modules and classes within the gem itself?

grdw commented 6 years ago

nvm;

Or just include the PDF reader gem and override the modules and classes within the gem itself?

I went with this approach.

You can follow the progress here: https://github.com/WeTransfer/format_parser_pdf . I'm following the pdf-reader's code to see where the Stream stuff is read and used. Thusfar I found these lines, but there might be more for all I know. I'll take a more thorough look later today. 🔎