boazsegev / combine_pdf

A Pure ruby library to merge PDF files, number pages and maybe more...
MIT License

Optimize #add_referenced #238

Open pkmiec opened 6 months ago

pkmiec commented 6 months ago

Summary

The #add_referenced method kept track of existing objects with a Hash whose keys were the objects themselves. This seemed to be fine with Ruby 2.7, but became significantly slower in Ruby 3.2 (and possibly earlier versions).

Profiling showed that many of those objects are instances of Hash, and Ruby uses the #eql? method to compare Hash keys. It is not clear what actually changed in Ruby, but we can work around the issue by using #hash so that the key is an Integer. We just need to recompute that Integer if the object changes.
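
Roughly, the change amounts to the following (a simplified sketch with made-up variable names, not the actual diff):

# Before: the registry was keyed by the PDF objects themselves (nested Hashes).
# Inserts and lookups make Ruby call #hash on the nested structure and,
# when hash values match, #eql?, which walks both structures recursively.
obj       = { 'Type' => :Page, 'Contents' => { 'Length' => 42 } }
candidate = { 'Type' => :Page, 'Contents' => { 'Length' => 42 } }

seen = {}
seen[obj] = true
seen.key?(candidate)        # => true, after a recursive #eql? comparison

# After: the registry is keyed by the Integer returned by #hash.
# Lookups then compare plain Integers; the Integer just has to be
# recomputed whenever the underlying object is mutated.
seen = {}
seen[obj.hash] = obj
seen.key?(candidate.hash)   # => true, Integer comparison only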

Performance

The benchmark was done with the following script:

require 'benchmark'
require 'combine_pdf'

puts "Ruby: #{RUBY_VERSION}"
puts "CombinePDF: #{CombinePDF::VERSION}"

files = []
68.times { |i| files << "/tmp/pdfs/#{i}.pdf" }
files = files * 10 # to exacerbate the effect

result_pdf = CombinePDF.new
files.each { |file| result_pdf << CombinePDF.parse(IO.read(file)) }
puts(Benchmark.measure do
    result_pdf.save('/tmp/combined.pdf')
end)

FYI ... we end up with 32053 PDF objects in the @objects array.

Before (times are Benchmark's user / system / total / real)

Ruby: 2.7.7 CombinePDF: 1.0.26 2.598427 0.011980 2.610407 ( 2.617881)

Ruby: 3.2.2 CombinePDF: 1.0.26 15.067833 0.026986 15.094819 ( 15.139298)

After

Ruby: 2.7.7 CombinePDF: 1.0.26 (with this PR) 2.768545 0.006937 2.775482 ( 2.786386)

Ruby: 3.2.2 CombinePDF: 1.0.26 (with this PR) 1.997242 0.016295 2.013537 ( 2.021782)

BenMorganMY commented 1 month ago

@pkmiec our use of combine_pdf still experiences the slowdown on Ruby 3.1. I'm wondering if you could explain the use of .hash so that I can find other areas for improvement.

boazsegev commented 1 month ago

Using .hash instead of .eql? adds a new risk of hash collisions – rare, but definitely possible when squashing variable-length multi-byte objects into 64 bits (or, I believe, in Ruby it would actually be limited to 62 bits).

This risk could silently corrupt PDF data, which could be a significant error and make CombinePDF unsuitable for some applications.

Is there another viable approach? Or perhaps it would be better to drop duplication detection instead? How would that affect performance (memory usage will be higher, but other than that...)?
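
For what it's worth, one possible middle ground (purely a sketch, not something this PR implements) would be to keep the Integer keys for speed but confirm with #eql? whenever two Integers coincide, so a collision can never silently merge two distinct objects:

# Hypothetical collision-guarded de-duplication (illustrative only):
# objects are grouped by their #hash Integer, and #eql? is used to confirm
# a real duplicate, so it only walks objects whose Integers already match.
def dedup(registry, obj)
  bucket = (registry[obj.hash] ||= [])
  found  = bucket.find { |existing| existing.eql?(obj) }
  return found if found
  bucket << obj
  obj
end

registry = {}
a = { 'Type' => :Page, 'Index' => 1 }
b = { 'Type' => :Page, 'Index' => 1 }  # same content as a
dedup(registry, a).equal?(a)           # => true, a is registered
dedup(registry, b).equal?(a)           # => true, b is de-duplicated onto a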

pkmiec commented 1 month ago

Hi!

@BenMorganMY Have you already profiled your code and identified that it is still slow in the #add_referenced method? Given a different set of PDFs, you may be hitting slowness in a different part of the code. But to answer your question, the point of the method is to de-duplicate references to objects. I imagine that #eql? needs to "walk" an object to determine whether it is the same as another object. Since the objects are Hashes, this walking becomes a recursive operation and thus more expensive. Perhaps the result of this "walking" was cached in earlier versions of Ruby and now it is not; I could not find anything in Ruby's changelog, so I do not know for sure. Using #hash was a way to force that computation down to a single number, and then compare numbers instead of Hashes.
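
As a rough illustration of the cost difference, here is a small micro-benchmark outside CombinePDF (the object shape and counts are invented; absolute numbers will vary by Ruby version):

require 'benchmark'

objects = Array.new(30_000) do |i|
  { 'Type' => :Page, 'Index' => i, 'Kids' => { 'A' => [i] * 20 } }
end

Benchmark.bm(20) do |x|
  x.report('nested Hash keys') do
    seen = {}
    objects.each { |o| seen[o] = true }    # #hash and, on lookup, #eql? walk the nested Hash
    objects.each { |o| seen.key?(o) }
  end

  x.report('Integer #hash keys') do
    seen = {}
    objects.each { |o| seen[o.hash] = o }  # compute #hash once per probe, then compare Integers
    objects.each { |o| seen.key?(o.hash) }
  end
end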

@boazsegev That's a good question. I do not know the PDF spec well enough to say. When I was looking at this, I wanted to avoid changing the method's behavior so as not to introduce incompatibilities with downstream code or PDF readers.