Use object_id as hash key instead of whole object.

mattgibson commented 8 years ago

This prevents an error when adding a lot of PDFs together, which only seems to manifest when using Sidekiq. I only hit the error when combining over 140 PDFs and I still don't know quite what the root cause was apart from the eql? method got into a recursive loop when comparing the incoming objects with the existing keys. Running in the Rails console worked fine. Whatever the ultimate issue, this fixes it.

SystemStackError: stack level too deep
- 286 non-project frames
1
File "/app/vendor/bundle/ruby/2.2.0/gems/combine_pdf-0.2.31/lib/combine_pdf/pdf_protected.rb" line 36 in eql?
2
File "/app/vendor/bundle/ruby/2.2.0/gems/combine_pdf-0.2.31/lib/combine_pdf/pdf_protected.rb" line 36 in eql?
3
File "/app/vendor/bundle/ruby/2.2.0/gems/combine_pdf-0.2.31/lib/combine_pdf/pdf_protected.rb" line 36 in eql?
4
File "/app/vendor/bundle/ruby/2.2.0/gems/combine_pdf-0.2.31/lib/combine_pdf/pdf_protected.rb" line 36 in eql?

boazsegev commented 8 years ago

@mattgibson ,

Thank you for the suggestion and for exposing an issue (or limitation), however, I think this might not be a viable solution.

Allow me to explain...

By using the whole object as a key (vs. using object_id as a key), we are eliminating duplication of content, not just duplication of references.

i.e.: When two (or more) PDF objects that share the same font are unified, these font objects have the same data but they are different objects (font1 == font1_copy but font1.object_id != font1_copy.object_id). The function you edited makes sure to unify all font1 copies to a single object.

You should see noticeable file size differences when combining many PDF objects that share data (i.e., PDF files with the same font or the same pictures / logos embedded within).

I have no idea why recursion occurs, but it's definitely worth exploring... You say it only happens when using Sidekiq and I'm wondering what the difference between the two approaches might be. Is there any difference in the code executed?

peteygao commented 7 years ago

I just want to point out that the reason why this "only" happens under Sidekiq is because Sidekiq itself already has dozens of stack depth before it even hits the user-implemented def perform method, so your PDFs is likely on the edge of exhibiting this symptom. If your PDFs required anymore recursion, then it will fail even if you ran it in the Rails console (which likely has only a fewer layers of stack overhead than Sidekiq).

boazsegev commented 7 years ago

Hi Peter,

Thank you for bringing this up and for finding the root cause.

I'm keeping the PR open because I agree that we need to find a solution.

However, as I explained, we need to compare the actual data not the object_id. The code's job is to remove duplicate data and replace it with references.

If we compare object_ids than we are checking for references instead of checking for duplicate data (in different objects).

For this reason, this PR isn't ready for merging. We need to find a different solution / approach.

Especially now that you explained the root of the problem, I'm thinking it might be even more important to find a solution. You might know this, but the main Ruby thread should come with ~4Mb of stack memory (by default), while new threads should have only ~1Mb of stack memory (by default). This means that the issue isn't just sidekiq*, but every threaded implementation.

* (although if sidekiq uses a lot of the stack, it might be considered an actual sidekiq issue or design error, since sidekiq might run on a smaller stack to begin with)...

Could you test using the Object#hash method instead of eql?

I'm assuming hash will test the actual data, so it could probably replace the same functionality.

I was sure that eql? was using hash under the hood, but if it isn't, this might be an interesting workaround... kinda depends if hash will go as deep in the stack and whether it's recursive or flattened.

i.e.

@objects.each { |obj| existing[obj.hash] = obj }

...

         if obj.is_a?(Hash)
            referenced = obj[:referenced_object]
            if referenced && referenced.any?
             tmp = resolved[referenced.object_id] || existing[referenced.hash]
              if tmp
                obj[:referenced_object] = tmp
              else
                resolved[obj.object_id] = referenced
                existing[referenced.hash] = referenced
                should_resolve << referenced
                @objects << referenced
              end
           else

What do you think about this approach?

peteygao commented 7 years ago

(The words hash and Hash are so overloaded. I'm using lowercase h to mean the #hash computation used by eql?, while uppercase H means the Hash type)

@boazsegev Your solution will not work as eql? utilizes hash under the hood, so both eql? and hash will return the same equality value.

The root problem is that PDFs do NOT have a limit to how deeply nested its internal structure can be, whereas parser in general do have a stack depth (unless you're parsing using a stackless language). This is a fundamental design flaw with complex documents like PDFs (sighs).

Without the actual document to look at, here's my theory of what is likely going on: Because eql? has to traverse every value of the Hash (including nested Hashes) to compute its hash, it's possible that a particular PDF file that @mattgibson is using has an extremely high nesting factor, and calling eql? on a Hash with extremely deep nesting throws the stack level too deep exception because, well, there's simply not enough stack space to traverse the entire Hash to compute its hash!

There's no way around this if we stick with Ruby's internal eql? implementation. In order to compute the hash of a nestable object (Array, Hash, Set), every child value must be traversed. This is comparable to a "shallow copy" vs "deep copy".

One (slow) solution is to perform our own hash computation of Hashes by using an interative algorithm instead of a recursive one (i.e. perform everything in a huge while loop and pop the Hash key/values to the top level for processing).

peteygao commented 7 years ago

I hacked together a script to test the number of nested depth before Ruby runs out of stack space:

def compute_nest_depth
  h = {nest: {}}
  nest = h[:nest]
  i = 0

  while true
    i += 1
    puts i if (i-1) % 100 == 0
    puts i if i > 3001
    next_nest = { nest: {} }
    nest[:nest] = next_nest
    nest = next_nest[:nest]
    h.hash
  end
end

I ran it on my local Ruby 2.3.1 using both bin/rails c and just irb, I got similar results before the process either crashes (as in bin/rails c) or throws stack depth exception (as in irb): bin/rails c #=> 3040 irb #=> 3044 (4 more nests than bin/rails c...)

With Ruby 2.4.0, there's actually a less stack space as it throws stack depth exception sooner: irb #=> 3009

Now... what about Sidekiq, you ask? Well, I performed the same test in a Sidekiq runner and voila: 379

Yep. It crashed after "only" 379 stacks (Remember, it has 1/4th the amount of stack space because it's being run in a spawned thread, and plus it has its own Sidekiq runner overhead). That should still be enough for the vast majority of use cases, but there are the odd PDFs out there that have graphics and nested shapes, patterns, etc. exported from Photoshop, Illustrator or the like.

I guarantee you that Sidekiq will not fix this issue. Supporting recursive algorithms is not high on their priority list. Plus, with a high level language like Ruby, it's very hard to optimize for stack space usage anyways.

boazsegev commented 7 years ago

I was afraid that this might be the case... I guess I was hoping that if we take enough time, the next Ruby version, or maybe the Ruby 3x3 initiative, will flatten the underlying #hash algorithm so it isn't recursive.

I worked to minimize recursion in the gem, but I don't feel comfortable doing anything to prevent recursion in the underlying implementation.

I can't manage memory in Ruby as well as I could in C and I am scared I might end up writing code that, inadvertently, creates intermediate objects and consumes huge amounts of memory within a loop.

I'll keep this PR open just in hopes that inspiration will strike (either us or someone else who might end up reading this).

boazsegev commented 7 years ago

Wow... I ran your test and noticed that it's even worst than I thought...

I might be wrong, but it seems that this might be an actual Ruby layer issue and not just a stack size limitation. I don't think sidekiq can do anything to make it better.

Please tell me what you think about this:

Running the test you wrote in irb, I got either a catchable exception after 3044 iterations or and a segmentation fault 11 error once every other time... however, if the exception was raised, irb was still running and I could perform new tasks... (such as run the test again)... so I could theoretically handle failures inside my code (except every other time, it would all crash with the segmentation fault).

However, running your test in a new thread (Thread.new { compute_nest_depth } ) caused the whole Ruby interpreter (Ruby 2.4.0) to crash every single time, throwing me back to bash (this time with a C stack trace and a bur report)...

I believe this indicates an actual interpreter issue, or am I expecting too much?

peteygao commented 7 years ago

~Wow, this might be a legitimate segfault and will crash the entire interpreter! I believe this should definitely be reported to the Ruby maintainers and be fixed for the next release.~

~Instead of crashing, we expect it to throw SystemStackError: stack level too deep.~

peteygao commented 7 years ago

Actually, I just tested this on Ruby 2.4.0, and I was able to catch the exception every time without any Segfaults, hmm... Here's what I ran:

def compute_nest_depth
  h = {nest: {}}
  nest = h[:nest]
  i = 0

  while true
    i += 1
    puts i if (i-1) % 100 == 0
    puts i if i > 3001
    next_nest = { nest: {} }
    nest[:nest] = next_nest
    nest = next_nest[:nest]
    h.hash
  end
end

begin
  compute_nest_depth
rescue SystemStackError => e
  puts "Hello world!"
end

I then run it via ruby crash.rb and the output is a bunch of numbers (expected) follow by "Hello world!" (also expected). I ran this test over 10 times and I did not see a single segfault with C-stack traces 🤔...

Update: Here's my ruby -v: ruby 2.4.0p0 (2016-12-24 revision 57164) [x86_64-darwin15]

boazsegev commented 7 years ago

Oh, my bad, I'm actually using Ruby ruby 2.3.2p217 (2016-11-15 revision 56796) [x86_64-darwin16]...

The Ruby 2.4.0 is on the Ubuntu machine...

P.S.

I edited your comment to fix a spelling mistake (the function name was misspelled in the begin section.

For testing, I also place the begin section in a loop (irb only crashed on the second call).

boazsegev commented 7 years ago

Actually, it still crashes on my Ruby 2.4.0.

With Ubuntu 16.04.1 LTS (GNU/Linux 4.4.0-59-generic x86_64), ruby 2.4.0p0 (2016-12-24 revision 57164) [x86_64-linux], this code will cause a core dump:

def compute_nest_depth
  h = {nest: {}}
  nest = h[:nest]
  i = 0

  while true
    i += 1
    puts "nested #{i}" if ((i & 511) == 0)
    next_nest = { nest: {} }
    nest[:nest] = next_nest
    nest = next_nest[:nest]
    h.hash
  end

rescue SystemStackError
  puts "Stack exploded at nesting #{i}"
end

counter = 0;
while(true)
  begin
    counter +=1
    puts "starting test #{counter}"
    compute_nest_depth
  rescue SystemStackError => e
    nil
  ensure
    puts "test #{counter} complete"
  end
end

The output shows that test 2 will cause the core to dump:

starting test 1
nested 512
nested 1024
nested 1536
nested 2048
nested 2560
Stack exploded at nesting 2783
test 1 complete
starting test 2
nested 512
nested 1024
nested 1536
nested 2048
nested 2560
Segmentation fault (core dumped)

However, the issue doesn't occur within a thread (unlike Ruby 2.3.2).

i.e., with the same function from before:

def compute_nest_depth
  h = {nest: {}}
  nest = h[:nest]
  i = 0

  while true
    i += 1
    puts "nested #{i}" if ((i & 511) == 0)
    next_nest = { nest: {} }
    nest[:nest] = next_nest
    nest = next_nest[:nest]
    h.hash
  end

rescue SystemStackError
  puts "Stack exploded at nesting #{i}"
end

counter = 0;
while(true)
   counter += 1
   thread = Thread.new { compute_nest_depth } 
   thread.join
   puts "completed test #{counter}"
end

Output:

...
Stack exploded at nesting 346
completed test 368
... (^C)

boazsegev commented 7 years ago

I can confirm that on my macOS, with ruby 2.4.0p0 (2016-12-24 revision 57164) [x86_64-darwin16], the interpreter crashes.

I think the difference is that when you were testing crash.rb, the exception was only occurring once for every Ruby process.

boazsegev commented 7 years ago

P.S. I opened an issue to inform the Ruby core team.

peteygao commented 7 years ago

@boazsegev Thanks for opening that ticket!

boazsegev commented 7 years ago

@mattgibson and @peteygao

I thought I'd drop a note: I'm resolved the issue and I'm closing this PR.

Version 1.0.3 now includes a limited recursive equality test (limited to 3 layers of nested objects by default). This prevents the "stack level too deep" exception as well as speeds some things up.

Thanks for your patience on this one.

boazsegev / combine_pdf

Use object_id as hash key instead of whole object. #91