Closed mattgibson closed 7 years ago
@mattgibson ,
Thank you for the suggestion and for exposing an issue (or limitation), however, I think this might not be a viable solution.
Allow me to explain...
By using the whole object as a key (vs. using object_id
as a key), we are eliminating duplication of content, not just duplication of references.
i.e.: When two (or more) PDF objects that share the same font are unified, these font objects have the same data but they are different objects (font1 == font1_copy
but font1.object_id != font1_copy.object_id
). The function you edited makes sure to unify all font1
copies to a single object.
You should see noticeable file size differences when combining many PDF objects that share data (i.e., PDF files with the same font or the same pictures / logos embedded within).
I have no idea why recursion occurs, but it's definitely worth exploring... You say it only happens when using Sidekiq and I'm wondering what the difference between the two approaches might be. Is there any difference in the code executed?
I just want to point out that the reason why this "only" happens under Sidekiq is because Sidekiq itself already has dozens of stack depth before it even hits the user-implemented def perform
method, so your PDFs is likely on the edge of exhibiting this symptom. If your PDFs required anymore recursion, then it will fail even if you ran it in the Rails console (which likely has only a fewer layers of stack overhead than Sidekiq).
Hi Peter,
Thank you for bringing this up and for finding the root cause.
I'm keeping the PR open because I agree that we need to find a solution.
However, as I explained, we need to compare the actual data not the object_id
. The code's job is to remove duplicate data and replace it with references.
If we compare object_id
s than we are checking for references instead of checking for duplicate data (in different objects).
For this reason, this PR isn't ready for merging. We need to find a different solution / approach.
Especially now that you explained the root of the problem, I'm thinking it might be even more important to find a solution. You might know this, but the main Ruby thread should come with ~4Mb of stack memory (by default), while new threads should have only ~1Mb of stack memory (by default). This means that the issue isn't just sidekiq*, but every threaded implementation.
* (although if sidekiq uses a lot of the stack, it might be considered an actual sidekiq issue or design error, since sidekiq might run on a smaller stack to begin with)...
Could you test using the Object#hash
method instead of eql?
I'm assuming hash
will test the actual data, so it could probably replace the same functionality.
I was sure that eql?
was using hash
under the hood, but if it isn't, this might be an interesting workaround... kinda depends if hash
will go as deep in the stack and whether it's recursive or flattened.
i.e.
@objects.each { |obj| existing[obj.hash] = obj }
...
if obj.is_a?(Hash)
referenced = obj[:referenced_object]
if referenced && referenced.any?
tmp = resolved[referenced.object_id] || existing[referenced.hash]
if tmp
obj[:referenced_object] = tmp
else
resolved[obj.object_id] = referenced
existing[referenced.hash] = referenced
should_resolve << referenced
@objects << referenced
end
else
What do you think about this approach?
(The words hash
and Hash
are so overloaded. I'm using lowercase h
to mean the #hash
computation used by eql?
, while uppercase H
means the Hash
type)
@boazsegev Your solution will not work as eql?
utilizes hash
under the hood, so both eql?
and hash
will return the same equality value.
The root problem is that PDFs do NOT have a limit to how deeply nested its internal structure can be, whereas parser in general do have a stack depth (unless you're parsing using a stackless language). This is a fundamental design flaw with complex documents like PDFs (sighs).
Without the actual document to look at, here's my theory of what is likely going on:
Because eql?
has to traverse every value of the Hash
(including nested Hash
es) to compute its hash
, it's possible that a particular PDF file that @mattgibson is using has an extremely high nesting factor, and calling eql?
on a Hash
with extremely deep nesting throws the stack level too deep
exception because, well, there's simply not enough stack space to traverse the entire Hash
to compute its hash
!
There's no way around this if we stick with Ruby's internal eql?
implementation. In order to compute the hash
of a nestable object (Array
, Hash
, Set
), every child value must be traversed. This is comparable to a "shallow copy" vs "deep copy".
One (slow) solution is to perform our own hash
computation of Hash
es by using an interative algorithm instead of a recursive one (i.e. perform everything in a huge while loop and pop the Hash
key/values to the top level for processing).
I hacked together a script to test the number of nested depth before Ruby runs out of stack space:
def compute_nest_depth
h = {nest: {}}
nest = h[:nest]
i = 0
while true
i += 1
puts i if (i-1) % 100 == 0
puts i if i > 3001
next_nest = { nest: {} }
nest[:nest] = next_nest
nest = next_nest[:nest]
h.hash
end
end
I ran it on my local Ruby 2.3.1 using both bin/rails c
and just irb
, I got similar results before the process either crashes (as in bin/rails c
) or throws stack depth exception (as in irb
):
bin/rails c #=> 3040
irb #=> 3044
(4 more nests than bin/rails c
...)
With Ruby 2.4.0, there's actually a less stack space as it throws stack depth exception sooner:
irb #=> 3009
Now... what about Sidekiq, you ask? Well, I performed the same test in a Sidekiq runner and voila:
379
Yep. It crashed after "only" 379 stacks (Remember, it has 1/4th the amount of stack space because it's being run in a spawned thread, and plus it has its own Sidekiq runner overhead). That should still be enough for the vast majority of use cases, but there are the odd PDFs out there that have graphics and nested shapes, patterns, etc. exported from Photoshop, Illustrator or the like.
I guarantee you that Sidekiq will not fix this issue. Supporting recursive algorithms is not high on their priority list. Plus, with a high level language like Ruby, it's very hard to optimize for stack space usage anyways.
I was afraid that this might be the case... I guess I was hoping that if we take enough time, the next Ruby version, or maybe the Ruby 3x3 initiative, will flatten the underlying #hash
algorithm so it isn't recursive.
I worked to minimize recursion in the gem, but I don't feel comfortable doing anything to prevent recursion in the underlying implementation.
I can't manage memory in Ruby as well as I could in C and I am scared I might end up writing code that, inadvertently, creates intermediate objects and consumes huge amounts of memory within a loop.
I'll keep this PR open just in hopes that inspiration will strike (either us or someone else who might end up reading this).
Wow... I ran your test and noticed that it's even worst than I thought...
I might be wrong, but it seems that this might be an actual Ruby layer issue and not just a stack size limitation. I don't think sidekiq can do anything to make it better.
Please tell me what you think about this:
Running the test you wrote in irb
, I got either a catchable exception after 3044 iterations or and a segmentation fault 11
error once every other time... however, if the exception was raised, irb
was still running and I could perform new tasks... (such as run the test again)... so I could theoretically handle failures inside my code (except every other time, it would all crash with the segmentation fault).
However, running your test in a new thread (Thread.new { compute_nest_depth }
) caused the whole Ruby interpreter (Ruby 2.4.0) to crash every single time, throwing me back to bash
(this time with a C stack trace and a bur report)...
I believe this indicates an actual interpreter issue, or am I expecting too much?
~Wow, this might be a legitimate segfault and will crash the entire interpreter! I believe this should definitely be reported to the Ruby maintainers and be fixed for the next release.~
~Instead of crashing, we expect it to throw SystemStackError: stack level too deep
.~
Actually, I just tested this on Ruby 2.4.0, and I was able to catch the exception every time without any Segfaults, hmm... Here's what I ran:
def compute_nest_depth
h = {nest: {}}
nest = h[:nest]
i = 0
while true
i += 1
puts i if (i-1) % 100 == 0
puts i if i > 3001
next_nest = { nest: {} }
nest[:nest] = next_nest
nest = next_nest[:nest]
h.hash
end
end
begin
compute_nest_depth
rescue SystemStackError => e
puts "Hello world!"
end
I then run it via ruby crash.rb
and the output is a bunch of numbers (expected) follow by "Hello world!" (also expected). I ran this test over 10 times and I did not see a single segfault with C-stack traces 🤔...
Update: Here's my ruby -v
:
ruby 2.4.0p0 (2016-12-24 revision 57164) [x86_64-darwin15]
Oh, my bad, I'm actually using Ruby ruby 2.3.2p217 (2016-11-15 revision 56796) [x86_64-darwin16]
...
The Ruby 2.4.0 is on the Ubuntu machine...
P.S.
I edited your comment to fix a spelling mistake (the function name was misspelled in the begin
section.
For testing, I also place the begin
section in a loop (irb
only crashed on the second call).
Actually, it still crashes on my Ruby 2.4.0.
With Ubuntu 16.04.1 LTS (GNU/Linux 4.4.0-59-generic x86_64), ruby 2.4.0p0 (2016-12-24 revision 57164) [x86_64-linux]
, this code will cause a core dump:
def compute_nest_depth
h = {nest: {}}
nest = h[:nest]
i = 0
while true
i += 1
puts "nested #{i}" if ((i & 511) == 0)
next_nest = { nest: {} }
nest[:nest] = next_nest
nest = next_nest[:nest]
h.hash
end
rescue SystemStackError
puts "Stack exploded at nesting #{i}"
end
counter = 0;
while(true)
begin
counter +=1
puts "starting test #{counter}"
compute_nest_depth
rescue SystemStackError => e
nil
ensure
puts "test #{counter} complete"
end
end
The output shows that test 2 will cause the core to dump:
starting test 1
nested 512
nested 1024
nested 1536
nested 2048
nested 2560
Stack exploded at nesting 2783
test 1 complete
starting test 2
nested 512
nested 1024
nested 1536
nested 2048
nested 2560
Segmentation fault (core dumped)
However, the issue doesn't occur within a thread (unlike Ruby 2.3.2).
i.e., with the same function from before:
def compute_nest_depth
h = {nest: {}}
nest = h[:nest]
i = 0
while true
i += 1
puts "nested #{i}" if ((i & 511) == 0)
next_nest = { nest: {} }
nest[:nest] = next_nest
nest = next_nest[:nest]
h.hash
end
rescue SystemStackError
puts "Stack exploded at nesting #{i}"
end
counter = 0;
while(true)
counter += 1
thread = Thread.new { compute_nest_depth }
thread.join
puts "completed test #{counter}"
end
Output:
...
Stack exploded at nesting 346
completed test 368
... (^C)
I can confirm that on my macOS, with ruby 2.4.0p0 (2016-12-24 revision 57164) [x86_64-darwin16]
, the interpreter crashes.
I think the difference is that when you were testing crash.rb
, the exception was only occurring once for every Ruby process.
@boazsegev Thanks for opening that ticket!
@mattgibson and @peteygao
I thought I'd drop a note: I'm resolved the issue and I'm closing this PR.
Version 1.0.3 now includes a limited recursive equality test (limited to 3 layers of nested objects by default). This prevents the "stack level too deep" exception as well as speeds some things up.
Thanks for your patience on this one.
This prevents an error when adding a lot of PDFs together, which only seems to manifest when using Sidekiq. I only hit the error when combining over 140 PDFs and I still don't know quite what the root cause was apart from the
eql?
method got into a recursive loop when comparing the incoming objects with the existing keys. Running in the Rails console worked fine. Whatever the ultimate issue, this fixes it.