apache / datasketches-java

A software library of stochastic streaming algorithms, a.k.a. sketches.
https://datasketches.apache.org
Apache License 2.0
878 stars 206 forks source link

ThetaSketch update hangs #162

Closed hpx7 closed 6 years ago

hpx7 commented 6 years ago

Here's a minimal repro:

Union sketch = new SetOperationBuilder()
        .setNominalEntries(4096)
        .buildUnion();
for (int i = 0; i < 64; i++) {
    sketch.update(i);
}

// serialize + deserialize
sketch = Sketches.wrapUnion(WritableMemory.wrap(sketch.toByteArray()));

for (int i = 64; i < 128; i++) {
    sketch.update(i);
}

sketch.update(128); // this hangs

It seems to hang in HashOperations#fastHashSearchOrInsert

I'm guessing there is some state that isn't getting serialized?

hpx7 commented 6 years ago

After serialization + deserialization, it looks like the resize factor has dropped from 8 to 1.

Not clear if HashOperations#fastHashSearchOrInsert was really hanging forever or just taking a really long time. But this does seem problematic. Going to try and see if I can PR a fix.

jmalkin commented 6 years ago

Well, I found where resize factor isn't being set...

jmalkin commented 6 years ago

Fixing that works, and I have a hypothesis why. Not 100% sure, but it does fix that test. Will work on a PR this evening; in meetings today.

jmalkin commented 6 years ago

On further examination, I found a band-aid that stops the infinite loop but doesn't address the underlying issue. Maybe that's sufficient for now, but it's not a real fix.

The key issue is that we go into this else condition, despite what the comment says: https://github.com/DataSketches/sketches-core/blob/master/src/main/java/com/yahoo/sketches/theta/DirectQuickSelectSketch.java#L258 Resize factor is x1 (meaning don't resize) and the current array can't grow anyway because there's no additional capacity in the memory. It'd heapify itself, which is fine, but the problem is that it's not supposed to get there in the first place.

So the infinite loop happens because it keeps searching the hash table forever. It's a unique value that doesn't collide, but there's no open slot, either. The question is why this isn't caught elsewhere.

jmalkin commented 6 years ago

Interestingly, this isn't an issue with heapifyUpdateSketch() because it ignores ResizeFactor 1X's specification not to resize and doubles the allocated size anyway.

Gotta figure out what the correct thing to do here is. Not going to commit a patch until we've decided on the correct behavior.

hpx7 commented 6 years ago

Thanks for jumping on this so fast @jmalkin!

jmalkin commented 6 years ago

@AlexanderSaydakov reminded me that creating a sketch with resize factor set to not resize means we allocate the entire size up front. That logic appears valid for both on- and off-heap sketches, so the only issue seems to be that resize factor was not properly written.

leerho commented 6 years ago

Actually, there are two problems. The first is that the resize factor was not properly set.

The second is that the HashOperations find() functions (written some time ago) use a while{} loop instead of a do{}. The do{} form I developed later (see the HLL sketches) to catch this very kind of state failure. The do{} loop saves the starting index and if the loop returns to that same starting index a state exception is thrown. There can be multiple causes for this state failure to happen. The incorrect setting of RF is one, corruption is the other.

jmalkin commented 6 years ago

@leerho: That's a good point, although properly setting resize factor is sufficient to (validly) avoid the infinite loop in this case.

heapifyUpdateSketch() wasn't impacted in this case because it would call resize() and ignore ResizeFactor.X1, forcing a size doubling. wrapUpdateSketch() simply assumed everything was properly configured. Neither of those is ideal; both places should probably be validating that sizes are correct if we have RF.X1 and an under-full sketch. That would prevent the (re)construction of a corrupt sketch in the first place.

Anyway, I have a PR for resize factor now. I'll look at the do() loop next.

jmalkin commented 6 years ago

@hpx7 The PR has been merged into master. We'd ideally like a few days to improve robustness checks elsewhere in the code before pushing a new release. Are you able to build locally from master for your use case?

leerho commented 6 years ago

Before we release I strongly suggest we look at all similar find() routines and convert the ones we can to the do{} form because it is so much more robust. Jon, not sure who you were addressing with your last sentence.

jmalkin commented 6 years ago

Lee, that was for the original reporter of the issue.

Given your preference for changing the various flavors of find() methods, I think we should also consider adding better self-consistency checks in heapify() and wrap().

leerho commented 6 years ago

Agreed.

hpx7 commented 6 years ago

Thanks for the patch @jmalkin. I'll wait for the next release.

jmalkin commented 6 years ago

I've now changed the hash table operations to check for infinite loops (will throw an exception), and heapify/wrap should also check that sufficient memory is allocated if you're not allowing the sketch to resize itself. So we should have fixed the root cause, have better protection against the underlying error slipping through, and will throw an exception if there's a detected problem (well, if you're just searching the hash table it'll return not found). All merged into master.

Closing this, although we'll discuss any other changes we want before pushing a new version.

hpx7 commented 6 years ago

Thanks @jmalkin. If you had to guess, what would you say is the eta for a new release?

jmalkin commented 6 years ago

@leerho is working on Improving test coverage in a few places that aren't as well tested yet are pretty critical. Depending on how fast that goes, Friday (US Pacific time) would be great. Otherwise we'll hope for early next week.

jmalkin commented 6 years ago

I should add that I tried speed tests with the infinite loop detection code, and any performance difference was not obvious. My fastest single trial run was with the new code so it's not obviously worse, although that run seemed anomalous. So I'm just going to claim no apparent degradation.

AlexanderSaydakov commented 6 years ago

sketches-core-0.10.2 has been released

hpx7 commented 6 years ago

Hmm deserialization of old ThetaSketches is now breaking for me:

Caused by: com.yahoo.sketches.SketchesArgumentException: Possible corruption: ResizeFactor X1, but provided array too small for sketch size
    at com.yahoo.sketches.theta.DirectQuickSelectSketch.writableWrap(DirectQuickSelectSketch.java:164)
    at com.yahoo.sketches.theta.UnionImpl.wrapInstance(UnionImpl.java:163)
    at com.yahoo.sketches.theta.SetOperation.wrap(SetOperation.java:155)
    at com.yahoo.sketches.theta.SetOperation.wrap(SetOperation.java:134)
    at com.yahoo.sketches.theta.Sketches.wrapUnion(Sketches.java:180)
...

Can we add a codepath that will correctly deserialize old versions and set the appropriate resize factor etc?

hpx7 commented 6 years ago

To repro, just serialize an empty ThetaSketch from 0.10.1 :

new SetOperationBuilder()
  .setNominalEntries(4096)
  .buildUnion()
  .toByteArray()

and then try to deserialize it:

byte[] data = ...
Sketches.wrapUnion(WritableMemory.wrap(data))
jmalkin commented 6 years ago

Right.. I know what the issue is. There isn't going to be a good conversion path without bumping the serialization version, which will require a bunch of other updates. So the question becomes what's the least bad trade off...

jmalkin commented 6 years ago

We debated options and went with the simplest approach. PR is awaiting review.

AlexanderSaydakov commented 6 years ago

We always discouraged storing sketches in updatable form (union has an updatable sketch inside), and we don't do it in our systems. As a result this bug apparently was not noticed since version 0.8.4.

@hpx7, could you describe your use case please? Why do you think you need to store serialized unions as opposed to compact sketches?

hpx7 commented 6 years ago

@AlexanderSaydakov I was following the recommendation in https://github.com/DataSketches/sketches-core/issues/148

EDIT: Ah I suppose we could/should have used the Union for the computation and then extracted and stored the CompactSketch once it was complete. I was not aware that storing sketches in updatable form was discouraged.

Also, appreciate the quick turnaround with the fix! Could we get a new release cut?

AlexanderSaydakov commented 6 years ago

sketches-core-0.10.3 has been released