phstudy closed this issue 4 years ago
Which release tag were you using for this example?
@phstudy Thank you for providing sufficient information & code so that we could actually understand and replicate your issue.
Your example is not a bug; in fact, given the two sketches you provided, the results are correct.
The fundamental issue is that you are intersecting two sketches, one estimating 264M uniques and the other 21K uniques, both with lgK = 12 (K = 4096), but the intersection had only one retained item!
Intersections can produce much larger errors than normal sketching or unioning, as discussed on our website. The fundamental problem is that the intersection operation reduces your sample size, and in this case it reduced your sample size down to only one sample! So you cannot expect reasonable accuracy with only one sample. In fact, you were lucky; it could have returned an estimate of zero. Would you have been happier with that?
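To make the sample-size reduction concrete, here is a rough back-of-the-envelope sketch in plain Java (no DataSketches dependency). It assumes the standard theta-sketch behavior that an intersection keeps only hashes below the smaller of the two input thetas, and that hashes are uniformly distributed. The class and method names are hypothetical, and the inputs are the Theta and Estimate values printed in the test results further down; the estimate of sketch 2 stands in for its true cardinality.

```java
public class IntersectionSampleSize {

    /** Assumption: the intersection's theta is the smaller of the two input thetas. */
    static double intersectionTheta(double thetaA, double thetaB) {
        return Math.min(thetaA, thetaB);
    }

    /** Expected number of hashes below theta for a set of `uniques` items with uniform hashes. */
    static double expectedEntriesBelow(double uniques, double theta) {
        return uniques * theta;
    }

    public static void main(String[] args) {
        double theta1 = 1.5503161636074036E-5; // Theta (double) of sketch 1
        double theta2 = 0.19793567670940415;   // Theta (double) of sketch 2
        double est2 = 20693.591312562978;      // Estimate of sketch 2 (the smaller set)

        double theta = intersectionTheta(theta1, theta2);
        // Even if every item of sketch 2 were also in sketch 1, only this many
        // of its hashes are expected to survive below the intersection theta.
        double expected = expectedEntriesBelow(est2, theta);

        System.out.printf("intersection theta          = %.6e%n", theta);
        System.out.printf("expected retained (at most) = %.3f%n", expected);
    }
}
```

Under these assumptions the expected number of retained entries is well below one (about 0.32), so ending up with a single retained entry, or none at all, is what one would expect from these two inputs.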
When doing intersections it is always a good idea to print out the upper and lower bounds along with the estimate. If you had done that, you would have discovered that the 95% confidence interval was (LB, Est, UB) = {1484, 64503, 366548}, which is huge!
The sketch is telling you that the true value of the intersection is somewhere between 1484 and 366,548! That is a clue that the sketch is not very confident of the result!
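One way to see how wide that interval is compared with the two input sketches is to express each bound as a fraction of the estimate. The following is a minimal plain-Java sketch; the class and method names are hypothetical, and the numbers are copied from the summaries in the test results. With the library itself, the same three values should be available from `getLowerBound(2)`, `getEstimate()` and `getUpperBound(2)` on a theta `Sketch`, where 2 standard deviations corresponds to roughly 95% confidence.

```java
public class BoundsReport {

    /** Returns {lower, upper} offsets relative to the estimate, e.g. {-0.03, +0.03}. */
    static double[] relativeBounds(double lb, double est, double ub) {
        return new double[] { (lb - est) / est, (ub - est) / est };
    }

    static void report(String name, double lb, double est, double ub) {
        double[] r = relativeBounds(lb, est, ub);
        System.out.printf("%-12s est=%.1f  lower=%+.1f%%  upper=%+.1f%%%n",
            name, est, 100 * r[0], 100 * r[1]);
    }

    public static void main(String[] args) {
        // Values taken from the three sketch summaries in the test output.
        report("sketch1", 2.5604410472331813E8, 2.6420417306809786E8, 2.726232570287611E8);
        report("sketch2", 20119.498939949455, 20693.591312562978, 21283.962960450415);
        report("intersection", 1484.0, 64502.97194045358, 366548.2818845309);
    }
}
```

The two input sketches stay within a few percent of their estimates, while the intersection's upper bound is several times its estimate. A simple threshold on these relative offsets is one possible way to flag results that should not be trusted.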
From the sketch you can also print out other information that tells you a great deal about what is going on inside the sketch. I added these extra print statements to your code (PhstudyTest below) so that you can learn to use these tools. The output from the modified test (PhstudyTestResults below) reveals a great deal about what your example is doing. Of course, the first really big clue is the line:
Retained Entries : 1
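The estimates in the summaries can be cross-checked from just two of the printed fields. This is a small plain-Java sketch assuming the usual theta-sketch estimator, estimate = retained entries / theta; the library's internal arithmetic may differ in the last few digits, and the class name is hypothetical.

```java
public class ThetaEstimateCheck {

    /** Assumed estimator: retained entries scaled up by the sampling fraction theta. */
    static double estimate(int retainedEntries, double theta) {
        return retainedEntries / theta;
    }

    public static void main(String[] args) {
        // (Retained Entries, Theta (double)) pairs from the three summaries.
        System.out.println("sketch1      : " + estimate(4096, 1.5503161636074036E-5));
        System.out.println("sketch2      : " + estimate(4096, 0.19793567670940415));
        System.out.println("intersection : " + estimate(1, 1.5503161636074036E-5));
    }
}
```

With a single retained entry, the intersection estimate is simply 1 / theta, so the whole result rests on whether one particular hash happened to fall below theta.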
From both the preamble output as well as the sketch.toString() output, I could see that all of the sketches and the intersection are behaving quite normally. (Note: I did not include the long Base64 strings, since you already have them.)
Cheers, Lee.
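// Assumption: this test lives in package org.apache.datasketches.theta, so that
// PreambleUtil, Sketch, Intersection and SetOperation resolve without imports.
import static org.apache.datasketches.Util.DEFAULT_UPDATE_SEED;

import java.util.Base64;

import org.apache.datasketches.memory.Memory;
import org.testng.annotations.Test;

public class PhstudyTest {

  @SuppressWarnings("javadoc")
  @Test
  public void checkPhstudy() {
    // Sketch 1: decode, wrap, then print the preamble and the sketch summary.
    byte[] sketch1Arr = Base64.getDecoder().decode("<sketch1Base64>");
    // Returns the preamble summary as a String; the result is not used here.
    PreambleUtil.preambleToString(sketch1Arr);
    final Memory serializedSketch = Memory.wrap(sketch1Arr);
    Sketch sketch1 = Sketch.wrap(serializedSketch, DEFAULT_UPDATE_SEED);
    println(Sketch.toString(sketch1Arr)); // preamble summary
    println(sketch1.toString());          // sketch summary

    // Sketch 2: same steps.
    byte[] sketch2Arr = Base64.getDecoder().decode("<sketch2Base64>");
    final Memory serializedSketch2 = Memory.wrap(sketch2Arr);
    Sketch sketch2 = Sketch.wrap(serializedSketch2, DEFAULT_UPDATE_SEED);
    println(Sketch.toString(sketch2Arr));
    println(sketch2.toString());

    // Intersect the two sketches and print the summary of the result.
    Intersection inter = SetOperation.builder().buildIntersection();
    Sketch intSketch = inter.intersect(sketch1, sketch2);
    println(intSketch.toString());
  }

  static void println(Object o) { System.out.println(o.toString()); }
}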
PhstudyTest Results:
### SKETCH PREAMBLE SUMMARY:
Byte 0: Preamble Longs : 3
Byte 0: ResizeFactor : X1
Byte 1: Serialization Version: 3
Byte 2: Family : COMPACT
Byte 3: LgNomLongs : 0
Byte 4: LgArrLongs : 0
Byte 5: Flags Field : 00011010, 26
(Native Byte Order) : LITTLE_ENDIAN
BIG_ENDIAN_STORAGE : false
READ_ONLY : true
EMPTY : false
COMPACT : true
ORDERED : true
SINGLEITEM (derived) : false
Bytes 6-7 : Seed Hash : 93cc
Bytes 8-11 : CurrentCount : 4096
Bytes 12-15: P : 0.0
Bytes 16-23: Theta (double) : 1.5503161636074036E-5
Theta (long) : 142991427517005
Theta (long,hex) : 0000820cc93e3a4d
Preamble Bytes : 24
Data Bytes : 32768
TOTAL Sketch Bytes : 32792
### END SKETCH PREAMBLE SUMMARY
### DirectCompactOrderedSketch SUMMARY:
Estimate : 2.6420417306809786E8
Upper Bound, 95% conf : 2.726232570287611E8
Lower Bound, 95% conf : 2.5604410472331813E8
Theta (double) : 1.5503161636074036E-5
Theta (long) : 142991427517005
Theta (long) hex : 0000820cc93e3a4d
EstMode? : true
Empty? : false
Retained Entries : 4096
Seed Hash : 93cc | 37836
### END SKETCH SUMMARY
### SKETCH PREAMBLE SUMMARY:
Byte 0: Preamble Longs : 3
Byte 0: ResizeFactor : X1
Byte 1: Serialization Version: 3
Byte 2: Family : COMPACT
Byte 3: LgNomLongs : 0
Byte 4: LgArrLongs : 0
Byte 5: Flags Field : 00011010, 26
(Native Byte Order) : LITTLE_ENDIAN
BIG_ENDIAN_STORAGE : false
READ_ONLY : true
EMPTY : false
COMPACT : true
ORDERED : true
SINGLEITEM (derived) : false
Bytes 6-7 : Seed Hash : 93cc
Bytes 8-11 : CurrentCount : 4096
Bytes 12-15: P : 1.0
Bytes 16-23: Theta (double) : 0.19793567670940415
Theta (long) : 1825634385657445494
Theta (long,hex) : 1955f4cd16d9bc76
Preamble Bytes : 24
Data Bytes : 32768
TOTAL Sketch Bytes : 32792
### END SKETCH PREAMBLE SUMMARY
### DirectCompactOrderedSketch SUMMARY:
Estimate : 20693.591312562978
Upper Bound, 95% conf : 21283.962960450415
Lower Bound, 95% conf : 20119.498939949455
Theta (double) : 0.19793567670940415
Theta (long) : 1825634385657445494
Theta (long) hex : 1955f4cd16d9bc76
EstMode? : true
Empty? : false
Retained Entries : 4096
Seed Hash : 93cc | 37836
### END SKETCH SUMMARY
### HeapCompactOrderedSketch SUMMARY:
Estimate : 64502.97194045358
Upper Bound, 95% conf : 366548.2818845309
Lower Bound, 95% conf : 1484.0
Theta (double) : 1.5503161636074036E-5
Theta (long) : 142991427517005
Theta (long) hex : 0000820cc93e3a4d
EstMode? : true
Empty? : false
Retained Entries : 1
Seed Hash : 93cc | 37836
### END SKETCH SUMMARY
@jmalkin I am using 1.3.0-incubating.
@leerho Thanks for your detailed explanation and for pointing out the documentation. I will try to improve the result by increasing K and by using the LB and UB.