apache / datasketches-java

A software library of stochastic streaming algorithms, a.k.a. sketches.
https://datasketches.apache.org
Apache License 2.0

tuning theta sketch #414

Closed patelprateek closed 1 year ago

patelprateek commented 2 years ago

I am running into issues where different sets can have very different cardinalities and the error is high, and I wanted insights on how to tune the theta parameters during my indexing phase. Are there any recommended good practices around how to do these set operations, and what kinds of operations are not suited for this sketch? I was thinking along the lines of creating a sharded sketch for columns with larger cardinality, but I am unclear on what a reasonable threshold for sizing these sets would be, and I would prefer to auto-tune these params while indexing. Any pointers or ideas are appreciated.

AlexanderSaydakov commented 2 years ago

It is not clear from your description what you are trying to do.

patelprateek commented 2 years ago

Sorry about not being clear.

I am trying to create theta sketches for my dataset for the various dimensions that are used for indexing. For certain filtering queries I need to take a set intersection to get a cardinality estimate. AFAIK theta sketches are the most optimal for the set intersection operation. The cardinalities of these sets are very dynamic and can range from about 10 to 10 million. Taking the intersection of two theta sketches where the cardinality difference is high returns very imprecise answers. Similarly, repeated intersections cause the error rate to go up. I also read in some other blogs that this is a known issue with theta sketches. I wanted to get some insights on possible workarounds, or maybe some other sketch I can look into.
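For context, a minimal sketch of the kind of Theta intersection being described, assuming a recent datasketches-java release with the org.apache.datasketches.theta package (older releases name the set-operation method update rather than intersect); the id ranges and K are made up for illustration:

```java
import org.apache.datasketches.theta.Intersection;
import org.apache.datasketches.theta.SetOperation;
import org.apache.datasketches.theta.UpdateSketch;

public class ThetaIntersectExample {
  public static void main(String[] args) {
    // Nominal entries (K) controls accuracy: larger K gives lower relative error.
    UpdateSketch a = UpdateSketch.builder().setNominalEntries(1 << 16).build();
    UpdateSketch b = UpdateSketch.builder().setNominalEntries(1 << 16).build();

    // Hypothetical document ids: A = 0..999,999; B = 990,000..1,989,999; overlap = 10,000.
    for (long id = 0; id < 1_000_000L; id++) { a.update(id); }
    for (long id = 990_000L; id < 1_990_000L; id++) { b.update(id); }

    Intersection intersection = SetOperation.builder().buildIntersection();
    intersection.intersect(a); // older releases: intersection.update(a)
    intersection.intersect(b);

    // True answer is 10,000; the error grows as the Jaccard similarity shrinks.
    double estimate = intersection.getResult().getEstimate();
    System.out.println("estimated |A \u2229 B| = " + estimate);
  }
}
```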

AlexanderSaydakov commented 2 years ago

Yes, I believe that the Theta sketch is state of the art for approximate intersections. The problem is not quite that the "cardinality difference is high". A better way to describe it is that the Jaccard similarity is very low. It can happen with the intersection of large sets with a very small overlap too. I am afraid that this is a fundamental problem with approximate set operations. You could try improving accuracy by increasing the sketch size, which sort of brings you closer to a brute-force "exact" solution. On the other hand, if the overlap of two sets is very small (orders of magnitude smaller than the sets), does it really matter that the relative accuracy is bad? Say, the intersection of two sets with a billion items each is one hundred items. Even if the answer is 100% off (say, the true answer is 50), ask yourself whether it is a problem in practice. How much would you pay to have better accuracy?

patelprateek commented 2 years ago

Thanks for the explanation. Fair enough that practically, for low Jaccard similarity, we can tolerate a high error rate. Curious whether the problem exists for the union operation as well? For example, intersection(A, B) could be computed as not(union(not(A), not(B))). Would rewriting the filter and using union help in this case? If yes, then would HLL++ be a better choice of sketch compared to theta, as it has a lower memory footprint, lower error rate, and native support for the union operation?

AlexanderSaydakov commented 2 years ago

You seem to refer to the so-called inclusion-exclusion principle: |A ∪ B| = |A| + |B| - |A ∩ B|, so |A ∩ B| = |A| + |B| - |A ∪ B|. Yes, it is possible to use the HLL sketch and estimate the size of the intersection using this equation. However, the accuracy of this approach is much worse than Theta intersection.
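For illustration, a minimal sketch of that include/exclude estimate using the datasketches-java HLL API (assuming the org.apache.datasketches.hll package; the lgK and id ranges are made up):

```java
import org.apache.datasketches.hll.HllSketch;
import org.apache.datasketches.hll.Union;

public class HllIncludeExcludeExample {
  public static void main(String[] args) {
    int lgK = 14; // 2^14 buckets, roughly 1% relative standard error
    HllSketch a = new HllSketch(lgK);
    HllSketch b = new HllSketch(lgK);

    // Hypothetical overlapping id ranges.
    for (long id = 0; id < 1_000_000L; id++) { a.update(id); }
    for (long id = 900_000L; id < 2_000_000L; id++) { b.update(id); }

    Union union = new Union(lgK);
    union.update(a);
    union.update(b);

    // |A ∩ B| ≈ |A| + |B| - |A ∪ B|; each of the three estimates carries its own error,
    // and the errors add, which is why this is less accurate than a Theta intersection.
    double intersectionEstimate =
        a.getEstimate() + b.getEstimate() - union.getResult().getEstimate();
    System.out.println("include/exclude estimate of |A ∩ B| = " + intersectionEstimate);
  }
}
```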

AlexanderSaydakov commented 2 years ago

There is a plot on the page below comparing relative error of this include-exclude estimation with Theta intersection:

https://datasketches.apache.org/docs/Theta/ThetaAccuracyPlots.html

AlexanderSaydakov commented 2 years ago

Notice the formulas there. The relative error of include/exclude is inversely proportional to the Jaccard similarity (i.e., proportional to dissimilarity). The relative error of Theta intersection is inversely proportional to the square root of the Jaccard similarity.
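In formula form (my restatement of the relationships above, with J the Jaccard similarity):

$$
\mathrm{RE}_{\text{include/exclude}} \;\propto\; \frac{1}{J(A,B)},
\qquad
\mathrm{RE}_{\Theta\,\cap} \;\propto\; \frac{1}{\sqrt{J(A,B)}}
$$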

patelprateek commented 2 years ago

Yes, inclusion-exclusion would be one way to compute the intersection. I was suggesting De Morgan's law, if we were able to store the negations of the sets as well. I thought the reason inclusion-exclusion didn't work well was that it combines several approximate components (|A|, |B|, and |A ∪ B|) and then adds and subtracts them, so the errors add up. But if we were to do just one union, and HLL++ usually works really well for unions, I was thinking that if I store an HLL for A and an HLL for ~A (not(A)), I might be able to get away with just one union, which works well for HLL?

patelprateek commented 2 years ago

Given any two sets, A and B, the intersection can be defined from the set theoretic Include/Exclude formula |A ∩ B| = |A| + |B| - |A ∪ B|. Unfortunately, for stochastic processes, each of these terms has random error components that always add.

This made me think inclusion-exclusion is probably suboptimal compared to the negation approach, which just requires one union:

intersection(A, B) = not(union(not(A), not(B)))

It is easy to compute the HLL of the negations in a stream.

AlexanderSaydakov commented 2 years ago

De Morgan's laws are about boolean algebra. I don't see how that applies here.

patelprateek commented 2 years ago

https://www.cuemath.com/data/de-morgans-law/ : De Morgan's Law of Intersection: The complement of the intersection of A and B will be equal to the union of A' and B'. This condition is called De Morgan's law of Intersection. It can be given by (A ∩ B)’ = A’ ∪ B’.

AlexanderSaydakov commented 2 years ago

I think I see what you mean. The problem is with complements. The example on that page you referred to defines a very small and known universe. How do you imagine preparing Theta or HLL sketches representing complements of A and B for your practical application?

patelprateek commented 2 years ago

For my indexing job the universe is finite: each worker works on a partition of data that is ~10-30 million documents. So computing HLL(A) and HLL(~A) can be done as the worker processes all of the unique documents within this segment universe. For example, if my segment has 10 million unique docs, HLL(~A) is 1 million, and HLL(~B) is 2 million, I can take union(~A, ~B), let's say it is 2.5 million; then intersection(A, B) would be 7.5 million, i.e. 10 million (universe) - 2.5 million (union of the complements of A, B). Does that make sense or am I missing something?

AlexanderSaydakov commented 2 years ago

computing HLL(A) and HLL(~A) can be done as the worker processes all of the unique documents within this segment

I don't understand this. How do you propose to construct a sketch of the complement exactly?

patelprateek commented 2 years ago

For example, take 10 million docs with a column, say country. To create HLL(country=US), add all document ids that have country = US; to create HLL(country≠US), add all document ids that have country ≠ US. Perhaps I am misunderstanding something?

AlexanderSaydakov commented 2 years ago

Suppose we have HLL(~A) and HLL(~B). We can compute union. How do we compute complement of the sketch representing the union?

patelprateek commented 2 years ago

I thought we don't really need to take the complement of the union, since we want to do cardinality estimation.

The cardinality of the complement would be cardinality(universe) - cardinality(union(HLL(~A), HLL(~B))).
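A minimal sketch of that idea with the datasketches-java HLL API (assuming the org.apache.datasketches.hll package; the column predicates, ids, and lgK are hypothetical):

```java
import org.apache.datasketches.hll.HllSketch;
import org.apache.datasketches.hll.Union;

public class ComplementIntersectionExample {
  public static void main(String[] args) {
    int lgK = 14;
    long universeSize = 10_000_000L;       // known, deduped segment universe
    HllSketch notA = new HllSketch(lgK);   // docs with country != US  (complement of A)
    HllSketch notB = new HllSketch(lgK);   // docs with language != EN (complement of B)

    // One pass over the segment: route each doc id into the complement sketches.
    for (long docId = 0; docId < universeSize; docId++) {
      if (!hasCountryUs(docId)) { notA.update(docId); }
      if (!hasLanguageEn(docId)) { notB.update(docId); }
    }

    Union union = new Union(lgK);
    union.update(notA);
    union.update(notB);

    // |A ∩ B| ≈ |U| - |~A ∪ ~B| (De Morgan); note this can go negative when the
    // absolute error of the large union exceeds the small true intersection.
    double estimate = universeSize - union.getResult().getEstimate();
    System.out.println("complement-based estimate of |A ∩ B| = " + estimate);
  }

  // Placeholder predicates standing in for real column lookups.
  private static boolean hasCountryUs(long docId) { return docId % 10 == 0; }
  private static boolean hasLanguageEn(long docId) { return docId % 7 == 0; }
}
```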

AlexanderSaydakov commented 2 years ago

Hmm... Perhaps this might work.

leerho commented 2 years ago

Interesting. But I am confused by your use of the word "unique" here:

So computing HLL(A) and HLL(~A) can be done as the worker processes all of the unique documents within this segment universe.

If your universe consists of all unique documents (no duplicates), then the intersection of any two disjoint subsets will always be zero! Clearly I am missing something.

patelprateek commented 2 years ago

"Intersection of any two disjoint subsets will always be zero": this is true regardless of duplication or universe size. The main idea we were exploring is how to compute the Intersection(A, B) cardinality approximately, since intersection(theta_sketch(A), theta_sketch(B)) can suffer from a high error rate when the Jaccard similarity is low. The approaches compared so far:

1) Intersection(A, B) cardinality ≈ Cardinality(HLL(A)) + Cardinality(HLL(B)) - Cardinality(Union(HLL(A), HLL(B))): this seems to perform worse, as @AlexanderSaydakov mentioned and shared pointers to some experiment results.

2) Intersection(A, B) cardinality ≈ Cardinality of universe - Cardinality(Union(HLL(~A), HLL(~B))): this can work if we are able to compute the universe cardinality and the complement cardinalities. Since unions seem to work well with HLL, I would be interested to see how the error compares with the above approaches when feasible.

leerho commented 2 years ago

So if you are computing the intersection of two overlapping sets, both drawn from the universe of unique items, you must be using other dimensions to select the two sets. Is this correct?

patelprateek commented 2 years ago

Correct. For some more context: I have a bunch of documents (on the order of millions, de-duplicated, unique document ids) and I am building an inverted index. I can use the terms (or dimensions) used to build the index to also build a small, efficient cardinality estimator for the query planner. For some queries, instead of running a full inverted-index query, I can use this data to get a cardinality estimate and decide whether to query the index or not. I don't know all the query filters in advance, but given the indexing fields I can build the corresponding sketches and then run some fast set operations to estimate the amount of data returned by a particular query filter. I could possibly use roaring bitmaps as well, but I assume a theta sketch or HLL would have a much lower memory footprint.

leerho commented 2 years ago

I think I follow. Your situation is a bit unusual in that you have a universe set that is already deduped. Given that, then your proposal might work.

However, allow me to use a "hand-wavy" analysis of error:

If A and B are small compared to the Universe then ~A and ~B will both be quite large. Suppose U is 1E7 and A and B are on the order of 1E5, and you set the K of the sketches to give you an RSE of 1%.

Now you scan all 1E7 items and create your sketches ~A and ~B, both of which will still be on the order of 1E7. The union of ~A and ~B will also have a cardinality of about 1E7 with an RSE of 1%, or an absolute error of about 100K. To obtain your intersection you subtract this union from the Universe of 1E7. The error of your union kind of swamps out the size of A and B, and it seems quite possible that the subtraction of Universe (1E7) - (~A U ~B) could produce negative numbers, depending on the probabilistic direction of the errors.
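Putting rough numbers on that (my back-of-envelope arithmetic, using the figures above):

$$
\begin{aligned}
U &= 10^7, \qquad |A| \approx |B| \approx 10^5, \qquad \mathrm{RSE} \approx 1\% \\
|\lnot A \cup \lnot B| &\approx 10^7 \quad\Rightarrow\quad \text{absolute error} \approx 0.01 \times 10^7 = 10^5 \\
\text{estimate of } |A \cap B| &= U - \text{estimate of } |\lnot A \cup \lnot B| \;\in\; \big[\,|A \cap B| - 10^5,\; |A \cap B| + 10^5\,\big]
\end{aligned}
$$

So whenever the true intersection is much smaller than about 1E5, the estimate can easily come out negative.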

Did I miss something?

AlexanderSaydakov commented 2 years ago

In other words, you replace error O(1/sqrt(Jaccard(A, B))) with, say, 1% of the Universe size. Not sure which one is better.

patelprateek commented 2 years ago

Interesting, it can possibly be negative.

AlexanderSaydakov commented 2 years ago

To be precise:

union RSE ≈ 1/sqrt(K)
intersection RSE ≈ 1/sqrt(K) * 1/sqrt(Jaccard(A,B))

But this is relative error. To get absolute error we multiply by the size of the Universe (almost, assuming the complements are not much smaller than the Universe) for the union approach, and by the size of the resulting intersection for the Theta intersection approach. For small intersections (and very low Jaccard similarity), even a horrible, say, 1000% relative error converted to absolute error can still happen to be smaller than 1% of the Universe size. Of course, this heavily depends on the actual data.

leerho commented 2 years ago

I think it might be worse. My plot on the website was computed with K = 64K sketches, which have a normal RSE of about 1/sqrt(2^16) = 1/2^8 = 0.004; double that for 95% confidence, which is 0.8%.

If the cardinality ratio is 1:100, then reading vertically from the 1E+02 point on the X-axis says that the resulting error of two sketches with that inverse Jaccard ratio will be about 10X worse than the native RSE of the sketch. So instead of 0.8% error, the resulting error of the intersection will be about 8%. Comparing that to my example above seems to indicate that just doing a regular intersection of the sketches A and B might be a great deal more accurate than your De Morgan approach. However, your mileage will vary! I would definitely try some experiments first :)

leerho commented 2 years ago

Plus, the result computed using the Theta sketch intersection will never be negative.

patelprateek commented 2 years ago

Thank you for the detailed explanation. I will try out a few experiments on my side as well.

leerho commented 2 years ago

I think there are two fundamental problems with the De Morgan approach: the absolute error of the complement union is proportional to the size of the Universe rather than to the size of the (possibly tiny) intersection, and the final subtraction from the Universe can even come out negative.

So, in effect, this De Morgan approach may not be any more accurate than the Include/Exclude approach, which has similar problems.

patelprateek commented 2 years ago

Is my understanding correct regarding the De Morgan approach we were discussing here:

leerho commented 2 years ago

A statement I made above was misleading. I said:

Thus your relative error is also proportionately larger.

I should have said "your absolute error is proportionately larger."

The relative error of these sketches is a constant and determined by the user specifying the parameter K (or Log K).
Using the example I gave above, a unique counting sketch with a Log_2 K of 16 will have a relative error of about 0.8 % with a 95% confidence. This Relative Error is constant no matter how big the sketch gets. But the Absolute Error is a function of the sketch size. Populating this sketch with 1 M items with a relative error of 0.8% means that your cardinality estimate will be in the range of 1M (1 +/- .008) or +/- 8K. Populating this sketch with 10M items means that your cardinality estimate will be 10M (1 +/- .008) or +/- 80K. Thus, the absolute error is proportional to the size of the sketch, but the relative error is constant.

Because of my mistake, you concluded:

high cardinality sets because they have high relative error

Which is not correct.

Cheers

AlexanderSaydakov commented 2 years ago

But the Absolute Error is a function of the sketch size

I think a better way to say this would be that it is a function of the size of the set that the sketch represents (its cardinality).

leerho commented 2 years ago

Agreed. :)

patelprateek commented 2 years ago

Thanks for the explanation, it makes much more sense to me now.