apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Add support for an aggregation function returning serialized hyperloglogs #4293

Closed ianvkoeppe closed 4 years ago

ianvkoeppe commented 4 years ago


Client-side batching is a common requirement for reducing the query and response sizes to and from Pinot. In doing so, clients may then need to perform their own aggregation of the results from the multiple responses returned. This is easy for sum, min, max, etc., but numerical values representing uniques cannot simply be added.

Supporting DISTINCTCOUNTRAWHLL and DISTINCTCOUNTRAWHLLMV will allow clients to receive the serialized HLLs and then aggregate them client-side.
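For illustration, a minimal sketch of the client-side aggregation this enables, assuming the serialized values in each response have already been decoded to raw sketch bytes. The HyperLogLog calls are from stream-lib, the library underlying Pinot's HLL support; everything around them is hypothetical plumbing:

```java
import com.clearspring.analytics.stream.cardinality.CardinalityMergeException;
import com.clearspring.analytics.stream.cardinality.HyperLogLog;

import java.io.IOException;
import java.util.List;

public class ClientSideHllMerge {
  // Union the serialized HLLs returned by several DISTINCTCOUNTRAWHLL
  // queries and read off the overall distinct count.
  public static long mergedCardinality(List<byte[]> serializedHlls)
      throws IOException, CardinalityMergeException {
    HyperLogLog merged = null;
    for (byte[] bytes : serializedHlls) {
      HyperLogLog hll = HyperLogLog.Builder.build(bytes); // deserialize one response
      if (merged == null) {
        merged = hll;
      } else {
        merged.addAll(hll); // in-place union of the two sketches
      }
    }
    return merged == null ? 0 : merged.cardinality();
  }
}
```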

codecov-io commented 4 years ago

Codecov Report

Merging #4293 into master will increase coverage by 21.33%. The diff coverage is 92.68%.


@@              Coverage Diff              @@
##             master    #4293       +/-   ##
=============================================
+ Coverage     45.95%   67.28%   +21.33%     
- Complexity        0       20       +20     
=============================================
  Files          1040     1043        +3     
  Lines         51662    51703       +41     
  Branches       7238     7238               
=============================================
+ Hits          23739    34787    +11048     
+ Misses        25892    14544    -11348     
- Partials       2031     2372      +341
Impacted Files Coverage Δ Complexity Δ
...t/core/data/aggregator/ValueAggregatorFactory.java 75% <ø> (+25%) 0 <0> (ø)
...gregation/function/AggregationFunctionFactory.java 85.1% <100%> (+65.1%) 0 <0> (ø)
...gregation/function/customobject/SerializedHLL.java 100% <100%> (ø) 0 <0> (?)
...tion/DistinctCountRawHLLMVAggregationFunction.java 100% <100%> (ø) 0 <0> (?)
.../aggregation/function/AggregationFunctionType.java 96.22% <100%> (+25.63%) 0 <0> (ø)
...nction/DistinctCountRawHLLAggregationFunction.java 85% <85%> (ø) 0 <0> (?)
...a/manager/realtime/RealtimeSegmentDataManager.java 75% <0%> (-25%) 0% <0%> (ø)
...e/impl/dictionary/LongOnHeapMutableDictionary.java 82.6% <0%> (-6.53%) 0% <0%> (ø)
...e/operator/dociditerators/SortedDocIdIterator.java 55.55% <0%> (-5.56%) 0% <0%> (ø)
...impl/dictionary/FloatOffHeapMutableDictionary.java 89.28% <0%> (-5.36%) 0% <0%> (ø)
... and 535 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update f608b38...520a0e4.

sunithabeeram commented 4 years ago

Thanks for the contribution @ianvkoeppe. This introduces a high-level aggregation/keyword, so it will be good to get input from @kishoreg on this.

Also, just wondering if there are other ways of getting this info. One possibility: if you already store the serialized HLL as a separate column, you might be able to get it via a "select ... limit 1" on the column; I haven't played with serialized-HLL columns before, so I can't say for sure whether this is valid - will check on it.

ianvkoeppe commented 4 years ago

@sunithabeeram, thank you for the quick reply. I'm happy to speak more with @kishoreg about our use case.

Interestingly, we are already planning to use serialized HLLs, but we still want Pinot to do the aggregation across multiple segments before returning the serialized HLL.

A more concrete example: given a Pinot table with the columns Unique Page Views [SerializedHLL] and PageId [Int]...

...I want to get the total unique page views for 10,000 pages. I can't supply those 10,000 page ids in a single query because the URI length would be exceeded; ~especially when ids are represented as 40+ character urns~ [Edit: Remembered we don't store ids as urns in Pinot, but the number of ids and the query length are still an issue]. So I batch calls by 1,000 pages and get 10 responses.

Today, I would get 10 responses with the unique page views as longs, and I can't logically add them. With this change, I'm hoping I can get 10 serialized HLLs back and use HllUtil client-side to deserialize, aggregate, and then find the cardinality.
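Roughly, the client-side flow would be the sketch below. The query helper is hypothetical and stands in for whatever Pinot client we use; the merge calls are real stream-lib APIs:

```java
import com.clearspring.analytics.stream.cardinality.CardinalityMergeException;
import com.clearspring.analytics.stream.cardinality.HyperLogLog;

import java.io.IOException;
import java.util.List;

public class BatchedUniquePageViews {

  // Hypothetical helper: issues one
  //   SELECT DISTINCTCOUNTRAWHLL(...) FROM T WHERE pageId IN (<batch>)
  // query and returns the serialized HLL bytes from the response.
  static byte[] queryRawHll(List<Long> pageIdBatch) {
    throw new UnsupportedOperationException("depends on the Pinot client in use");
  }

  // Batch 10,000 page ids into queries of 1,000 ids each, then merge the
  // 10 returned sketches into one and read off the total cardinality.
  static long totalUniquePageViews(List<Long> allPageIds)
      throws IOException, CardinalityMergeException {
    HyperLogLog merged = null;
    for (int i = 0; i < allPageIds.size(); i += 1000) {
      List<Long> batch = allPageIds.subList(i, Math.min(i + 1000, allPageIds.size()));
      HyperLogLog hll = HyperLogLog.Builder.build(queryRawHll(batch));
      if (merged == null) {
        merged = hll;
      } else {
        merged.addAll(hll);
      }
    }
    return merged == null ? 0 : merged.cardinality();
  }
}
```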

sunithabeeram commented 4 years ago

@ianvkoeppe sounds good. Will let @kishoreg/@Jackie-Jiang chime in on this.

ianvkoeppe commented 4 years ago

@kishoreg and @Jackie-Jiang, please let me know if there is any additional information you need on your end.

kishoreg commented 4 years ago

What is the problem with passing 10k page ids in the query? I understand it will be long, but can we not use POST instead of GET?

Also, are the 10k page ids constant across queries? How did we select them?

ianvkoeppe commented 4 years ago

Thank you for your questions @kishoreg.

What is the problem with passing 10k page ids in the query? I understand it will be long, but can we not use POST instead of GET?

This is certainly an option, but it only fixes half the problem. Limiting the number of ids in the request can be just as much about limiting the response size as it is about limiting the request size. For our use case, there can be many demographic dimensions which match a given filter.

Also, are the 10k page ids constant across queries? How did we select them?

They aren't necessarily constant; the 10K ids figure was simply meant as a generalized example of our specific use case.

kishoreg commented 4 years ago

This is certainly an option, but it only fixes half the problem. Limiting the number of ids in the request can be just as much about limiting the response size as it is about limiting the request size. For our use case, there can be many demographic dimensions which match a given filter.

I didn't follow this. If we use POST, the query can be of any size, right?

Overall, I think this is a good feature to have. However, I am unable to justify that client-side batching is more efficient in terms of network (bytes moved) and compute (the total time to process the query) than passing all the page ids within the query.

ianvkoeppe commented 4 years ago

I think I can provide a more concrete example of what I meant, but there are other reasonable use cases which may be more persuasive.

Use Case 1: Response Size Too Large.

In the case of having 10,000 IDs, let's say we group by a person's job title. There are many job titles, so for 10,000 unique IDs we could have a very large response whose size exceeds the amount of data that can be returned. We can improve performance by querying smaller groups of data: batch the IDs, then aggregate the matching records client-side. (In a way, this is working around not having the ability to page results when using aggregation functions.)

Use Case 2: Cross-Column or Cross-Table Aggregations.

Say I have two tables: one tracks page views by member, and another tracks ad clicks by member. If I want to see the unique members who visited my website and also clicked on my ad, I cannot query both and add their raw numbers. Assuming I'm using the same member id in both HLLs, I could merge the raw HLL responses from Pinot to achieve this.
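For illustration, a sketch of combining the two raw HLL responses once deserialized. One caveat: merging HLLs estimates the union (visited or clicked), so the visited-and-clicked count below comes from inclusion-exclusion; that step is my assumption about the client-side math, not something specified in this PR:

```java
import com.clearspring.analytics.stream.cardinality.CardinalityMergeException;
import com.clearspring.analytics.stream.cardinality.HyperLogLog;

public class CrossTableUniques {
  // Estimate |A ∩ B| = |A| + |B| - |A ∪ B| from two member-id sketches,
  // one built from the page-views table and one from the ad-clicks table.
  static long visitedAndClicked(HyperLogLog pageViewMembers, HyperLogLog adClickMembers)
      throws CardinalityMergeException {
    long visited = pageViewMembers.cardinality();
    long clicked = adClickMembers.cardinality();
    // merge() returns a new sketch; neither input is modified.
    long visitedOrClicked = pageViewMembers.merge(adClickMembers).cardinality();
    return visited + clicked - visitedOrClicked;
  }
}
```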

kishoreg commented 4 years ago

Use case 2 makes sense. Use case 1 also makes sense, but that can be achieved today without this change, right?

From the original description, I thought your requirement was something like the following:

select DISTINCTCOUNTHLL(memberId) from T where pageId in (... 10,000 IDs ...), and you wanted to solve this by running multiple (10 in this case) queries of the form select DISTINCTCOUNTRAWHLL(memberId) from T where pageId in (... 1,000 IDs ...), and you would then aggregate the raw HLLs on the client side to get the actual distinct count across all IDs.

Am I missing something? Just to clarify, I think this is a valid feature request for Pinot, and I will review it shortly. But I wanted to get some clarity on the scenarios where this can be used. Use case 2 is one such scenario.

ianvkoeppe commented 4 years ago

From the original description, I thought your requirement was something like the following: select DISTINCTCOUNTHLL(memberId) from T where pageId in (... 10,000 IDs ...), and you wanted to solve this by running multiple (10 in this case) queries of the form select DISTINCTCOUNTRAWHLL(memberId) from T where pageId in (... 1,000 IDs ...)

The only thing missing is that we might also have a dimension to group by: SELECT DISTINCTCOUNTRAWHLL(precomputedHLLColumn) FROM T WHERE pageId IN (1000 IDs) GROUP BY JOB_TITLE. Depending on the characteristics of the users visiting the pages, this could be tens of thousands of job titles. But as I also mentioned, we have 50+ metrics in the table, so a request would look more like SELECT SUM(col1), SUM(col2), ..., DISTINCTCOUNTRAWHLL(...). So the combination of a large table with many IDs and a large dimension (job titles) means the response size could be arbitrarily large unless we make the filter stricter by batching IDs.

Just to clarify, I think this is a valid feature request to have in Pinot and I will review it shortly.

Cool, glad to explain our use case more thoroughly if it helps improve documentation or solutions for future devs. I'm also interested in whether you think an existing solution works for the first use case.

mcvsubbu commented 4 years ago

@ianvkoeppe once you know the full spec of this feature going in, please add documentation in docs/pql_examples.rst, and include it in this PR.

Thanks.

ianvkoeppe commented 4 years ago

@mcvsubbu, I added a little more color to pql_examples.rst where I had mentioned the new functions. That is the only place I see HLLs mentioned, so it is the only place I updated.

ianvkoeppe commented 4 years ago

Great idea, @mcvsubbu. I added additional clarification that the HLL is represented as a string, and pointed to examples and use cases for handling the response.

kishoreg commented 4 years ago

The code looks good to me. I think it's better to return the byte[] as a hex string instead of a string. HllUtil has a toBytes method; HllUtil.toString has some additional overhead that can be avoided. This will allow you to use the HyperLogLog library directly to parse the byte[] into a HyperLogLog.
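A sketch of what consuming the hex-encoded byte[] could then look like on the client. HyperLogLog.Builder.build(byte[]) is the real stream-lib entry point; the hex decoder is a hand-rolled stand-in for whatever codec the client already uses:

```java
import com.clearspring.analytics.stream.cardinality.HyperLogLog;

import java.io.IOException;

public class RawHllResponseParsing {
  // Decode the hex string returned by DISTINCTCOUNTRAWHLL back into bytes.
  static byte[] hexToBytes(String hex) {
    byte[] out = new byte[hex.length() / 2];
    for (int i = 0; i < out.length; i++) {
      out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
    }
    return out;
  }

  // Parse the bytes directly into a stream-lib HyperLogLog.
  static long cardinalityFromHex(String hexHll) throws IOException {
    return HyperLogLog.Builder.build(hexToBytes(hexHll)).cardinality();
  }
}
```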

Coming back to the original problem: SELECT func(m1), func(m2), ... FROM T WHERE pageId IN (<1000+ values>) GROUP BY job_title (high cardinality).

There are 4 possible options.

  1. No batching, just get all the results in one shot. Works if the cardinality of job_title < 100k
  2. Batch by pageId
  3. Batch by job_title
  4. Batch by pageId and job_title (nested loop).

If possible, always pick option 1. After that, batching by job_title is the right solution, since the responses are mutually exclusive and the client can simply stitch them together without additional processing.
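To make the stitching point concrete, a trivial sketch; each per-batch group-by response is modeled as a hypothetical map from group value to aggregate:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GroupByStitching {
  // Batching by the group-by column (option 3) yields mutually exclusive
  // result sets, so combining them is a plain putAll; no re-aggregation needed.
  static Map<String, Long> stitch(List<Map<String, Long>> perBatchResults) {
    Map<String, Long> combined = new HashMap<>();
    for (Map<String, Long> batch : perBatchResults) {
      combined.putAll(batch); // job titles are disjoint across batches
    }
    return combined;
  }
}
```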

But what you are suggesting is option 2 - batching by pageId. I am not sure why this would be better unless there is some relationship between pageId and jobTitle such that restricting pageId automatically limits jobTitle.

Does this line of reasoning make sense?

ianvkoeppe commented 4 years ago

@kishoreg, thanks for your feedback.

I think it's better to return the byte[] as a hex string instead of a string. HllUtil has a toBytes method; HllUtil.toString has some additional overhead that can be avoided.

Awesome. I've converted it to use a hex-string representation and updated the tests accordingly.

But what you are suggesting is option 2 - batching by pageId. I am not sure why this would be better unless there is some relationship between pageId and jobTitle such that restricting pageId automatically limits jobTitle. Does this line of reasoning make sense?

Absolutely, we are definitely in agreement overall. I think not batching makes sense whenever possible. My example above may not exactly make the case for needing to batch and merge large responses, though I still think one could exist. In any case, being able to merge across columns or tables which have HLLs keyed the same way is a more straightforward use case and justification.

ianvkoeppe commented 4 years ago

Anything else you need on your end @kishoreg?

ianvkoeppe commented 4 years ago

@kishoreg and @mcvsubbu, is there anything else you need to be able to sign off on this change? We want to leverage this in the near future.

kishoreg commented 4 years ago

It looks good to me. Let's wait for another approval from @mcvsubbu or @Jackie-Jiang.