Closed: ianvkoeppe closed this pull request 4 years ago
Merging #4293 into master will increase coverage by 21.33%. The diff coverage is 92.68%.
@@ Coverage Diff @@
## master #4293 +/- ##
=============================================
+ Coverage 45.95% 67.28% +21.33%
- Complexity 0 20 +20
=============================================
Files 1040 1043 +3
Lines 51662 51703 +41
Branches 7238 7238
=============================================
+ Hits 23739 34787 +11048
+ Misses 25892 14544 -11348
- Partials 2031 2372 +341
Impacted Files | Coverage Δ | Complexity Δ
---|---|---
...t/core/data/aggregator/ValueAggregatorFactory.java | 75% <ø> (+25%) | 0 <0> (ø)
...gregation/function/AggregationFunctionFactory.java | 85.1% <100%> (+65.1%) | 0 <0> (ø)
...gregation/function/customobject/SerializedHLL.java | 100% <100%> (ø) | 0 <0> (?)
...tion/DistinctCountRawHLLMVAggregationFunction.java | 100% <100%> (ø) | 0 <0> (?)
.../aggregation/function/AggregationFunctionType.java | 96.22% <100%> (+25.63%) | 0 <0> (ø)
...nction/DistinctCountRawHLLAggregationFunction.java | 85% <85%> (ø) | 0 <0> (?)
...a/manager/realtime/RealtimeSegmentDataManager.java | 75% <0%> (-25%) | 0% <0%> (ø)
...e/impl/dictionary/LongOnHeapMutableDictionary.java | 82.6% <0%> (-6.53%) | 0% <0%> (ø)
...e/operator/dociditerators/SortedDocIdIterator.java | 55.55% <0%> (-5.56%) | 0% <0%> (ø)
...impl/dictionary/FloatOffHeapMutableDictionary.java | 89.28% <0%> (-5.36%) | 0% <0%> (ø)
... and 535 more | |
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f608b38...520a0e4. Read the comment docs.
Thanks for the contribution @ianvkoeppe. This introduces a high-level aggregation keyword, so it will be good to get input from @kishoreg on this.
Also, just wondering if there are other ways of getting this info. One possibility is that if you store the serializedHLL as a separate column already, you might be able to get it via a "select ... limit 1" on the column; I haven't played with serialized-hll columns before, so can't say for sure if this is valid - will check on it.
@sunithabeeram, thank you for the quick reply. I'm happy to speak more with @kishoreg about our use case.
Interestingly, we already are planning to use serialized HLLs, but still want Pinot to do the aggregation across multiple segments before returning the serialized HLL.
A more concrete example would be: given a Pinot table which has... Unique Page Views [SerializedHLL], PageId [Int]
...and I want to get the total unique page views for 10,000 pages. I can't supply those 10,000 page ids in a single query because URI length will be exceeded; ~especially when ids are represented as 40+ character urns~ [Edit: Remembered we don't store ids as urns in pinot, but number of ids and query length is still an issue]. So I batch calls by 1,000 pages and get 10 responses.
Today, I would have 10 responses with unique page views as longs. I can't logically add them. With this change, I'm hoping I can get 10 serialized HLLs back, and use HllUtil client-side to deserialize, aggregate, and then find the cardinality.
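The client-side step described here (deserialize, merge, then count) works because an HLL union reduces to a register-wise max. A real client would deserialize the response with the HyperLogLog library and call its merge/addAll; the stdlib-only sketch below only illustrates the register-wise max on plain byte arrays, and its names and flat register layout are hypothetical, not Pinot's actual serialization format.

```java
import java.util.Arrays;

public class HllMergeSketch {
    // Illustrative only: a dense HLL's union is the element-wise max of the
    // two sketches' register arrays. Real clients should deserialize the
    // response bytes with the HyperLogLog library and merge there instead.
    static byte[] mergeRegisters(byte[] a, byte[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("HLLs must use the same log2m");
        }
        byte[] merged = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            merged[i] = (byte) Math.max(a[i], b[i]);
        }
        return merged;
    }

    public static void main(String[] args) {
        byte[] first = {3, 0, 5, 1};
        byte[] second = {1, 4, 2, 1};
        System.out.println(Arrays.toString(mergeRegisters(first, second))); // [3, 4, 5, 1]
    }
}
```

Because max is associative and commutative, the 10 batched responses can be folded together in any order before computing the final cardinality.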
@ianvkoeppe sounds good. Will let @kishoreg/@Jackie-Jiang chime in on this.
@kishoreg and @Jackie-Jiang, please let me know if there is any additional information you need on your end.
What is the problem with passing 10k page ids in the query? I understand it will be long, but can we not use POST instead of GET?
Also, are the 10k page ids constant across queries? How were they selected?
Thank you for your questions @kishoreg.
> What is the problem with passing 10k page ids in the query? I understand it will be long, but can we not use POST instead of GET?
This is certainly an option but only fixes half the problem. Limiting the number of ids in the request can be just as much about limiting the response size as it is about limiting the request size. For our use case, there can be many demographic dimensions which match a given filter.
> Also, are the 10k page ids constant across queries? How were they selected?
They aren't necessarily constant; the 10K ids was simply meant as a generalized example of our specific use case.
> This is certainly an option but only fixes half the problem. Limiting the number of ids in the request can be just as much about limiting the response size as it is about limiting the request size. For our use case, there can be many demographic dimensions which match a given filter.
I didn't follow this. If we use POST, the query can be of any size right?
Overall, I think this is a good feature to have. However, I am unable to justify that client-side batching is more efficient in terms of network (bytes moved) and compute (the total time to process the query) than passing all the page ids within the query.
I think I can provide a more concrete example of what I meant, but there are other reasonable use cases which may be more persuasive.
Use Case 1: Response Size too Large.
In the case of having 10,000 IDs, let's say we group by a person's job title. There are many job titles, so for 10,000 unique IDs we could have a very large response size which exceeds the amount of data that can be returned. We can improve performance by querying smaller groups of data, batching the IDs and then aggregating matching records client-side. (In a way, this is working around not having the ability to page results when using aggregation functions.)
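The batching pattern described in Use Case 1 might be sketched as follows. Everything here is a hypothetical stand-in (aggregateInBatches and queryFn are not Pinot APIs): the client splits the ID list, issues one group-by query per batch, and merges the per-group values across responses. The merge-by-sum shown is exactly what works for SUM-like metrics but breaks for distinct counts, which is the motivation for returning the raw HLL instead.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class BatchedQueryClient {
    // Hypothetical sketch: split a large ID list into batches, run one
    // group-by query per batch (queryFn stands in for the Pinot call), and
    // sum the per-group values across the batched responses client-side.
    // Summing is only valid for additive metrics; distinct counts need the
    // raw-HLL merge from the PR instead.
    static Map<String, Long> aggregateInBatches(List<Integer> ids, int batchSize,
            Function<List<Integer>, Map<String, Long>> queryFn) {
        Map<String, Long> totals = new HashMap<>();
        for (int start = 0; start < ids.size(); start += batchSize) {
            List<Integer> batch =
                new ArrayList<>(ids.subList(start, Math.min(start + batchSize, ids.size())));
            queryFn.apply(batch).forEach((group, value) -> totals.merge(group, value, Long::sum));
        }
        return totals;
    }
}
```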
Use Case 2: Cross Column or Cross Table Aggregations.
Say I have two tables: one tracks page views by member, and another tracks ad clicks by member. If I want to see unique members who visited my website and also clicked on my ad, I would not be able to query both and add their raw numbers. Assuming I'm using the same member id in both HLLs, I could merge the raw HLL responses from Pinot to achieve this.
Use case 2 makes sense. Use case 1 also makes sense, but that can be achieved today without this change, right?
From the original description, I thought your requirement was something as follows:
select distinctCountHLL(memberId) from T where pageId in (...... 10,000 IDs)
and you wanted to solve this by running multiple (10 in this case) queries of the form
select DISTINCTCOUNTRAWHLL(memberId) from T where pageId in (1000 IDs ...)
and you would then aggregate the raw HLL on the client side to get the actual distinctCount across all IDs.
Am I missing something? Just to clarify, I think this is a valid feature request to have in Pinot and I will review it shortly. But wanted to get some clarity on the scenarios where this can be used. Use case 2 is one such use case.
> From the original description, I thought your requirement was something as follows: select distinctCountHLL(memberId) from T where pageId in (...... 10,000 IDs) and you wanted to solve this by running multiple (10 in this case) queries of the form select DISTINCTCOUNTRAWHLL(memberId) from T where pageId in (1000 IDs ...)
The only thing missing is that we might also have a dimension to group by:
SELECT DISTINCTCOUNTRAWHLL(precomputedHLLColumn) FROM T WHERE pageId in (1000 IDS) GROUP BY JOB_TITLE
Depending on the characteristics of the users visiting the pages, this could be tens of thousands of job titles. As I also mentioned, we have 50+ metrics in the table, so a request would be more like SELECT SUM(col1), SUM(col2), ..., DISTINCTCOUNTRAWHLL(...). So the combination of a large table with many IDs and a large dimension (job titles) means the response size could be arbitrarily large unless we make the filter more strict by batching IDs.
> Just to clarify, I think this is a valid feature request to have in Pinot and I will review it shortly.
Cool, glad to help explain our use case more thoroughly if we can help improve documentation or solutions for future devs. Also, interested if you think an existing solution works for this 1st use case.
@ianvkoeppe once you know the full spec of this feature going in, please add documentation in docs/pql_examples.rst, and include it in this PR.
Thanks.
@mcvsubbu, I added a little more color to the pql_examples.rst where I had mentioned the new functions. These are the only places I see HLLs mentioned, so it is the only place I updated.
Great idea @mcvsubbu. I added additional clarification on the HLL being represented as string and point to examples and use cases for handling the response.
The code looks good to me. I think it's better to return the byte[] as a hex string instead of a string. HllUtil has a toBytes method. The HllUtil.toString method has some additional overhead that can be avoided. This will allow you to use the HyperLogLog library directly to parse the byte[] into a HyperLogLog.
Coming back to the original problem.
SELECT func(m1), func(m2)..... FROM T WHERE pageId in <1000+ values> GROUP BY job_title (high cardinality)
There are 4 possible options.
If possible always pick 1. After that, batching by job_title is the right solution since each response is mutually exclusive and the client can simply stitch the responses together without additional processing.
But what you are suggesting is solution 1 - batch by pageId. I am not sure why this will be better unless there is some relationship between pageId and jobTitle such that restricting pageId will automatically limit jobTitle.
Does this line of reasoning make sense?
@kishoreg, thanks for your feedback.
> I think it's better to return the byte[] as a hex string instead of a string. HllUtil has a toBytes method. The HllUtil.toString method has some additional overhead that can be avoided.
Awesome. I've converted it to use a HexString representation, and updated the tests accordingly.
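For reference, decoding a hex string back to the byte[] that the HyperLogLog library expects needs only the JDK. This is an illustrative sketch: fromHexString is a hypothetical helper, not part of Pinot, and on Java 17+ one could use java.util.HexFormat instead.

```java
import java.util.Arrays;

public class HexDecode {
    // The PR returns the serialized HLL bytes as a hex string; a client
    // decodes it back to byte[] before handing it to the HyperLogLog
    // library. Plain-JDK decoding sketch (hypothetical helper name).
    static byte[] fromHexString(String hex) {
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(fromHexString("0aff"))); // [10, -1]
    }
}
```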
> But what you are suggesting is solution 1 - batch by pageId. I am not sure why this will be better unless there is some relationship between pageId and jobTitle such that restricting pageId will automatically limit jobTitle. Does this line of reasoning make sense?
Absolutely, we are definitely in agreement overall. I think not batching makes sense in any case possible. My example above may not exactly outline the use case for needing to batch and merge large responses. I still think it could exist. In any case, being able to merge across columns or tables which have HLLs and use the same key for the HLL is a more straightforward use case and justification.
Anything else you need on your end @kishoreg?
@kishoreg and @mcvsubbu, is there anything else you need to be able to sign off on this change? We are wanting to leverage this in the near future.
It looks good to me. Let's wait for another approval from @mcvsubbu or @Jackie-Jiang
…logs.
Client-side batching is a common requirement for reducing the query and response sizes to and from Pinot. In doing so, clients may then need to perform their own aggregation of the results from the multiple responses returned. This is easy for sum, min, max, etc., but numerical values representing uniques cannot simply be added.
Supporting DISTINCTCOUNTRAWHLL and DISTINCTCOUNTRAWHLLMV will allow clients to receive the serialized HLLs and then aggregate them client-side.