edongashi closed 2 years ago
This made sense initially, but after thinking about it for a while, I don't think it is the correct approach to the problem. Each histogram bar is a separate bucket, which we should treat like regular buckets. That means we gather all associated AIDVs and use them together when applying the LCF.
Same thing for the count(distinct AID) function: I am not convinced anymore that we can use the simplified version in the general case, when multiple AIDs are present.
Let's take, for example, the scenario of a banking dataset, where the AIDs are the SSN and account_id columns. The same person can have multiple accounts, and an account can have multiple owners (shared ownership). If we want to count the number of distinct accounts, we still need to do flattening per person when summing contributions, in order to hide outlier individuals with many accounts, and vice versa.
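To make the banking scenario concrete, here is a minimal Python sketch. The rows and the helper name distinct_per_aid are made up for illustration. It only shows how contributions look per AID in a many-to-many dataset; it does not implement the actual flattening.

```python
from collections import defaultdict

# Hypothetical rows of (ssn, account_id). A shared account appears once per owner.
rows = [
    ("ssn1", "acc1"),
    ("ssn1", "acc2"),
    ("ssn1", "acc3"),
    ("ssn1", "acc4"),
    ("ssn2", "acc4"),  # acc4 is shared by ssn1 and ssn2
    ("ssn2", "acc5"),
    ("ssn3", "acc6"),
]

def distinct_per_aid(rows, aid_index, value_index):
    """Map each value of one column to the set of distinct values of another."""
    result = defaultdict(set)
    for row in rows:
        result[row[aid_index]].add(row[value_index])
    return dict(result)

# How many distinct accounts each person contributes, and vice versa.
accounts_per_person = distinct_per_aid(rows, 0, 1)
owners_per_account = distinct_per_aid(rows, 1, 0)

# The plain count(distinct account_id), with no per-person adjustment.
total_distinct_accounts = len({account for _, account in rows})
```

Here ssn1 is associated with four of the six distinct accounts, so a count that ignores the SSN column would let that one person dominate the result. That is the kind of outlier the per-person flattening is meant to hide.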
Ok, I won't merge until we think more about this.
Each histogram bar is a separate bucket, which we should treat like regular buckets.
But in order to determine the histogram bin we need to know how many rows an AID has contributed. With multiple AIDs I'm not sure how to do the counting...
The same way we do count(distinct column), but replace column with the counted AID.
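One possible reading of this suggestion, sketched in Python: count rows per value of the counted AID, use that count as the histogram bin, and gather the AID values from every AID column into the bin so each bar can later be treated like a regular bucket. The function name histogram_buckets, the row layout (one tuple of AID values per row), and the result shape are all assumptions for illustration; low-count filtering and flattening are left out.

```python
def histogram_buckets(rows, counted_index):
    """Group values of the counted AID by row count.

    Each row is a tuple of AID values. Returns a dict mapping
    row count -> {"count": number of counted-AID values in the bin,
                  "aids": one set of AID values per AID column}.
    """
    n_aids = len(rows[0]) if rows else 0

    # Step 1: per value of the counted AID, count rows and collect
    # the AID values that appear alongside it.
    per_value = {}
    for row in rows:
        entry = per_value.setdefault(
            row[counted_index],
            {"rows": 0, "aids": [set() for _ in range(n_aids)]},
        )
        entry["rows"] += 1
        for i, aid in enumerate(row):
            entry["aids"][i].add(aid)

    # Step 2: the row count selects the bin; merge the AID sets into it.
    buckets = {}
    for entry in per_value.values():
        bucket = buckets.setdefault(
            entry["rows"],
            {"count": 0, "aids": [set() for _ in range(n_aids)]},
        )
        bucket["count"] += 1
        for i in range(n_aids):
            bucket["aids"][i] |= entry["aids"][i]
    return buckets
```

With rows of (ssn, account_id) and counted_index=0, each bin records how many people contributed that many rows, together with the SSNs and account IDs behind the bar. Those sets are what a low-count check over multiple AIDs would need as input.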
I want to avoid the complex count distinct version in favor of a simple and specialized count(distinct aid) function. Because I find it hard to understand the flattening code, I need feedback on this particular function before moving on with the histogram aggregate. What you see here is my best guess, which is probably wrong. I also need the single AID version of isLowCount.