Open vdestraitt opened 5 years ago
Hi @vdestraitt! Just nitpicking on the step definition: I think I would vote for a simpler vocabulary, such as:
{
  name: 'aggregate',
  on: ['column1', 'column2'],
  aggregations: [
    {
      name: 'sum_value1',
      operation: 'sum',
      column: 'value1',
    },
    // ...
  ]
}
I wonder whether each aggregation should itself be a generic "function" step definition (to be defined) that could be reused outside the aggregation context and that would allow arbitrary expressions rather than just applying a function to a single column.
Hi @adimascio! I agree with the proposed vocabulary; I find 'on' and 'aggregations' better. But I find it simpler if an aggregation is defined by just 2 parameters: the target column and a basic operation (or 'agg_function', which I find maybe more explanatory).
I am not sure I understand your comment about a generic function step definition. Does it mean that you would like to be able to define more complex aggregation functions (maybe a combination of aggregation functions and scalars, for example?) independently of the 'aggregate' step, and then call them when needed in any step needing a function?
If so, I think it may be an interesting idea for the future, but 100% of our historical use cases consist of applying basic aggregation functions (count, sum, etc.). Also, this basic approach is enough to cover more complex use cases, as you can combine several 'basic' aggregations and then project a formula.
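To illustrate the point about combining basic aggregations and projecting a formula, here is a minimal sketch of a ratio computed from two sums. The helper `ratioOfSumsPipeline` and the column names `store`, `revenue` and `orders` are hypothetical and only serve the illustration; `$group`, `$sum`, `$project`, `$cond` and `$divide` are standard MongoDB aggregation operators.

```javascript
// Hypothetical helper: sums two columns per group, then projects their ratio.
function ratioOfSumsPipeline(on, numerator, denominator, outputName) {
  // Group on every requested dimension, keyed by column name.
  const id = {};
  for (const column of on) {
    id[column] = `$${column}`;
  }
  return [
    {
      $group: {
        _id: id,
        sum_numerator: { $sum: `$${numerator}` },
        sum_denominator: { $sum: `$${denominator}` },
      },
    },
    {
      $project: {
        _id: 1,
        [outputName]: {
          // Return null instead of failing when the denominator is 0.
          $cond: [
            { $eq: ['$sum_denominator', 0] },
            null,
            { $divide: ['$sum_numerator', '$sum_denominator'] },
          ],
        },
      },
    },
  ];
}

const ratioPipeline = ratioOfSumsPipeline(['store'], 'revenue', 'orders', 'average_basket');
console.log(JSON.stringify(ratioPipeline, null, 2));
```

The formula lives entirely in the `$project` stage, so the aggregation itself only needs the basic `sum` operation.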
Last comment: not every function that may be defined outside of an aggregation context can be used for aggregation purposes... So it may become tricky to explain to our end users.
Does it mean that you would like to be able to define more complex aggregation functions (maybe a combination of aggregation functions and scalars, for example?) independently of the 'aggregate' step, and then call them when needed in any step needing a function?
That is what I meant, yes, but let's keep that for later.
Still to be implemented: count_distinct
=> for next milestone
TO DO
count_distinct
Example 1: Several aggregations on several dimensions
This 'group' step config in our VQB "language"...
... should yield the following mongo aggregation pipeline:
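As a hedged sketch of what such a translation could look like, the function below turns a step into a `$group` stage followed by a `$project` stage that flattens `_id`. It assumes the step shape proposed earlier in this thread (`on`, plus `aggregations` entries with `name`, `operation` and `column`); the function name `groupStepToMongo`, the `count` operation and the trailing `$project` are assumptions made for the illustration, not the agreed design.

```javascript
// Assumed mapping from step operations to MongoDB accumulators.
const ACCUMULATORS = new Map([
  ['sum', (column) => ({ $sum: `$${column}` })],
  ['avg', (column) => ({ $avg: `$${column}` })],
  ['min', (column) => ({ $min: `$${column}` })],
  ['max', (column) => ({ $max: `$${column}` })],
  // Counting rows does not depend on the column.
  ['count', () => ({ $sum: 1 })],
]);

// Hypothetical translator from a step config to a Mongo pipeline.
function groupStepToMongo(step) {
  const id = {};
  const project = { _id: 0 };
  for (const column of step.on) {
    id[column] = `$${column}`;
    // Bring each dimension back to the top level of the output document.
    project[column] = `$_id.${column}`;
  }
  const group = { _id: id };
  for (const { name, operation, column } of step.aggregations) {
    const accumulator = ACCUMULATORS.get(operation);
    if (accumulator === undefined) {
      throw new Error(`Unsupported operation: ${operation}`);
    }
    group[name] = accumulator(column);
    project[name] = 1;
  }
  return [{ $group: group }, { $project: project }];
}

const exampleStep = {
  name: 'aggregate',
  on: ['column1', 'column2'],
  aggregations: [
    { name: 'sum_value1', operation: 'sum', column: 'value1' },
    { name: 'avg_value2', operation: 'avg', column: 'value2' },
  ],
};
console.log(JSON.stringify(groupStepToMongo(exampleStep), null, 2));
```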
Example 2: We keep the same generic approach for aggregations on one dimension
This 'group' step config in our VQB "language"...
... should yield the following mongo aggregation pipeline:
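One way to read "the same generic approach" is that `_id` stays an object keyed by column name even when there is a single dimension, so downstream stages do not need a special case. The sketch below assumes that reading; `buildGroupId` is a hypothetical helper, and the step shape is the one proposed earlier in this thread.

```javascript
// Hypothetical helper: `_id` is always an object, even for one dimension.
function buildGroupId(on) {
  return Object.fromEntries(on.map((column) => [column, `$${column}`]));
}

const oneDimensionStep = {
  name: 'aggregate',
  on: ['column1'],
  aggregations: [{ name: 'sum_value1', operation: 'sum', column: 'value1' }],
};

// `_id` is { column1: '$column1' } rather than the shorter '$column1'.
const oneDimensionPipeline = [
  { $group: { _id: buildGroupId(oneDimensionStep.on), sum_value1: { $sum: '$value1' } } },
  { $project: { _id: 0, column1: '$_id.column1', sum_value1: 1 } },
];
console.log(JSON.stringify(oneDimensionPipeline, null, 2));
```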
Example 3: The non-obvious 'count_distinct' in Mongo...
This 'group' step config in our VQB "language"...
... should yield the following mongo aggregation pipeline:
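A common way to express a distinct count in a Mongo aggregation pipeline is to collect the distinct values with `$addToSet` and then take the `$size` of the resulting array. The sketch below assumes that technique; it may not be the exact pipeline intended here. The helper `countDistinctPipeline` and the intermediate field name `distinct_values` are hypothetical.

```javascript
// Hypothetical helper: counts the distinct values of `column` per group.
function countDistinctPipeline(on, column, outputName) {
  const id = {};
  const project = { _id: 0 };
  for (const dimension of on) {
    id[dimension] = `$${dimension}`;
    project[dimension] = `$_id.${dimension}`;
  }
  return [
    // Step 1: gather each distinct value once per group.
    { $group: { _id: id, distinct_values: { $addToSet: `$${column}` } } },
    // Step 2: the distinct count is the size of that set.
    { $project: { ...project, [outputName]: { $size: '$distinct_values' } } },
  ];
}

console.log(JSON.stringify(countDistinctPipeline(['column1'], 'value1', 'count_distinct_value1'), null, 2));
```

The set of distinct values is materialized per group before being counted, which is worth keeping in mind for high-cardinality columns.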
Example 4: ... And Mongo is even trickier when 'count_distinct' needs to be combined with other aggregations!
Important notice: We show here how we can perform such aggregations in the Mongo aggregation pipeline, but the query gets quite complicated, and we do not even cover the challenge of performing several count_distinct aggregations at the same time. In such a case, the approach detailed below would need to be nested as many times as there are count_distinct aggregations. It quickly becomes unacceptable in terms of both readability and performance.
=> So we believe that we should switch to the postprocess option as soon as the aggregation contains a 'count_distinct' combined with at least one other aggregation.
This 'group' step config in our VQB "language"...
... should yield the following mongo aggregation pipeline:
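One possible shape for such a pipeline, sketched here under assumptions, is a two-level `$group`: first group on the dimensions plus the column to count distinctly, then group again on the dimensions alone. Each first-level group then counts as one distinct value, and partial sums are re-summed. This sketch only covers a `sum` next to a single `count_distinct`; the helper name `countDistinctWithSumPipeline` and the field name `partial_sum` are hypothetical, and this may differ from the pipeline originally intended.

```javascript
// Hypothetical helper: one count_distinct combined with one sum.
function countDistinctWithSumPipeline(on, distinctColumn, sumColumn, names) {
  // First level: one group per (dimensions, distinct value) combination.
  const innerId = { [distinctColumn]: `$${distinctColumn}` };
  // Second level: regroup on the dimensions only.
  const outerId = {};
  const project = { _id: 0 };
  for (const dimension of on) {
    innerId[dimension] = `$${dimension}`;
    outerId[dimension] = `$_id.${dimension}`;
    project[dimension] = `$_id.${dimension}`;
  }
  return [
    { $group: { _id: innerId, partial_sum: { $sum: `$${sumColumn}` } } },
    {
      $group: {
        _id: outerId,
        // Each first-level group stands for exactly one distinct value.
        [names.countDistinct]: { $sum: 1 },
        // Summing partial sums gives the total sum.
        [names.sum]: { $sum: '$partial_sum' },
      },
    },
    { $project: { ...project, [names.countDistinct]: 1, [names.sum]: 1 } },
  ];
}

const combinedPipeline = countDistinctWithSumPipeline(
  ['column1'],
  'value1',
  'value2',
  { countDistinct: 'count_distinct_value1', sum: 'sum_value2' },
);
console.log(JSON.stringify(combinedPipeline, null, 2));
```

Re-aggregating partial results works for `sum`, but an average cannot simply be averaged again across the first-level groups, which hints at how quickly the generic case gets complicated.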
Example 5: Several aggregations on the same column
This 'group' step config in our VQB "language"...
... should yield the following mongo aggregation pipeline:
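When several operations target the same column, each result needs its own output field in the `$group` stage. The sketch below assumes output names derived as `<operation>_<column>`, which is a hypothetical convention rather than something settled in this thread; `sameColumnGroupStage` is likewise a made-up helper name.

```javascript
// Operations whose MongoDB accumulator is simply `$<operation>`.
const SIMPLE_OPERATIONS = new Set(['sum', 'avg', 'min', 'max']);

// Hypothetical helper: applies several operations to one column.
function sameColumnGroupStage(on, column, operations) {
  const stage = {
    _id: Object.fromEntries(on.map((dimension) => [dimension, `$${dimension}`])),
  };
  for (const operation of operations) {
    if (!SIMPLE_OPERATIONS.has(operation)) {
      throw new Error(`Unsupported operation: ${operation}`);
    }
    // Assumed naming convention: <operation>_<column>.
    const outputName = `${operation}_${column}`;
    if (Object.prototype.hasOwnProperty.call(stage, outputName)) {
      throw new Error(`Duplicate aggregation: ${outputName}`);
    }
    stage[outputName] = { [`$${operation}`]: `$${column}` };
  }
  return { $group: stage };
}

console.log(JSON.stringify(sameColumnGroupStage(['column1'], 'value1', ['sum', 'avg']), null, 2));
```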