ToucanToco / weaverbird

A visual data pipeline builder with various backends
https://weaverbird.toucantoco.dev
BSD 3-Clause "New" or "Revised" License

'Aggregate' step to Mongo :: Support count distinct #30

Open vdestraitt opened 5 years ago

vdestraitt commented 5 years ago

TO DO


Example 1: Several aggregations on several dimensions

This 'group' step config in our VQB "language"...


 {
      name: 'group',
      on: ['dimension1', 'dimension2'],
      aggregations: [
        {
           name: "sum_value1" # default name: '<function>_<column>'
           column: "value1"
           agg_function: 'sum'
        }
        {
           name: "avg_value2"
           column: "value2"
           agg_function: 'avg'
        }
        {
           name: "MIN"
           column: "value3"
           agg_function: 'min'
        }
        {
           name: "MAX"
           column: "value4"
           agg_function: 'max'
        }
        {
           name: "count_value5"
           column: "value5"
           agg_function: 'count'
        }
        {
           name: "value6"
           column: "value6"
           agg_function: 'first'
        }
        {
           name: "value7"
           column: "value7"
           agg_function: 'last'
        }
      ]
  }

... should yield the following mongo aggregation pipeline:

query: [
  $group:
    _id: 
      dimension1: "$dimension1"
      dimension2: "$dimension2"
    sum_value1:
      $sum: '$value1'
    avg_value2:
      $avg: '$value2'
    MIN:
      $min: "$value3"
    MAX:
      $max: "$value4"
    count_value5:
      $sum: 1 # Careful here: there is no '$count' operator in '$group', we '$sum' a series of 1s to count
    value6:
      $first: "$value6"
    value7:
      $last: "$value7"
,
  $project:
    dimension1: "$_id.dimension1"
    dimension2: "$_id.dimension2"
    sum_value1: 1
    avg_value2: 1
    MIN: 1
    MAX: 1
    count_value5: 1
    value6: 1
    value7: 1
]
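The translation rule illustrated above can be sketched in Python. This is a hypothetical helper (names like `group_step_to_mongo` and `AGG_OPS` are illustrative, not weaverbird's actual code): map each `agg_function` to its Mongo accumulator, group on the listed dimensions, then project the `_id` fields back to the top level.

```python
# Hypothetical sketch of the 'group' step translation described above.
# Not weaverbird's actual implementation.

AGG_OPS = {'sum': '$sum', 'avg': '$avg', 'min': '$min',
           'max': '$max', 'first': '$first', 'last': '$last'}

def group_step_to_mongo(step):
    # $group: one _id field per dimension, one output field per aggregation
    group = {'_id': {col: f'${col}' for col in step['on']}}
    # $project: flatten the _id fields back to top level
    project = {col: f'$_id.{col}' for col in step['on']}
    for agg in step['aggregations']:
        if agg['agg_function'] == 'count':
            # No '$count' accumulator in '$group': sum a series of 1s instead
            group[agg['name']] = {'$sum': 1}
        else:
            group[agg['name']] = {AGG_OPS[agg['agg_function']]: f"${agg['column']}"}
        project[agg['name']] = 1
    return [{'$group': group}, {'$project': project}]
```

Because the `_id` is always a document keyed by dimension names, the same function covers one dimension or several without a special case.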

Example 2: We keep the same generic approach for aggregations on one dimension

This 'group' step config in our VQB "language"...


 {
      name: 'group',
      on: ['dimension1']
      aggregations: [
        {
           name: "sum_value1"
           column: "value1"
           agg_function: 'sum'
        }
      ]
  }

... should yield the following mongo aggregation pipeline:

query: [
  $group:
    _id: 
      dimension1: "$dimension1" # not "_id: '$dimension1'", which would also work, but we want to keep the approach as generic as possible, so we use a "levelled" definition of "_id" that generalises to grouping on several dimensions
    sum_value1:
      $sum: '$value1'
,
  $project:
    dimension1: "$_id.dimension1"
    sum_value1: 1
]

Example 3: The non-obvious 'count_distinct' in Mongo...

This 'group' step config in our VQB "language"...


 {
      name: 'group',
      on: ['dimension1', 'dimension2']
      aggregations: [
        {
           name: "count_distinct_client"
           column: "client"
           agg_function: 'count_distinct'
        }
      ]
  }

... should yield the following mongo aggregation pipeline:

query: [
  $group:
    _id: 
      dimension1: "$dimension1"
      dimension2: "$dimension2"
      client: "$client" # This first '$group' builds unique groups: one document per combination of dimension1 x dimension2 x client
,
  $group:
    _id: 
      dimension1: "$_id.dimension1"
      dimension2: "$_id.dimension2"
    count_distinct_client:
      $sum: 1 # Count the unique groups, and therefore the distinct clients, for every combination of dimension1 x dimension2
,
  $project:
    dimension1: "$_id.dimension1"
    dimension2: "$_id.dimension2"
    count_distinct_client: 1
]
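The double-`$group` trick above can be sketched as a small Python helper (hypothetical names, assuming a single count_distinct aggregation and no other aggregation in the step):

```python
# Hypothetical sketch of the count_distinct translation above: a first
# $group makes each (dimensions x column) combination unique, a second
# $group counts those unique documents. Not weaverbird's actual code.

def count_distinct_pipeline(dimensions, column, name):
    # Stage 1: one document per unique (dimensions..., column) combination
    first_id = {d: f'${d}' for d in dimensions}
    first_id[column] = f'${column}'
    # Stage 2: count the unique documents per dimensions combination
    second_id = {d: f'$_id.{d}' for d in dimensions}
    # Stage 3: flatten the _id fields back to top level
    project = {d: f'$_id.{d}' for d in dimensions}
    project[name] = 1
    return [
        {'$group': {'_id': first_id}},
        {'$group': {'_id': second_id, name: {'$sum': 1}}},
        {'$project': project},
    ]
```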

Example 4: ... and Mongo is even trickier when 'count_distinct' needs to be combined with other aggregations!

Important notice: we show here how such aggregations can be performed in the Mongo aggregation pipeline, but the query gets quite complicated, and we do not even cover the challenge of performing several count_distinct aggregations at the same time. In that case, the approach detailed below would need to be nested once per count_distinct aggregation needed, which quickly becomes unacceptable in terms of both readability and performance.

=> So we believe that we should switch to the postprocess option as soon as the aggregation contains a 'count_distinct' combined with at least one other aggregation.
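That rule could look like the following predicate (a hypothetical helper name, assuming the config shape used in these examples):

```python
# Sketch of the fallback rule proposed above: switch to a postprocess
# (in-memory) implementation when 'count_distinct' is combined with at
# least one other aggregation. 'needs_postprocess' is a hypothetical name.

def needs_postprocess(aggregations):
    has_count_distinct = any(
        a['agg_function'] == 'count_distinct' for a in aggregations)
    return has_count_distinct and len(aggregations) > 1
```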

This 'group' step config in our VQB "language"...


 {
      name: 'group',
      on: ['dimension1', 'dimension2']
      aggregations: [
        {
           name: "sum_value"
           column: "value"
           agg_function: 'sum'
        }
        {
           name: "count_distinct_client"
           column: "client"
           agg_function: 'count_distinct'
        }
      ]
  }

... should yield the following mongo aggregation pipeline:

query: [
  $group:
    _id: 
      dimension1: "$dimension1"
      dimension2: "$dimension2"
      client: "$client"
    array_detail:
      $push: '$$ROOT' # To keep track of the details and be able to perform every aggregation, we push every document ('$$ROOT' is a mongo variable similar to a cursor representing the current document being read) in arrays, with one array per group of [dimensions + the column for which we need to get the number of unique values]
,
  $group:
    _id: 
      dimension1: "$_id.dimension1"
      dimension2: "$_id.dimension2"
    client:
      $sum: 1 # Here we count the unique clients, as they are unique per the '_id' definition of the previous stage
    array_group:
      $push: '$$ROOT' # Same as in the stage above: we keep track of every document so that the other aggregations can be performed later (see stages below)
,
  $unwind: '$array_group' # The unwind stage "destructures" the array to get one root document per document found in the array => in tabular terms, we get 1 row per document found in the array
,
  $project:
    dimension1: "$_id.dimension1"
    dimension2: "$_id.dimension2"
    client: 1
    array: "$array_group.array_detail" # We get the array with the most granular level of documents here to unwind it in the following step to get a row per document
,
  $unwind: "$array"
,
  $group:
    _id:
      dimension1: "$dimension1"
      dimension2: "$dimension2"
    sum_value:
      $sum: "$array.value" # Now we can perform the other aggregations based on the most granular level
    count_distinct_client:
      $first: "$client" # And we need to keep the first value of count distinct found per group as this value was duplicated after the '$unwind' operations
,
  $project:
    dimension1: "$_id.dimension1"
    dimension2: "$_id.dimension2"
    sum_value: 1
    count_distinct_client: 1
]
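For reference, the postprocess alternative recommended above is straightforward in plain Python. This is an illustrative in-memory equivalent of the pipeline just shown (rows as plain dicts; helper name is hypothetical):

```python
# A minimal in-memory (postprocess) equivalent of the Example 4 pipeline:
# sum 'value' and count distinct 'client' per (dimension1, dimension2).
# Illustrative only, not weaverbird's actual code.
from collections import defaultdict

def group_sum_and_count_distinct(rows):
    sums = defaultdict(int)
    clients = defaultdict(set)
    for row in rows:
        key = (row['dimension1'], row['dimension2'])
        sums[key] += row['value']
        clients[key].add(row['client'])  # sets deduplicate clients per group
    return [
        {'dimension1': d1, 'dimension2': d2,
         'sum_value': sums[(d1, d2)],
         'count_distinct_client': len(clients[(d1, d2)])}
        for (d1, d2) in sums
    ]
```

The comparison makes the trade-off concrete: six pipeline stages with `$push: '$$ROOT'` arrays versus a single linear pass in memory.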

Example 5: Several aggregations on the same column

This 'group' step config in our VQB "language"...


 {
      name: 'group',
      on: ['dimension1', 'dimension2'],
      aggregations: [
        {
           name: "sum_value1" # default name: '<function>_<column>'
           column: "value1"
           agg_function: 'sum'
        }
        {
           name: "my_name"
           column: "value1"
           agg_function: 'count'
        }
      ]
  }

... should yield the following mongo aggregation pipeline:

query: [
  $group:
    _id: 
      dimension1: "$dimension1"
      dimension2: "$dimension2"
    sum_value1:
      $sum: '$value1'
    my_name:
      $sum: 1 # Again, there is no '$count' operator in '$group': we '$sum' a series of 1s to count (cf. example 1)
,
  $project:
    dimension1: "$_id.dimension1"
    dimension2: "$_id.dimension2"
    sum_value1: 1
    my_name: 1
]
adimascio commented 5 years ago

Hi @vdestraitt! Just nitpicking on the step definition: I think I would vote for a simpler vocabulary, such as:

{
   name: 'aggregate',
   on: ['column1', 'column2'],
   aggregations: [
      {
          name: 'sum_value1',
          operation: 'sum',
          column: 'value1'
      }
      # ...
   ]
}

I wonder if each aggregation should not itself be a generic "function" step definition (to be defined) that could be reused outside of the aggregation context, and that would allow arbitrary expressions rather than just applying a function to a (single) column.

vdestraitt commented 5 years ago

Hi @adimascio! I agree on the proposed vocabulary ('on' and 'aggregations'), I find it better. But I find it simpler if an aggregation is defined by just 2 parameters: the target column and a basic operation (or 'agg_function', which I find perhaps more self-explanatory).

I'm not sure I understand your comment about a generic function step definition. Do you mean that you would like to be able to define more complex aggregation functions (maybe a combination of aggregation functions and scalars, for example?) independently of the 'aggregate' step, and then call them when needed in any step requiring a function?

If so, I think it may be an interesting idea for the future, but 100% of our historical use cases consist in applying basic aggregation functions (count, sum, etc.). Also, this basic approach is enough to cover more complex use cases, as you can combine several 'basic' aggregations and project a formula.

Last comment: not every function that may be defined outside of an aggregation context can be used for aggregation purposes... so it may become tricky to explain to our end users.

adimascio commented 5 years ago

Do you mean that you would like to be able to define more complex aggregation functions (maybe a combination of aggregation functions and scalars, for example?) independently of the 'aggregate' step, and then call them when needed in any step requiring a function?

That is what I meant, yes, but let's keep that for later.

vdestraitt commented 5 years ago

Still to be implemented: count_distinct => for next milestone