avast / ep-stats

Statistics for Experimentation Platform
MIT License

Do not require extra copy of dimensional goals aggregated data when evaluating without dimensions #74

Open ondraz opened 4 months ago

ondraz commented 4 months ago

E.g. when evaluating the aggregated data of the test-multi-dimension experiment in test_multi_dimension, we are required to have another copy of the aggregated data without dimensional columns.

It would be nice to just "group by" the dimensional data, without needing extra aggregated data without dimensions in the agg goals dataframe.

current data:

test-multi-dimension        a   test_unit_type  global  exposure            1000    1000    1000    1000    1000
test-multi-dimension        b   test_unit_type  global  exposure            1001    1001    1001    1001    1001
test-multi-dimension        a   test_unit_type  unit    view    button-1    p-1 200 200 200 200 200
test-multi-dimension        b   test_unit_type  unit    view    button-1    p-1 220 220 220 220 220
test-multi-dimension        a   test_unit_type  unit    view    button-1        100 100 100 100 100
test-multi-dimension        b   test_unit_type  unit    view    button-1        180 180 180 180 180
test-multi-dimension        a   test_unit_type  unit    view            300 300 300 300 300
test-multi-dimension        b   test_unit_type  unit    view            400 400 400 400 400

data format requested in this issue:

test-multi-dimension        a   test_unit_type  global  exposure            1000    1000    1000    1000    1000
test-multi-dimension        b   test_unit_type  global  exposure            1001    1001    1001    1001    1001
test-multi-dimension        a   test_unit_type  unit    view    button-1    p-1 200 200 200 200 200
test-multi-dimension        b   test_unit_type  unit    view    button-1    p-1 220 220 220 220 220
test-multi-dimension        a   test_unit_type  unit    view    button-1        100 100 100 100 100
test-multi-dimension        b   test_unit_type  unit    view    button-1        180 180 180 180 180

jancervenka commented 3 months ago

@ondraz Hi Ondro, I think doing this might be tricky because the dataframe would then need to contain all possible combinations of element and product dimension values for the view goal (for example, rows for button-2 and p-2). Otherwise, the group by would produce aggregations with missing data.

ondraz commented 3 months ago

If we just aggregate (sum) these 4 lines of agg. goal data,

test-multi-dimension        a   test_unit_type  unit    view    button-1    p-1 200 200 200 200 200
test-multi-dimension        b   test_unit_type  unit    view    button-1    p-1 220 220 220 220 220
test-multi-dimension        a   test_unit_type  unit    view    button-1        100 100 100 100 100
test-multi-dimension        b   test_unit_type  unit    view    button-1        180 180 180 180 180

we get exactly what we already have in the extra two lines with no dim values:

test-multi-dimension        a   test_unit_type  unit    view            300 300 300 300 300
test-multi-dimension        b   test_unit_type  unit    view            400 400 400 400 400

so:

  1. goal count(test_unit_type.unit.view) - we can use four lines above and just sum them
  2. goal count(test_unit_type.unit.view(element=button-1)) - we filter the four lines above by dim value and sum the values

There was probably a reason we did it this way, requiring those two extra lines with empty dim values, but I don't recall it.
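
The two cases above can be sketched with pandas. This is only an illustration of the idea, not how ep-stats evaluates goals: the column names (`exp_variant_id`, `goal`, `element`, `product`, `count`) and the helper `goal_count` are assumptions made for this example, and only the first numeric column of each row is used.

```python
import pandas as pd

# The four dimensional rows quoted above. Column names are assumed for this
# sketch; an empty string stands for an empty dimension value.
goals = pd.DataFrame(
    [
        ("a", "view", "button-1", "p-1", 200),
        ("b", "view", "button-1", "p-1", 220),
        ("a", "view", "button-1", "", 100),
        ("b", "view", "button-1", "", 180),
    ],
    columns=["exp_variant_id", "goal", "element", "product", "count"],
)


def goal_count(df, goal, **dims):
    """Hypothetical helper: sum `count` per variant, optionally filtered by dimension values."""
    mask = df["goal"] == goal
    for column, value in dims.items():
        mask = mask & (df[column] == value)
    return df[mask].groupby("exp_variant_id")["count"].sum()


# Case 1: count(test_unit_type.unit.view), summing all four rows per variant.
total_views = goal_count(goals, "view")

# Case 2: count(test_unit_type.unit.view(element=button-1)), filter then sum.
button_1_views = goal_count(goals, "view", element="button-1")
```

With this data, `total_views` is 300 for variant a and 400 for variant b, which matches the two extra lines with no dim values.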

jancervenka commented 3 months ago

You're right that it works in this case but I don't think it would work in general.

  1. Let's say there are 200 views with element = button-2 in the data that the DAO is selecting from.
  2. But because there is no goal count(test_unit_type.unit.view(element=button-2)) in the experiment metrics, the button-2 views will not show up in the data frame.
  3. Summing the dataframe rows with element = button-1 to produce count(test_unit_type.unit.view) will get us incorrect results because it will be missing the 200 button-2 views.
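
The three steps above can be reproduced in a small pandas sketch. The `source` frame stands in for whatever the DAO selects from, the 300 button-1 views are a made-up number, and the column names and the `requested_elements` filter are assumptions for illustration only, not the actual DAO logic.

```python
import pandas as pd

# Hypothetical source data for one variant; column names are assumed.
source = pd.DataFrame(
    [
        ("a", "view", "button-1", 300),
        ("a", "view", "button-2", 200),
    ],
    columns=["exp_variant_id", "goal", "element", "count"],
)

# Only element=button-1 is referenced by an experiment metric, so only
# those rows end up in the goals dataframe.
requested_elements = ["button-1"]
selected = source[source["element"].isin(requested_elements)]

true_total = int(source["count"].sum())      # all views in the source
summed_total = int(selected["count"].sum())  # what summing the dataframe gives
missing = true_total - summed_total          # views lost by summing
```

Here `missing` is 200, the button-2 views that never reach the dataframe.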

jancervenka commented 3 months ago

Also, looking at the test-multi-dimension data, it kind of doesn't make sense. 😄

test-multi-dimension        a   test_unit_type  unit    view    button-1    p-1 200 200 200 200 200
test-multi-dimension        b   test_unit_type  unit    view    button-1    p-1 220 220 220 220 220
test-multi-dimension        a   test_unit_type  unit    view    button-1        100 100 100 100 100
test-multi-dimension        b   test_unit_type  unit    view    button-1        180 180 180 180 180

For example, the third row contains all button-1 views from all products, so its total count shouldn't be lower than the total count from the first row, which represents button-1 views from the p-1 product only. The first row views should be a subset of the third row views.

I was just testing that the goal selection works correctly and didn't think about the specific values.
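
A quick way to spot this kind of inconsistency in test data is sketched below with pandas. It assumes an empty product value means "all products", as described above; the column names are made up for this example.

```python
import pandas as pd

# The four rows quoted above, first numeric column only. Column names are assumed.
rows = pd.DataFrame(
    [
        ("a", "button-1", "p-1", 200),
        ("b", "button-1", "p-1", 220),
        ("a", "button-1", "", 100),
        ("b", "button-1", "", 180),
    ],
    columns=["exp_variant_id", "element", "product", "count"],
)

# Per-variant counts for "all products" and for the p-1 subset.
all_products = rows[rows["product"] == ""].set_index("exp_variant_id")["count"]
p1_only = rows[rows["product"] == "p-1"].set_index("exp_variant_id")["count"]

# A subset can never be larger than the whole, so a positive difference
# flags an inconsistent variant.
excess = p1_only - all_products
inconsistent = sorted(excess[excess > 0].index)
```

For this data both variants are flagged: p-1 has 200 and 220 views, against 100 and 180 for all products.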