Integrating metric events

dannycoates / able

A/B testing service

http://dannycoates.github.io/able/

Integrating metric events #3

Open dannycoates opened 9 years ago

dannycoates commented 9 years ago

For A/B experiments to actually work we need a way to analyze the data, but before we can even do that we need a way to report the data.

Right now all that exists is able.report(), which gives you some data for each experiment you're enrolled in: the experiment name and the independentVariables:values that were chosen for each subject (usually only one).

{
  "name":"myExperiment",
  "choices": {
    "1a4b2d54c3e": {
      "buttonText": "why not?"
    }
  }
}

That's useful for knowing how many subjects got each value, but it's not enough to do anything. Somehow we need to link choices to events that are relevant to the experiment.

In the most naive way I think it would be nice to define which events my experiment is interested in tracking and then have the choices linked to those events.

Borrowing from Shane's example I'd like to add events

module.exports = {
  name: 'signInButtonTextMatters',
  hypothesis: 'The sign in button text affects signins',
  startDate: '2015-01-01',
  subjectAttributes: ['lang'],
  independentVariables: ['signInButtonText'],
  eligibilityFunction: function (subject) {
    return /en-US/.test(subject.lang);
  },
  groupingFunction: function (subject) {
    return {
      signInButtonText: this.uniformChoice([
        this.defaults.signInButtonText,
        'Come on in'
      ])
    };
  },
  events: ['signInClicked']
};

So, whenever the signInClicked event fired, in addition to whatever it normally logs the "report" for that experiment would get logged so that we can correlate the choices with the event.

This means the experiment author will need to know what events are available to track, just as they need to know the independentVariables.

Gluing Able to whatever generates and logs these events will be an Issue for another day, but I'm wondering if a simple list of events is enough or do we need something more powerful?

@kparlante @shane-tomlinson

shane-tomlinson commented 9 years ago

@dannycoates, @kparlante - I'm thinking about a few things. It would be nice to make the correlation very explicit with minimal processing.

Consumer does all the work

Imagine us, as the consumer, doing something like:

render: function () {
  this.logScreen('signin');
  var signInButtonText = able.choose('signInButtonText');
  $('.signInButton').text(signInButtonText);
  this.logScreenEvent('ab.signInButtonText.' + signInButtonText);
  // Remember the choice so the click handler can report against it.
  this._signInButtonText = signInButtonText;
  ...
},
...
onSignInButtonClick: function () {
  this.logScreenEvent('ab.success.signInButtonText.' + this._signInButtonText);
}

The downside is it's kind of sloppy. The upside is able doesn't need to change.

Able gives us a big hand

I am wondering if we could do something like:

render: function () {
  this.logScreen('signin');
  var signInButtonText = able.choose('signInButtonText');
  $('.signInButton').text(signInButtonText);
  ...
},
...
onSignInButtonClick: function () {
  able.success('signInButtonText');
}

Then, when we call able.results(), it would report something like:

{
  "signInButtonText": {
    "experiment": "myExperiment",
    "chosen": "come on in",
    "success": true
  }
}

In this case, able.results would only return results for variables for which able.choose was called.

This increases the scope of able, but reduces the amount of work on the consumer.
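
A minimal sketch of how the choose / success / results trio described above might fit together. The `createTracker` factory, its `assignments` argument, and the internal bookkeeping are assumptions made for illustration; this is not Able's actual implementation.

```javascript
// Hypothetical sketch: createTracker and its internals are illustrative only.
// `assignments` maps variable name -> { experiment, value }, standing in for
// whatever Able's grouping functions would have picked for this subject.
function createTracker(assignments) {
  const results = {};

  return {
    // Return the chosen value and remember that the variable was used.
    choose(variable) {
      const assignment = assignments[variable];
      if (!assignment) {
        return undefined;
      }
      // Do not reset an existing entry if choose() is called again.
      if (!results[variable]) {
        results[variable] = {
          experiment: assignment.experiment,
          chosen: assignment.value,
          success: false
        };
      }
      return assignment.value;
    },

    // Mark a previously chosen variable as successful.
    success(variable) {
      if (results[variable]) {
        results[variable].success = true;
      }
    },

    // Only variables that went through choose() appear here.
    results() {
      return JSON.parse(JSON.stringify(results));
    }
  };
}

const tracker = createTracker({
  signInButtonText: { experiment: 'myExperiment', value: 'come on in' },
  headerColor: { experiment: 'otherExperiment', value: 'blue' }
});

tracker.choose('signInButtonText');
tracker.success('signInButtonText');
console.log(tracker.results());
```

Because `headerColor` was never passed to `choose`, it does not show up in the results, which matches the "only variables for which able.choose was called" behavior described above.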

kparlante commented 9 years ago

@shane-tomlinson, @dannycoates: My gut is that whether or not an event fired is not going to be flexible enough to evaluate success/failure for many use cases. At first blush, I like Shane's proposal: it leaves the decision to the code, and the API looks simple, straightforward, and easy to understand. That said, I can imagine cases where the success or failure of the experiment is determined by code/events that happen in a different context (e.g. the user eventually validates their email), or by a set of events/conditions that is ugly/painful for the code to track (e.g. success if some events happened in a particular order without others happening). IIRC Optimizely had a way to write an independent piece of code that could report results back on a particular experiment (e.g. a query against some SQL database of metrics). Once we have the services pipeline ingesting a wide variety of sources at Mozilla, you could imagine Heka filters that report back to Able; the trick would be for that filter to know which experiment was run.

Shane's proposal sounds like a reasonable place to start -- I can imagine having multiple mechanisms to do the choice to outcome correlation.

dannycoates commented 9 years ago

@shane-tomlinson I think you're correct that able needs to be directly involved in the feedback. I'm not sure we are (err, I am) ready to talk about the "mechanism" yet, but there are a couple of things about your sketch that I'd like to consider.

Having the app make a call to Able, as you've got in the click handler, is probably the right way to report measurements. Able is already a dependency that we've opted into with choose, so I don't see a reason we need to hide or offload the data collection part.

I see you've directly tied the event[1] to the choice for signInButtonText, which I assume was intentional. This links two things I've been trying to keep separate so far, variables and events, in a way that tightly couples them. My "vision" :gags: so far has been that apps import variables and export events through Able and that experiments do the opposite, while the subject is the common thread between them. Apps and experiments should be able to develop fairly independently. Linking them by name in the app doesn't necessarily break that (nothing prevents us from having a variable and an event named the same thing) but it conflates their separate purposes in a way that might be confusing to future work.

Another goal was that app changes that involve Able can be "left in" for future experiments. For example, able.choose('signInButtonText') doesn't make any reference to a specific experiment so multiple experiments, simultaneously or over time, can change that variable without the app needing to change. I'd like to keep that same property with events. The sketch uses able.success('signInButtonText') which seems to break that principle because it both assumes that click event means success to every possible experiment and that the 'signInButtonText' is always relevant to that event. Of course for this experiment it is, but I think the experiment alone should have the power to define those things. I think we can fix that with a very slight modification:

onSignInButtonClick: function () {
  able.sendReport('signInClicked');
  // able will correlate the subject and event to the proper experiment and choices
}

Now that I think about it, having a strawman to poke at makes it easier to think about the limitations and possibilities of analysis, so :beers: Overall I think your sketch is very close to the mechanism I want. If we can satisfy our analysis requirements with it I'll be very happy. @kparlante I've got another reply coming :)

[1] - events, measurements, stats, results... all synonyms in this context, we should pick one

dannycoates commented 9 years ago

Thanks @kparlante

My gut is that whether or not an event fired is not going to be flexible enough to evaluate success/failure for many use cases

I agree. My goal with this thread is to discover what data we need to collect/report so that some "other" system (or future Able) can do the analysis. Maybe events are the wrong word, but my idea in general is that as an experimenter I'm interested in measuring specific things at specific times, and from those measurements I can do my analysis :wave:

In this discussion so far, events combine both 'thing' and 'time'. An experimenter can specify the 'time' (event name) to collect, but is stuck with whatever 'thing' can be measured based on what data is tied to the event; they can't make arbitrary measurements. Limitation or feature?

Anyway, an experiment can collect data from any number of events and they will get tied to the subject and variables chosen by Able. So I imagine the data stream would look something like this:

{
  event: 'signInClicked',
  time: 1422303764406,
  experiment: 'signInButtonTextMatters',
  subjectId: '6feaa3f2fba05421da38003a6dba8f7a',
  choices: {
    signInButtonText: 'Come on in'
  },
  data: {
    // maybe additional fields the event decides to report?
    // For example:
    // able.sendReport('signInClicked', { termsAccepted: app.termsAccepted })
    termsAccepted: true
  }
}
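
To make the record shape above concrete, here is a rough sketch of how a sendReport call might fan an event out to the experiments that asked for it. `createReporter`, the `enrolled` list, the `sink` array, and the injectable clock are all assumptions for illustration and are not part of Able.

```javascript
// Hypothetical sketch: createReporter, `enrolled`, and `sink` are assumptions
// for illustration, not part of Able's real API.
// `enrolled` lists the experiments this subject is in, each with the `events`
// it wants to track and the `choices` made for the subject.
function createReporter(subjectId, enrolled, sink, now = Date.now) {
  return {
    sendReport(eventName, data = {}) {
      const time = now();
      for (const experiment of enrolled) {
        // Only experiments that declared interest in this event get a record.
        if (!experiment.events.includes(eventName)) {
          continue;
        }
        sink.push({
          event: eventName,
          time,
          experiment: experiment.name,
          subjectId,
          choices: { ...experiment.choices },
          data: { ...data }
        });
      }
    }
  };
}

const stream = [];
const reporter = createReporter(
  '6feaa3f2fba05421da38003a6dba8f7a',
  [
    {
      name: 'signInButtonTextMatters',
      events: ['signInClicked'],
      choices: { signInButtonText: 'Come on in' }
    },
    {
      name: 'unrelatedExperiment',
      events: ['signUpClicked'],
      choices: { signUpButtonText: 'Join' }
    }
  ],
  stream,
  () => 1422303764406 // fixed clock so the example is reproducible
);

reporter.sendReport('signInClicked', { termsAccepted: true });
console.log(stream);
```

The app only names the event; which experiments receive a record is decided by each experiment's own `events` list, which is the separation argued for earlier in the thread.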

I can imagine cases where the success or failure of the experiment is determined by code/events that happen in a different context [...]

If both contexts use the same subject, Able should be able, heh :), to make the correlation in many cases. More complicated scenarios may need some other help.

So, given data like above, can we do the analysis we need?

kparlante commented 9 years ago

@dannycoates, @shane-tomlinson oh ic, yeah I think "event" terminology was confusing me. "able_event", "experiment_event"?

My "vision" :gags: so far has been that apps import variables and export events through Able and that experiments do the opposite, while the subject is the common thread between them. Apps and experiments should be able to develop fairly independently.

:+1: I like this. By subject do you mean the experiment name or the subjectId (presumably an identifier for the user, in this limited context)?

The sketch uses able.success('signInButtonText') which seems to break that principle because it both assumes that click event means success to every possible experiment and that the 'signInButtonText' is always relevant to that event. Of course for this experiment it is, but I think the experiment alone should have the power to define those things. I think we can fix that with a very slight modification:

:+1: To the reasoning and the proposed modification.

data: {
    // maybe additional fields the event decides to report?
    // For example:
    // able.sendReport('signInClicked', { termsAccepted: app.termsAccepted })
    termsAccepted: true
  }

Yes, seems like the ability to pass in additional information is useful.

So, given data like above, can we do the analysis we need?

Well, what's not clear to me from this scenario is what gets logged for the people who do not click on the button. Should the client code call able.report() to create an "event" when the user sees the button? Or do we presume the user has seen the button because able.choose() was called, and something gets logged for that?

dannycoates commented 9 years ago

By subject do you mean the experiment name or the subjectId

subjectId, which would usually correlate 1-1 with userId for authenticated sessions or sessionId for unauthed sessions.

Should the client code call able.report() to create an "event" when the user sees the button? Or do we presume the user has seen the button because able.choose() was called, and something gets logged for that?

I think we definitely want able.choose to emit an implicit event to record the choice. I think beyond that it's probably up to the app devs and experimenters to figure out which events will work and if new ones should be added.

So,

what's not clear to me from this scenario is what gets logged for the people who do not click on the button

In this case we'd log the choice event and nothing else. Whether that's enough I don't know; if not, the experiment could track another event to close the loop, 'pageUnload' maybe?

I imagine it will take some time to come up with good practices for designing experiments. After a few we'll probably be able to streamline some things to reduce boilerplate.

kparlante commented 9 years ago

@dannycoates

I think we definitely want able.choose to emit an implicit event to record the choice. I think beyond that it's probably up to the app devs and experimenters to figure out which events will work and if new ones should be added.

Agreed; that should work as long as the call to choose() is aligned with the user actually seeing the choice. One can imagine scenarios where that isn't true, but presumably a separate event could be logged explicitly if necessary.

I imagine it will take some time to come up with good practices for designing experiments. After a few we'll probably be able to streamline some things to reduce boilerplate.

Agreed.

Anyhow, I like the overall direction. :+1:

shane-tomlinson commented 9 years ago

I see you've directly tied the event[1] to the choice for signInButtonText, which I assume was intentional. This links two things I've been trying to keep separate so far, variables and events, in a way that tightly couples them. My "vision" :gags: so far has been that apps import variables and export events through Able and that experiments do the opposite, while the subject is the common thread between them.

I had to think, stop, think, stop, and then think some more about this. I think I see what you are trying to do: a combination of enabling full-on multivariate testing, keeping all logic related to experiments and defining their success/failure out of the consumer code, and being able to leave experiment harness code in place to enable future experiments.

This is really powerful, and I can see the value for advanced testing. At the same time I'm very worried about complexity w.r.t. configuration and results for novices like myself trying to do a straight AB test with no other experiment interference.

For the common case, I'm trying really hard to convince myself the complexity is necessary, but I haven't been able to. Advanced events (events other than choice made and success) add a layer of indirection I'm not sold on the need for.

It seems like if one event affects multiple experiments (or variables), the same functionality can be achieved by reporting per-experiment/variable events.

So yes, the choice to directly tie the event to the choice for signInButtonText was intentional. For a straightforward AB test, it seems like one variable and one event can be intimately coupled. To me, this feels natural.

Another goal was that app changes that involve Able can be "left in" for future experiments.

For items that are frequently tested, yeah. For items that infrequently change, meh, seems like a smell similar to checking in commented out code. This is a bit orthogonal to how to do the correlation.

For example, able.choose('signInButtonText') doesn't make any reference to a specific experiment so multiple experiments, simultaneously or over time, can change that variable without the app needing to change. I'd like to keep that same property with events.

I can understand multiple experiments being able to define a value for the same variable - e.g., I imagine multiple experiments would be defined to test the best button text in 3 different languages. It's the events I'm not convinced of.

I think we definitely want able.choose to emit an implicit event to record the choice

I think this is the right way to go, I'm wondering how this will play out once results are gathered.

For an AB test we are comparing two or more variations of a single variable. If each variation is selected roughly an equal number of times, we can just count the total number of "success-like" events. No other events need to be counted.

Feature toggles, where we want to count the % of people that make use of the new feature, are a bit different. We'll have to count the total number of "success" events and divide that by the total number of implicit events. That seems fine.
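
The division described above could be sketched roughly as follows. `conversionRates`, the sample records, and the use of 'choice' as the name of the implicit event are assumptions for illustration; counting distinct subjects rather than raw events is also an assumption, made so that repeated clicks are not double counted.

```javascript
// Hypothetical sketch: conversionRates and the sample records are illustrative.
// Assumes the implicit event emitted by able.choose is named 'choice'.
function conversionRates(records, variable, successEvent) {
  const seen = new Map();      // variation -> Set of subjectIds that saw it
  const succeeded = new Map(); // variation -> Set of subjectIds that succeeded

  const add = (map, key, subjectId) => {
    if (!map.has(key)) {
      map.set(key, new Set());
    }
    map.get(key).add(subjectId);
  };

  for (const record of records) {
    const variation = record.choices[variable];
    if (variation === undefined) {
      continue;
    }
    if (record.event === 'choice') {
      add(seen, variation, record.subjectId);
    } else if (record.event === successEvent) {
      add(succeeded, variation, record.subjectId);
    }
  }

  const rates = {};
  for (const [variation, subjects] of seen) {
    const winners = succeeded.get(variation) || new Set();
    // Count only successes from subjects who also logged the implicit choice.
    const wins = [...winners].filter((id) => subjects.has(id)).length;
    rates[variation] = wins / subjects.size;
  }
  return rates;
}

const sampleRecords = [
  { event: 'choice', subjectId: 'a', choices: { signInButtonText: 'Sign in' } },
  { event: 'choice', subjectId: 'b', choices: { signInButtonText: 'Sign in' } },
  { event: 'choice', subjectId: 'c', choices: { signInButtonText: 'Come on in' } },
  { event: 'choice', subjectId: 'd', choices: { signInButtonText: 'Come on in' } },
  { event: 'signInClicked', subjectId: 'a', choices: { signInButtonText: 'Sign in' } },
  { event: 'signInClicked', subjectId: 'c', choices: { signInButtonText: 'Come on in' } },
  { event: 'signInClicked', subjectId: 'd', choices: { signInButtonText: 'Come on in' } }
];

const sampleRates = conversionRates(sampleRecords, 'signInButtonText', 'signInClicked');
console.log(sampleRates);
```

With this sample data, one of two 'Sign in' subjects and both 'Come on in' subjects clicked, so the rates are 0.5 and 1 respectively.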

:beers:

shane-tomlinson commented 9 years ago

Nearly 3 months later...

@dannycoates - I've come around to the separation of the variables and events, as you have outlined. Now I have a lot of questions.

Where we left off:

Danny's proposed experiment definition

module.exports = {
  name: 'signInButtonTextMatters',
  hypothesis: 'The sign in button text affects signins',
  startDate: '2015-01-01',
  subjectAttributes: ['lang'],
  independentVariables: ['signInButtonText'],
  eligibilityFunction: function (subject) {
    return /en-US/.test(subject.lang);
  },
  groupingFunction: function (subject) {
    return {
      signInButtonText: this.uniformChoice([
        this.defaults.signInButtonText,
        'Come on in'
      ])
    };
  },
  events: ['signInClicked']
};

Proposed event stream

{
  event: 'signInClicked',
  time: 1422303764406,
  experiment: 'signInButtonTextMatters',
  subjectId: '6feaa3f2fba05421da38003a6dba8f7a',
  choices: {
    signInButtonText: 'Come on in'
  },
  data: {
    termsAccepted: true
  }
}

Is choices an object to allow multiple choices to be reported by one experiment? Correlating multiple events to the same user requires joining events on subjectId. Querying by event or experiment is straightforward.

I was thinking about the event stream in reverse, where the experiment is reported at the top level, and a stream of events are attached to it.

{
  experiment: 'signInButtonTextMatters',
  subjectId: '6feaa3f2fba05421da38003a6dba8f7a',
  choices: {
    signInButtonText: 'Come on in'
  },
  events: [
    {
       event: 'choice',
       time: 1422303754683
    },
    {
      event: 'signInClicked',
      time: 1422303764406,
      data: {
        termsAccepted: true
      }
    }
  ]
}

Organizing the results this way makes it easy to see the entire subject's event stream for a given experiment and say "In the signInButtonTextMatters experiment, X number of people saw the choice, Y number of people clicked the sign in button"

I suppose really, either format can be transformed into the other.
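
As a rough illustration of that equivalence, the two shapes can be converted back and forth with a pair of small helpers. `bundle` and `unbundle` are hypothetical names, and the sketch assumes a subject's choices stay the same for the life of an experiment.

```javascript
// Hypothetical sketch: bundle/unbundle are illustrative helpers, not Able APIs.

// Group standalone records by (experiment, subjectId) into the bundled shape.
function bundle(records) {
  const groups = new Map();
  for (const { event, time, data, experiment, subjectId, choices } of records) {
    const key = JSON.stringify([experiment, subjectId]);
    if (!groups.has(key)) {
      groups.set(key, { experiment, subjectId, choices, events: [] });
    }
    const entry = { event, time };
    if (data !== undefined) {
      entry.data = data;
    }
    groups.get(key).events.push(entry);
  }
  return [...groups.values()];
}

// Expand each bundled entry back into one standalone record per event.
function unbundle(bundles) {
  return bundles.flatMap(({ events, ...shared }) =>
    events.map((entry) => ({ ...entry, ...shared }))
  );
}

const unbundled = [
  {
    event: 'choice',
    time: 1422303754683,
    experiment: 'signInButtonTextMatters',
    subjectId: '6feaa3f2fba05421da38003a6dba8f7a',
    choices: { signInButtonText: 'Come on in' }
  },
  {
    event: 'signInClicked',
    time: 1422303764406,
    experiment: 'signInButtonTextMatters',
    subjectId: '6feaa3f2fba05421da38003a6dba8f7a',
    choices: { signInButtonText: 'Come on in' },
    data: { termsAccepted: true }
  }
];

const bundled = bundle(unbundled);
const roundTripped = unbundle(bundled);
console.log(JSON.stringify(bundled, null, 2));
console.log(JSON.stringify(roundTripped, null, 2));
```

Grouping on subjectId in `bundle` is the same join that the unbundled format would otherwise need at query time.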

dannycoates commented 9 years ago

I think both formats have nice properties. The thing I like about the first one (the unbundled stream) is that each event can stand on its own so is more stream-like. The second is more compact, which would be better if we report events in bursts. Since they are equivalent we could do either depending on how we choose to transmit them. It seems like the bundled format fits how we're currently doing metrics.

I started sketching something up last week https://github.com/dannycoates/abatar/commit/66231e8e78a23eef5929cc4454d358fd1afdd6a8

I should have something usable this week that we can play around with.

vladikoff commented 9 years ago

Nearly 3 months later...

(I didn't read this whole thread...)

More months later, we are thinking of removing able.report() from the content server and keeping track of the experiment states ourselves. There is also an option where, instead of removing it, we drop it from our DataDog tags and report able.choose data as its own events, but I'm trying to find the value in that.