Watts-Lab / experiment-dispatcher

A tool for picking and launching the next experiment

Group composition and condition selection #3

Open JamesPHoughton opened 2 years ago

JamesPHoughton commented 2 years ago

Background to the problem

I have a set of interventions of varying efficacy, and I want to choose the best one for a given situation. For example, perhaps I have a set of party games and I want to recommend one that will maximize the average happiness of the group.

The different interventions have different levels of effectiveness for different groups. Large groups may prefer "I Spy", and small groups may prefer "Gin Rummy". Groups with a mean age in their 50s may prefer "Charades", while groups with a mean age of 9 may prefer "Go Fish".

What I want to do is build a map that tells me which intervention is the most effective for a given group. For example, given "Group 1" composed of a 5-, 7-, 8-, and 11-year-old, what game should I recommend? We can characterize this group along a number of dimensions, and use those dimensions to say "for groups similar to yours, these games have been successful". For example, we could use group size and mean age as two dimensions that characterize a space, and locate each group at a unique point in that space.
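As a trivial sketch (mine, not the repo's code) of that featurization, each group can be located at a point in a (group size, mean age) space:

```python
def locate(ages):
    """Map a group's member ages to coordinates in the (group size, mean age) space."""
    return (len(ages), sum(ages) / len(ages))

print(locate([5, 7, 8, 11]))  # "Group 1" -> (4, 7.75)
```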

For example, I want a map something like this, which tells me the regions in which each game is optimal:

[image: map of the regions in which each game is optimal]

(There may be other dimensions we want to include in this map down the road to increase its fidelity, for example the range of ages within the group)

Given that I have characterized the space in this way, I want to run experiments to identify the optimal game for each point in the space. The brute force method would be to construct a number of groups with every combination of mean age and group size, and then within each block, randomly assign groups to games. For any given point in the space (say the group of four children "Group 1" above), I would then be able to estimate a distribution for the outcomes of each intervention. Given this, I could compare the distributions to assess the likelihood that "Go Fish" would be more satisfactory than "Charades".
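As a purely illustrative sketch of that within-block comparison (the outcome numbers below are invented, and resampling the observed group outcomes is just one way to do it), we could estimate the probability that one game beats another at a given point:

```python
import numpy as np

def prob_a_beats_b(outcomes_a, outcomes_b, n_draws=10_000, seed=0):
    """Monte Carlo estimate of P(outcome under game A > outcome under game B),
    resampling from observed group-level outcomes within one block."""
    rng = np.random.default_rng(seed)
    a = rng.choice(outcomes_a, size=n_draws, replace=True)
    b = rng.choice(outcomes_b, size=n_draws, replace=True)
    return float(np.mean(a > b))

# Invented happiness scores from groups in the same (size, mean-age) block:
go_fish = np.array([6.1, 7.3, 5.8, 6.9, 7.0])
charades = np.array([5.2, 6.0, 4.9, 6.4, 5.5])
print(prob_a_beats_b(go_fish, charades))  # a high value -> "Go Fish" looks better here
```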

[image: outcome distributions for each game at a single point in the space]

Unfortunately, even if we can run games online and make each sample cheap, this strategy requires a lot of samples, and it gets progressively worse as the number of dimensions we use to characterize a group grows. One strategy we can use is to learn about one point in the space from nearby points, so that a group with a slightly different mean age can inform my estimate, and groups that are increasingly dissimilar to "Group 1" still contribute some information, but less and less. (For example, Monopoly may be inferior for all groups.) This is a form of partial pooling that takes the space into account, which we call "functional" partial pooling. We end up with a distribution predicting outcomes at every point in the space, and these distributions collectively make up an n-dimensional Gaussian process (or other functional probabilistic model).
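As a rough sketch of what that might look like (my assumption, using scikit-learn rather than anything in this repo), we could fit one Gaussian process per game over the (group size, mean age) space, so nearby groups partially pool information:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def fit_game_surfaces(samples):
    """samples: {game: (X, y)} with X = [[group_size, mean_age], ...] and
    y = observed group outcomes. Returns one fitted GP per game."""
    surfaces = {}
    for game, (X, y) in samples.items():
        # Anisotropic RBF with guessed length scales for size and mean age,
        # plus a noise term for outcome variation within a cell.
        kernel = RBF(length_scale=[2.0, 10.0]) + WhiteKernel()
        gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
        surfaces[game] = gp.fit(np.asarray(X), np.asarray(y))
    return surfaces

def predict_at(surfaces, group_size, mean_age):
    """Posterior mean and standard deviation for each game at one point."""
    x = np.array([[group_size, mean_age]])
    return {game: gp.predict(x, return_std=True) for game, gp in surfaces.items()}
```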

[image: predicted outcome distributions at every point in the space]

This strategy lets us use the samples we collect in a statistically efficient manner, but we can do even better by being very strategic about the samples we collect in the first place. For example, let's say that among older groups we can tell that chess is preferable to checkers after only 10 samples of each, but among younger groups there is more variation. We may want to take more samples among younger groups, but stop taking samples among older groups once we are able to confidently make a recommendation. The general process here is to take some data, use it to fit our statistical models, and then make predictions from those models about what we will see in a new (unsampled) group. We then choose the next set of samples to maximize the increase in our ability to make a recommendation for an arbitrary point in the space. We continue this cycle of iterative data collection, modeling, and sample selection until our ability to make a recommendation at the most difficult point in the space exceeds our requirements (or we run out of money). We can adapt our sampling by drawing more from points in the space where we have larger uncertainty (a strategy called "active learning"), and we can also direct samples toward the most promising games for any given location in the space (a strategy known as "Bayesian optimization"). Together, these strategies can tell us a "utility" for each prospective sample (the "acquisition function" in BO terms).
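One illustrative way to turn the fitted surfaces into such a utility (an upper-confidence-bound style acquisition is my assumption here, not something the issue commits to) is to score every candidate (point, game) pair by predicted outcome plus a bonus for uncertainty:

```python
import numpy as np

def utility(surfaces, candidate_points, kappa=1.5):
    """Return {(point, game): score}, where score = posterior mean + kappa * std.
    A high mean favors promising games (Bayesian optimization); a high std
    favors uncertain regions (active learning)."""
    scores = {}
    X = np.asarray(candidate_points, dtype=float)
    for game, gp in surfaces.items():
        mu, sigma = gp.predict(X, return_std=True)
        for point, m, s in zip(map(tuple, X), mu, sigma):
            scores[(point, game)] = float(m + kappa * s)
    return scores

# The single best next sample is then the argmax over candidates:
# best_point, best_game = max(scores, key=scores.get)
```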

Once we have a "utility" for each point and game in the space, we know what the optimal next sample would be. The difficulty is that in our particular setting we have other constraints. The first is that it is difficult for us to reliably assemble arbitrary groups. If my utility function tells me that my optimal next sample is to have a group of people ages 45, 55, and 65 sit down to play backgammon, I can't just instantly draw three random people with these qualifications and have them begin playing. At the very least, they would need to consent to participate and then schedule a time to play together. If I had to wait for this process to unfold, my one-at-a-time, individually scheduled experiment campaign would take millennia.

The problem

Instead, the strategy that seems to be most promising is to invite a large number of people for a specific time, only some of whom may show up. When they arrive, assign them to groups and games in such a way that maximizes our learning, given who arrived and the current utility function. The problem is how to construct these groups and assign them the most informative treatments.

The first wrinkle is that when we collect samples in batches, it isn't optimal to use the unaltered utility function to assign every sample in the batch to a condition. For example, let's say the utility function says that the next most valuable sample is a 20-year-old playing chess against a 40-year-old. If 100 people show up and half of them are 20 and half are 40, then using the utility function naively we'd end up with 50 chess games, each pairing a 20-year-old with a 40-year-old. We would oversample that point, driving the variance down far below what we need to make a recommendation about this particular group playing chess, and we would waste a lot of samples getting more precision than we need. A more useful strategy would be to have at least one game at the peak of the utility function, but then also have some pairs play checkers, join groups of 4 to play Charades, etc. Essentially, we have to knock down the utility function for the (n+1)th game assignment to reflect how much we expect to gain at each location from our first n game assignments. A simple way to do this would be a distance-weighted kernel centered on the chosen point, which we multiply against the utility function to derive the utility function for the next sample. It won't be perfectly optimal, but it's reasonable.
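Here is a small sketch of that discounting step, interpreting the distance-weighted kernel as a multiplicative discount of one minus an RBF bump centered on each point already chosen in the batch (the length scale is an arbitrary placeholder):

```python
import numpy as np

def discounted_utility(scores, chosen_points, length_scale=3.0):
    """scores: {(point, game): utility}; chosen_points: points already assigned
    in this batch. Knocks the utility down near chosen points, leaving distant
    candidates essentially untouched."""
    out = {}
    for (point, game), score in scores.items():
        factor = 1.0
        for chosen in chosen_points:
            d2 = sum((a - b) ** 2 for a, b in zip(point, chosen))
            factor *= 1.0 - np.exp(-d2 / (2.0 * length_scale ** 2))
        out[(point, game)] = score * factor
    return out
```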

[image: utility function discounted around an already-chosen sample]

The second wrinkle is more human. If we schedule a 2pm experiment, participants will arrive according to some distribution (probably exponential) between 2pm and 2:15pm. As participants arrive, they will complete screening and eventually be ready to assign to a group. While they are waiting in the lobby, we are paying them for their time, and they are at risk of dropping out, both of which are costly. As a result, we want to balance the goal of forming the optimal set of groups against the cost of paying people for time that we are not getting value (data) out of, and against the expected cost of getting no data at all from them if they drop out. So we have a real-time combinatorial optimization problem in need of a solution.
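Purely to make that trade-off concrete (a naive baseline, not a proposed solution to the combinatorial problem), a per-decision rule might compare the value of waiting for a better grouping against the expected cost of keeping people in the lobby:

```python
def worth_waiting(current_group_utility, better_group_utility, extra_wait_minutes,
                  pay_per_minute=0.20, dropout_prob=0.05, dropout_cost=5.00):
    """Return True if holding off on forming the current candidate group, in the
    hope of a better one, is expected to pay off. The dollar figures and the
    dropout probability are illustrative placeholders."""
    expected_wait_cost = (extra_wait_minutes * pay_per_minute
                          + dropout_prob * dropout_cost)
    return better_group_utility - current_group_utility > expected_wait_cost
```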

The last wrinkle is in our causal inference. If we were able to take the populations of persons eligible to fill each slot in our highest-utility game sample and randomly draw from those populations, we could be confident that this sample contributes to an unbiased estimator over all possible combinations drawn from those populations. However, if we are making combinations of the individuals who happen to show up, we need to be able to demonstrate that we aren't introducing biases into the grouping that would make us doubt our conclusions.

Assume we have:

What we want:

Every time a participant arrives, or every n seconds (whichever is more frequent), we want to send the desired algorithm:

Then we want the desired algorithm to reply with:

The algorithm may return zero groups if it thinks the optimal course is to wait for more participants to show up. The algorithm could be stateless (i.e., calculate everything from fresh information on each request), or it could precompute some likely candidates from previously submitted data and return them when a new request comes in. The algorithm continues until all players have been assigned or have dropped out.
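Purely as a hypothetical illustration of the shape of that exchange (the field names below are my guesses, not the issue's specification), the request and response might carry something like:

```python
from dataclasses import dataclass, field

@dataclass
class Participant:
    participant_id: str
    age: int
    arrived_at: float  # timestamp, so the dispatcher can weigh lobby waiting costs

@dataclass
class DispatchRequest:
    waiting: list[Participant]  # everyone screened and currently in the lobby
    utility: dict               # current (point, game) -> utility map

@dataclass
class DispatchResponse:
    groups: list[list[str]] = field(default_factory=list)  # participant_ids per group
    games: list[str] = field(default_factory=list)          # game assigned to each group
    # Empty lists mean "wait for more arrivals before forming any group."
```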

Data is collected (this takes 20 minutes or so) and then used to update the statistical models and the utility functions for sampling at each point. Twenty-four hours later, a different batch of participants is invited, and the process repeats.

markwhiting commented 2 years ago

The motivation here feels really targeted at the Turk recruiting context, but more generally this approach lets us consume a convenience sample while avoiding, or at least minimizing, the consequences of that sampling approach.

It might be interesting to see where a good dispatcher sits in terms of something like recovered response surface per participant, compared to a traditional convenience sample or to the best possible sampling of the same population. We might realize that our effort is actually better spent getting closer to more perfect recruiting.