How can we balance privacy with multiple queries?

benjaminsavage commented 3 years ago

Based on my understanding of the proposal, it is possible to execute multiple queries against the trail store. Executing a custom function does not delete/clear the trail store. As such, a single piece of data in the trail store may participate in multiple reports.

My understanding of differential privacy is that the number of distinct queries you can execute against a set of raw data is an important parameter. To achieve an epsilon of X, if you only allow a single query you need to add M noise. If you want to allow two queries against the same underlying data, you'll need to add 2*M noise to each query to achieve the same value X for epsilon.

So the more queries we allow each trail-store event to participate in, the more noise we will have to add. This seems like it could seriously undermine the utility of the API.
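
To make the scaling concrete, here is a minimal sketch. It assumes the Laplace mechanism and basic sequential composition, with a total budget epsilon split evenly across the queries. The function name laplaceScalePerQuery and the sensitivity of 1 are illustrative assumptions, not part of the proposal.

```javascript
// Assumption: Laplace mechanism with basic sequential composition.
// Splitting a total budget `epsilon` evenly across `numQueries` queries leaves
// each query with epsilon / numQueries, so the noise scale grows linearly.
function laplaceScalePerQuery(sensitivity, epsilon, numQueries) {
    if (sensitivity <= 0 || epsilon <= 0 || numQueries < 1) {
        throw new RangeError('sensitivity and epsilon must be positive, numQueries >= 1');
    }
    var epsilonPerQuery = epsilon / numQueries;
    return sensitivity / epsilonPerQuery;
}

var oneQuery = laplaceScalePerQuery(1, 1.0, 1);   // noise scale M
var twoQueries = laplaceScalePerQuery(1, 1.0, 2); // noise scale 2 * M
console.log(oneQuery, twoQueries);
```

Under this simple accounting, every additional report an event can participate in raises the noise on all of them.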

Noeda commented 3 years ago

I think my initial assessment after working through this is that SPURFOWL makes it a lot easier to defeat differential privacy. Attempts to change details about SPURFOWL also make SPURFOWL much less useful, but I could be wrong. Let me write out my thoughts.

Let's say you are computing last-click attribution with SPURFOWL and the trail store at time T1 contains this:

+------------+-------------+------------+-------------+
|   type     |  site       |  adv       |  timestamp  |
+------------+-------------+------------+-------------+
| impression |  news.com   |  shoes.com | 2020-12-03  |
| impression |  news.com   |  shoes.com | 2020-12-04  |
| impression |  news.com   |  shoes.com | 2020-12-05  |
|  click     |  news.com   |  shoes.com | 2020-12-05  |
| site visit |  shoes.com  |  shoes.com | 2020-12-05  |
+------------+-------------+------------+-------------+

We run a last-click attribution model. The reporting function runs at 2020-12-06 and reports no conversions, because the user only visited the site and did not convert.

Now at time T2 we have:

+------------+-------------+------------+-------------+
|   type     |  site       |  adv       |  timestamp  |
+------------+-------------+------------+-------------+
| impression |  news.com   |  shoes.com | 2020-12-03  |
| impression |  news.com   |  shoes.com | 2020-12-04  |
| impression |  news.com   |  shoes.com | 2020-12-05  |
|  click     |  news.com   |  shoes.com | 2020-12-05  |
| site visit |  shoes.com  |  shoes.com | 2020-12-05  |
| site visit |  shoes.com  |  shoes.com | 2020-12-07  |
| conversion |  shoes.com  |  shoes.com | 2020-12-07  |
+------------+-------------+------------+-------------+

If the reporting function now runs again, it'll report one conversion.

HOWEVER, this only works if we don't throw away the trail store data. If, for privacy reasons, we deleted the trail store after each report, we'd have this trail store at time T2:

+------------+-------------+------------+-------------+
|   type     |  site       |  adv       |  timestamp  |
+------------+-------------+------------+-------------+
| site visit |  shoes.com  |  shoes.com | 2020-12-07  |
| conversion |  shoes.com  |  shoes.com | 2020-12-07  |
+------------+-------------+------------+-------------+

No attributed conversion is recorded, because the information about impressions and clicks was lost when we sent the report at time T1.

The latter approach avoids the issue with losing differential privacy, but it is less useful for building interesting reports.
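
The three scenarios above can be replayed with a small sketch. It assumes the trail store is an array of {type, site, adv, timestamp} objects with ISO date strings, and the helper names findLastClick and countAttributedConversions are hypothetical, not part of SPURFOWL.

```javascript
// Hypothetical helper: the most recent click for the conversion's advertiser
// that happened on or before the conversion, or null if none is in the store.
function findLastClick(trailStore, conversion) {
    var last = null;
    for (var i = 0; i < trailStore.length; ++i) {
        var e = trailStore[i];
        var isCandidate = e.type === 'click' && e.adv === conversion.adv &&
            e.timestamp <= conversion.timestamp; // ISO dates compare as strings
        if (isCandidate && (last === null || e.timestamp >= last.timestamp)) {
            last = e;
        }
    }
    return last;
}

// Counts conversions that last-click attribution can credit to some click.
function countAttributedConversions(trailStore) {
    return trailStore.filter(function (e) {
        return e.type === 'conversion' && findLastClick(trailStore, e) !== null;
    }).length;
}

var trailAtT1 = [
    {type: 'impression', site: 'news.com', adv: 'shoes.com', timestamp: '2020-12-03'},
    {type: 'impression', site: 'news.com', adv: 'shoes.com', timestamp: '2020-12-04'},
    {type: 'impression', site: 'news.com', adv: 'shoes.com', timestamp: '2020-12-05'},
    {type: 'click', site: 'news.com', adv: 'shoes.com', timestamp: '2020-12-05'},
    {type: 'site visit', site: 'shoes.com', adv: 'shoes.com', timestamp: '2020-12-05'}
];
var newEvents = [
    {type: 'site visit', site: 'shoes.com', adv: 'shoes.com', timestamp: '2020-12-07'},
    {type: 'conversion', site: 'shoes.com', adv: 'shoes.com', timestamp: '2020-12-07'}
];
var trailAtT2 = trailAtT1.concat(newEvents);

console.log(countAttributedConversions(trailAtT1)); // 0: no conversion yet
console.log(countAttributedConversions(trailAtT2)); // 1: click is still in the store
console.log(countAttributedConversions(newEvents)); // 0: store was cleared at T1
```

The last call is the cleared-store case: the conversion is present, but the click it would be attributed to is gone.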

I'm trying to brainstorm some ideas around this. Maybe:

1) Send reports rarely, for example once a week. Have a time limit of something like (7*5 = 35) days of data in the trail store. You could still query the same data multiple times, but the number of queries would be capped at a low number. It would be a serious tradeoff to only see full report data with a one-week delay.

2) The report function could abstain from sending a report. The sandboxed JavaScript trailMap() in SPURFOWL could return false instead of a report to skip sending it, leaving an opportunity to try to send it again later. One possible trigger could be that you send all the reports after the first conversion (or attributed conversion). This will still make it harder to make interesting reports, but it is not as bad as dropping all the data randomly.

function trailMap(trail_store) {
    var is_conversion_in_trail = false;
    var number_of_conversions = 0;

    for (var i = 0; i < trail_store.length; ++i) {
        // ... (other per-event processing elided)
        if (trail_store[i]['type'] === 'conversion') {
            is_conversion_in_trail = true;
            number_of_conversions += 1;
        }
    }

    // Send the report if a conversion was found anywhere in the trail.
    if (is_conversion_in_trail) {
        // Ask the trail store to clear the data. The 'report' key is an
        // assumed name, not one defined by SPURFOWL.
        return {
            'clear_data': true,
            'report': {'conversions': {'value': number_of_conversions}}
        };
    }

    // Abstain from sending a report otherwise.
    return false;
}

3) Combine 1) and 2) somehow, maybe by letting trailMap() know the "budget" for using the data in the trail store. Let the reporting agency adjust how many days to keep data, in exchange for how often a report can be sent. Let trailMap() decide whether to clear the data or not (in exchange for being able to run more queries). At this point things would start to get quite complicated for users of SPURFOWL. But this kind of approach would give control over all the trade-offs to the reporting agency, even if it is complicated.
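
One way to picture options 1) and 3) together: the retention window and the report interval jointly cap how many reports a single event can appear in, and that cap could be handed to trailMap() as a budget. The sketch below is only an illustration under those assumptions. maxQueriesPerEvent, trailMapWithBudget, the budget argument with its queries_remaining field, and the 'report' key are all hypothetical names, not part of SPURFOWL.

```javascript
// Hypothetical: with a report every `reportIntervalDays` and events kept for
// `retentionDays`, one event can appear in at most this many reports.
function maxQueriesPerEvent(retentionDays, reportIntervalDays) {
    return Math.ceil(retentionDays / reportIntervalDays);
}

// Hypothetical budget-aware variant of trailMap(); `budget` is an assumed argument.
function trailMapWithBudget(trail_store, budget) {
    var number_of_conversions = trail_store.filter(function (e) {
        return e.type === 'conversion';
    }).length;

    // Abstain while there is nothing to report and later queries are still allowed.
    if (number_of_conversions === 0 && budget.queries_remaining > 1) {
        return false;
    }

    // Otherwise send the report and ask the trail store to clear the data:
    // either a conversion was found or this is the last allowed query.
    return {
        'clear_data': true,
        'report': {'conversions': {'value': number_of_conversions}}
    };
}

var weeklyCap = maxQueriesPerEvent(35, 7); // 5 reports per event at most
console.log(weeklyCap);
console.log(trailMapWithBudget([{type: 'click'}], {queries_remaining: weeklyCap}));
```

With weekly reports and 35 days of retention, the cap is 5, which is the number any noise calculation would have to account for.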

As of writing this, I see it as a bit of an unsolved issue.

benjaminsavage commented 3 years ago

I realize that this dramatically reduces the set of use-cases which can be supported, but if we wanted to specialize this to just focus on attribution...

One option is to only evaluate the arbitrary function once a conversion event is received, and clear the trail store at that time.
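
A minimal sketch of that option, assuming a trail store that receives events one at a time. createTrailStore, record, size, and reportFn are hypothetical names used only for illustration.

```javascript
// Hypothetical trail store that evaluates the reporting function only when a
// conversion event arrives, then clears itself so that no event can take
// part in a second report.
function createTrailStore(reportFn) {
    var events = [];
    return {
        record: function (event) {
            events.push(event);
            if (event.type !== 'conversion') {
                return null; // no evaluation and no report for other events
            }
            var report = reportFn(events.slice()); // pass a copy of the trail
            events = [];                           // clear after the single query
            return report;
        },
        size: function () {
            return events.length;
        }
    };
}

var store = createTrailStore(function (trail) {
    return {'events_seen': trail.length};
});
var afterClick = store.record({type: 'click', adv: 'shoes.com'});
var afterConversion = store.record({type: 'conversion', adv: 'shoes.com'});
console.log(afterClick, afterConversion, store.size());
```

Each event then participates in at most one query, at the cost of supporting only conversion-triggered reports.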
