Programmable Dataflows

Data sharing is central to a wide variety of applications such as fraud detection, ad matching, and research. Yet the lack of data sharing abstractions makes the solution to each data sharing problem bespoke and cost-intensive, hampering value generation. This project contains the source code for programmable dataflows, a programming model for implementing any data sharing problem with a new contract abstraction, which lets people move towards a common sharing goal without violating regulatory, privacy, or preference constraints. The programming model is implemented on top of an intermediary data escrow. (Link to programmable dataflow paper) (Link to data escrow paper)

What are data sharing problems?

We refer to any scenario in which one party wants access to another's data as a data sharing problem. Examples include: advertisers using a data cleanroom to run analyses on joint data; a national patient registry sharing medical data with researchers for causal discovery; and banks pooling data to train a joint fraud detection model.

What makes data sharing problems challenging?

Consider the following data sharing problem: a few banks are interested in pooling credit card transaction data to train more accurate fraud detection models, subject to the following constraints. 1) Pooling their data is not commercially interesting unless every bank has a guarantee that the joint model benefits it, rather than only helping the others. 2) The shared data should be used only for model training and nothing else.

A main challenge of data sharing is that people lack the information to assess whether a dataflow is desirable before it takes place. For example, the banks only want to release a joint model if it meets some accuracy threshold, but they have no way of ensuring that without first sharing the data to train the model. Additionally, once a bank shares its raw data, it has no guarantee that the other banks will use that data only for model training. To preclude adverse consequences, many people default to not sharing.

Solving the challenges with programmable dataflows

We introduce a new contract abstraction that bounds the consequences of each dataflow by making it explicit who contributes data, what computation takes place on that data, who receives the result, and under what conditions. Importantly, it provides this information before an intended dataflow takes place, thus addressing the challenge by helping agents make an informed decision on whether to allow the dataflow. The programming model implements the contract abstraction, enabling people to solve any data sharing problem through a sequence of contract propositions, approvals, and executions.
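As an illustration of the proposal, approval, and execution sequence, here is a minimal sketch in Python applied to the bank example. Every name in it (`Contract`, `propose`, `approve`, `execute`, `majority_baseline_accuracy`) is hypothetical and chosen for this example only; it is not the DataStation API, and the toy accuracy function merely stands in for training a joint fraud detection model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set

# All names below are hypothetical and for illustration only;
# they are not the DataStation API.

Data = Dict[str, List[int]]  # contributor -> its labels (0 = legitimate, 1 = fraud)


@dataclass
class Contract:
    sources: Data                       # who contributes data
    function: Callable[[Data], float]   # what computation runs on that data
    destinations: Set[str]              # who receives the result
    condition: Callable[[float], bool]  # under what condition the result is released
    approvals: Set[str] = field(default_factory=set)

    def approve(self, party: str) -> None:
        """Record approval from one of the contributing parties."""
        if party not in self.sources:
            raise ValueError(f"{party} does not contribute data to this contract")
        self.approvals.add(party)

    def execute(self) -> Dict[str, float]:
        """Run the computation once every contributor has approved."""
        if self.approvals != set(self.sources):
            raise PermissionError("not every contributor has approved the contract")
        result = self.function(self.sources)
        if not self.condition(result):
            return {}  # condition not met: nothing is released
        return {party: result for party in self.destinations}


def majority_baseline_accuracy(data: Data) -> float:
    """Toy stand-in for 'train a joint model and report its accuracy'."""
    labels = [label for party_labels in data.values() for label in party_labels]
    fraud = sum(labels)
    return max(fraud, len(labels) - fraud) / len(labels)


def propose(threshold: float) -> Contract:
    """Propose a contract that releases the score only if it reaches the threshold."""
    return Contract(
        sources={"bank_a": [0, 0, 0, 1], "bank_b": [0, 0, 0, 0, 0, 1]},
        function=majority_baseline_accuracy,
        destinations={"bank_a", "bank_b"},
        condition=lambda score: score >= threshold,
    )


contract = propose(threshold=0.75)
contract.approve("bank_a")
contract.approve("bank_b")
released = contract.execute()
```

In this sketch each bank can inspect the four fields of the contract before approving it, and the result reaches the destinations only when both banks have approved and the release condition holds. Raising the threshold above the computed score makes `execute` return an empty dictionary, so nothing is released.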

Set up the repo

Clone the repo.

git clone https://github.com/TheDataStation/DataStation.git

Run the following command from the root directory to install the necessary packages.

pip install -r requirements.txt

Create the needed directories.

mkdir SM_storage SM_storage_mount

Run a simple application

Here is the code to run a simple data sharing application: share the schema of a CSV file with others.

Use the following configs in data_station_config.yaml:

cpm_path: "example_cpm/share_schema_app.py"
trust_mode: "full_trust"
in_development: True

Execute the script that contains the example application.

python3 -m integration_new.general_full_trust

Alternatively, to access the application through a web UI (FastAPI):

python3 -m server.fastapi_server

Enabling Docker to run functions

Ensure that you have Docker enabled on your machine.

Start Docker on macOS:

open -a Docker

Start Docker on Linux:

sudo systemctl start docker
sudo chmod 666 /var/run/docker.sock
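Before turning off in_development, it may help to confirm that the Docker daemon is actually reachable. The following is a small sketch of one way to check this from Python; the helper name `docker_available` is hypothetical and not part of this repo. It relies only on the standard `docker info` command, which exits with a non-zero status when the daemon cannot be contacted.

```python
import shutil
import subprocess


def docker_available(timeout_seconds: float = 10.0) -> bool:
    """Return True if the docker CLI exists and the daemon answers `docker info`."""
    if shutil.which("docker") is None:
        return False  # CLI not installed or not on PATH
    try:
        completed = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
            check=False,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False  # daemon hung or the binary could not be executed
    return completed.returncode == 0


if __name__ == "__main__":
    print("Docker reachable:", docker_available())
```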

Use the following configs in data_station_config.yaml:

in_development: False

Acknowledgments

This work was supported by the National Science Foundation (NSF) under Grant No. 2040718.
