Architect design for 'shared fingerprint database' connections

johnhbenetech commented 3 years ago

This task is to begin tracking design decisions for implementation of the 'shared fingerprint' database and related interactions.

User stories for JusticeAI user:

I want the ability to submit my processed videos' fingerprints into a shared database for cross-organizational comparison
I want to be able to start this process from the front end interface (and receive progress updates)
I want to be able to start this process via a script in a terminal

Benetech requirements:

We need a centralized way to receive and store these fingerprints
We need to control who can read/write to this database (possibly by manually assigning keys/credentials)
We need to track which fingerprints came from which sender
We need to run match comparisons across all videos contributed from all partners and store the resulting match entities/scores
Note: this process can look different than the current generate_matches script as it will only be run by benetech on a separate system

User stories for JusticeAI user receiving results:

When exploring my local collection, I would like to also be alerted to the presence of matches in other partner databases. These won't be playable files or have metadata, but should be counted and represented in the UI with the corresponding match score
the representation of the partner file in the UI should have some element that is meaningful to both sides. For example, if Partner A sees a highly matched video from Partner B - they should be able to correspond with Partner B (out of platform) to say 'can you share with me the file XYZ' - where XYZ allows them to identify the file from their collection.

Considerations:

Users will need a place to set authentication credentials/keys
We should think through the possible benefits of a 'syncing' approach that keeps a local mirror of the cloud matches. This way, the user can choose when to go online, and we can do a bulk hash/version check to see whether any new matches are present before actually performing the sync.
I think this local sync approach can be beneficial as well since we wouldn't have to maintain cloud API infrastructure to return matches programmatically from the cloud

stepan-anokhin commented 3 years ago

Contents:

Common Considerations
Organizing Fingerprint Repository
Organizing Data Processing
Collaboration Support

Common Considerations

Decentralized Infrastructure

Whenever possible we shouldn't enforce centralized architecture:

This may help to distribute maintenance costs.
In theory, any organization maintaining a single central repository may face pressure (given our problem domain) from governments or other parties. An inherently centralized architecture may therefore introduce a single point of failure.

What does it imply in our case?

The implementation shouldn't assume there is a single repository.
We should make it as easy as possible for third parties to deploy and maintain their own fingerprint repositories (make it federation-friendly).
We should avoid centralized data processing whenever possible.

We Should Split Data Processing and External Repository Access

In this context, data processing means match detection between local and external fingerprints, and repository access means pushing and pulling fingerprints to/from external repositories. Assuming data processing is offloaded to the clients, we should support a workflow that keeps data processing and repository access separate:

The data processing algorithm should rely only on local data (e.g. living in a local database).
There should be a tool to push/pull fingerprints to/from external repository that doesn't involve processing.
Data processing algorithms shouldn't know how the data was obtained. Only from which source it was obtained.
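
The three points above could be sketched as follows. This is only an illustration under stated assumptions: the names `StoredFingerprint` and `find_matches` are hypothetical, fingerprints are assumed to be float vectors, and cosine similarity with a 0.9 threshold stands in for whatever metric the real match detection uses. The point is that the matcher reads only locally stored records and sees the origin of each record as a plain `source` label.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Iterable, List, Sequence, Tuple


@dataclass(frozen=True)
class StoredFingerprint:
    """A fingerprint as it lives in local storage (hypothetical shape)."""
    sha256: str              # hash of the original video file
    vector: Sequence[float]  # assumption: a fingerprint is a float vector
    source: str              # "local" or a repository/contributor label


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_matches(
    local: Iterable[StoredFingerprint],
    external: Iterable[StoredFingerprint],
    threshold: float = 0.9,  # assumed value, for illustration only
) -> List[Tuple[str, str, str, float]]:
    """Compare local and external records; performs no network access."""
    external = list(external)
    matches = []
    for mine in local:
        for theirs in external:
            score = cosine_similarity(mine.vector, theirs.vector)
            if score >= threshold:
                # The matcher knows the source label, not how data was fetched.
                matches.append((mine.sha256, theirs.sha256, theirs.source, score))
    return matches
```

A separate push/pull tool would be the only component that talks to a repository; it would write `StoredFingerprint`-like records into local storage and never call `find_matches`.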

All Data Should Be Available Locally

Our customers use different workflows (e.g. with and without webui+database). For the webui+database workflow, all the source data and all processing results must be available in the local database if we want to display them via the frontend.

We Should NOT Require Internet Connection of Highly Secured Nodes

For security reasons, we should not require highly secured nodes (those with access to sensitive data) to have internet access in order to push/pull fingerprints to/from external repositories:

This is not a good idea from the security standpoint.
We should not ask our customers to just trust us if we can simply prove that leaks are impossible. If we require that the application with access to the sensitive information must have internet access, then how do we prove that it will not simply upload this information somewhere without the user's consent?

We should support an offline-only workflow in which the application is not required to have internet access in order to push/pull fingerprints.

This could be done by packaging fingerprints:

There should be a tool to collect local fingerprints and create a package file containing only fingerprints and hashes.
There should be a tool to push fingerprints from that package.
There should be a tool to pull fingerprints from remote repository and store them in a package file.
There should be a tool to populate local data storage using this pulled package-file.
The package file format should be human-readable to facilitate manual review.

This may work as follows:

User creates a fingerprint package on the secured node, reviews it, then transfers it (e.g. using a thumb drive) to a node with internet access and pushes it to the remote repository.
User downloads a fingerprint package from a remote repository, transfers it to the secured offline node (e.g. using a thumb drive) to populate local storage, and then runs the match detection algorithm.
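
A minimal sketch of such a package format, assuming JSON as the human-readable encoding. The format, the `format_version` field and the function names `pack_fingerprints` / `unpack_fingerprints` are all hypothetical; nothing here is decided in this thread. The packer copies only the hash and the fingerprint, so file paths or other metadata present in the input never reach the package.

```python
import json
from typing import Dict, List

FORMAT_VERSION = 1                        # hypothetical version marker
PACKAGE_FIELDS = ("sha256", "fingerprint")  # the only fields allowed to leave the node


def pack_fingerprints(entries: List[Dict]) -> str:
    """Serialize fingerprints and hashes only, as reviewable indented JSON."""
    cleaned = [{key: entry[key] for key in PACKAGE_FIELDS} for entry in entries]
    payload = {"format_version": FORMAT_VERSION, "fingerprints": cleaned}
    return json.dumps(payload, indent=2, sort_keys=True)


def unpack_fingerprints(text: str) -> List[Dict]:
    """Parse a package and return its fingerprint entries."""
    payload = json.loads(text)
    if payload.get("format_version") != FORMAT_VERSION:
        raise ValueError("unsupported package format version")
    return payload["fingerprints"]
```

The resulting text can be written to a file, opened in any editor for manual review, and carried between the offline and online nodes.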

We Should use Hashes as Remote File IDs

We cannot use any other information to reliably link local files to remote repository entries:

Any local "synthetic" ids may be lost if the local database is purged.
Any external id (generated by the repository) may be lost if client fails in the middle of data push.
We cannot disclose real file names.
We don't care if duplicate files will result in a single entry in remote repository.
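
For illustration, a file hash suitable as a remote ID can be computed with the standard library. The sketch assumes SHA-256 (mentioned later in this thread as the hash used in local storage); the function name `file_sha256` and the chunk size are arbitrary choices.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large videos are never loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as stream:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because the hash depends only on file contents, it survives a purge of the local database and reveals nothing about the file name, and two copies of the same file map to the same remote entry.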

Fake Fingerprints Problem

We should keep in mind that if clients may read all the fingerprints from remote repositories they may abuse this information:

Customers may push fake entries to pretend they have the same videos as other customers.
Customers may push fake entries to pretend they have similar videos to other customers' videos. In theory this may be abused to apply unfounded pressure in negotiations.

As we have just a handful of trusted customers (interested in long-term collaboration) we can simply accept this risk.

Organizing Fingerprint Repository

Synchronization Strategy

Synchronization between local data and remote repository could be achieved as follows:

Each repository entry has a synthetic integer id.
The id is always incremented when a new entry is inserted.
Client keeps track of the last pulled id.
To synchronize with a remote repo, the client pulls all entries with ids greater than the last pulled id.
We don't care if some records deleted from remote storage remain in local storage.
Remote records cannot be updated.

A remote repository entry may look like this:

Contributor ID	Fingerprint	Hash	Serial ID (int)
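
The strategy above can be sketched with an in-memory stand-in for the repository. The classes `Entry`, `InMemoryRepo` and `Client` are hypothetical and only model the bookkeeping: an append-only store with increasing serial ids, and a client that remembers the last id it pulled.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Entry:
    serial_id: int
    contributor_id: str
    sha256: str
    fingerprint: tuple


class InMemoryRepo:
    """Stand-in for a remote repository: append-only, ids only grow."""

    def __init__(self) -> None:
        self._entries: List[Entry] = []

    def push(self, contributor_id: str, sha256: str, fingerprint: Sequence[float]) -> Entry:
        entry = Entry(len(self._entries) + 1, contributor_id, sha256, tuple(fingerprint))
        self._entries.append(entry)
        return entry

    def pull_after(self, last_id: int) -> List[Entry]:
        return [entry for entry in self._entries if entry.serial_id > last_id]


class Client:
    def __init__(self) -> None:
        self.last_pulled_id = 0          # nothing pulled yet
        self.mirror: List[Entry] = []    # local copy of remote entries

    def sync(self, repo: InMemoryRepo) -> int:
        """Pull only entries newer than the last pulled id; return their count."""
        new_entries = repo.pull_after(self.last_pulled_id)
        self.mirror.extend(new_entries)
        if new_entries:
            self.last_pulled_id = max(entry.serial_id for entry in new_entries)
        return len(new_entries)
```

Since remote records are never updated, a repeated `sync` with no new entries transfers nothing, and the client never needs to re-read entries it already holds.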

Alternative Repository Implementations

There are different ways in which the remote repository could be implemented.

Smart Repository as Trusted Third-Party

This is a "pessimistic" approach to organize a repository:

Fingerprint repository is a REST service that restricts access whenever possible.
Bulk API to push/pull data.
Users may push any fingerprints.
Users cannot pull arbitrary fingerprints.
The repository searches for fingerprint matches in the background (we can reuse the existing Celery setup).
The repository allows a client to pull fingerprints only if they match one of that client's own fingerprints.
Minimalist Web UI to see basic statistics and perform CRUD operations on user accounts (for admins).
Benetech deploys and maintains a single instance of the repository.
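
The pull restriction could look roughly like the sketch below. It is not a REST service and has no background processing; `SmartRepo`, `pull_matched` and the injected `is_match` predicate are hypothetical names, and the matching rule is deliberately left to the caller because this thread does not fix a metric.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Fingerprint = Sequence[float]


class SmartRepo:
    """Toy model of a repository that only reveals matched fingerprints."""

    def __init__(self, is_match: Callable[[Fingerprint, Fingerprint], bool]) -> None:
        self._is_match = is_match
        self._by_owner: Dict[str, List[Tuple[str, Fingerprint]]] = {}

    def push(self, owner: str, sha256: str, fingerprint: Fingerprint) -> None:
        # Any user may push any fingerprint.
        self._by_owner.setdefault(owner, []).append((sha256, fingerprint))

    def pull_matched(self, client: str) -> List[Tuple[str, str]]:
        """Return (owner, sha256) of foreign entries matching the client's own."""
        mine = self._by_owner.get(client, [])
        visible = []
        for owner, entries in self._by_owner.items():
            if owner == client:
                continue
            for sha256, fingerprint in entries:
                if any(self._is_match(fingerprint, own) for _, own in mine):
                    visible.append((owner, sha256))
        return visible
```

A client that has pushed nothing sees nothing, which is the behavior the "users cannot pull arbitrary fingerprints" rule asks for.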

Pros:

Solves the fake fingerprints problem.
Enforces security and data consistency: repository as a trusted third-party to perform secured multi-party computation (cross-organization match detection).

Cons:

Centralized architecture
Highest maintenance cost
Highest implementation cost (~1.5 months)

Simple Repository

This is a "realistic" approach:

Repository is a REST service.
Bulk API to push/pull data.
Users may push any fingerprints.
Users may pull any fingerprints at any time.
Minimalist Web UI to see basic stats and perform CRUD operations on users (for admins).
Simple deployment. There are server and client packages published on https://pypi.org/.
Benetech deploys and maintains an instance. Other organizations may deploy their own repos.

Pros:

Moderate implementation cost (~ 2 weeks)
Moderate maintenance cost (e.g. 1 EC2 instance + RDS service)
Federated infrastructure: anyone can easily deploy an independent repo.

Bare Database

Here is an "optimistic" approach:

Clients push/pull fingerprints to shared database.
Clients can insert and select but not update or delete records.
We provide scripts to push/pull fingerprints.
Benetech deploys RDS service to keep shared fingerprints.

Pros:

Lowest implementation cost (just push and pull scripts)
Lowest maintenance cost? (e.g. just deploy an RDS)
Federated infrastructure: anyone can deploy a new shared DB

Cons:

This approach will not scale if the number of clients grows
Data consistency is not guaranteed (e.g. clients may write wrong/fake contributor ids)
Manual configuration of access rights for each new customer.
Common maintenance operations are performed manually (adding new users, configuring access privileges, etc.)
No standard interface
No visualization, no UI.

Organizing Data Processing

We need to update the local database schema to keep track of remote data sources.

There are at least two approaches:

Re-use existing File entity type for external files
Add new entity types for external files and matches.

Re-using existing types:

Pros:

This will reduce implementation cost (e.g. cluster visualisation will work out of the box with external files)

Cons:

Data consistency will not be guaranteed:
- There will be a lot of attributes that don't make sense for remote files (e.g. exif data, file path), but some of them are required for local files (e.g. file path).
- There will be some attributes that are required for remote files (e.g. contributor id, external serial id) but don't make sense for local files.
- To handle this we will either have to give up some constraints or enforce some ad-hoc rules programmatically (e.g. use "contributor-id:hash" as file path for remote files).

Explicit distinction between local and remote files:

Pros:

Data consistency is guaranteed

Cons:

Special treatment in UI and backend will be required
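
To make the second option concrete, here is a sketch with plain dataclasses. It says nothing about the actual ORM models in the project; `LocalFile`, `RemoteFingerprint`, `RemoteMatch` and the `label` helper are hypothetical names. Each type carries only the attributes that make sense for it, so no constraint has to be relaxed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalFile:
    file_path: str  # required for local files, meaningless for remote ones
    sha256: str


@dataclass(frozen=True)
class RemoteFingerprint:
    repository: str      # which remote repo the entry was pulled from
    contributor_id: str  # required for remote entries only
    serial_id: int       # id assigned by the remote repository
    sha256: str


@dataclass(frozen=True)
class RemoteMatch:
    local: LocalFile
    remote: RemoteFingerprint
    score: float

    def label(self) -> str:
        """Identifier both partners can resolve: contributor plus file hash."""
        return f"{self.remote.contributor_id}:{self.remote.sha256}"
```

The `label` value is the kind of string one partner could send to another out of platform to ask for a specific file, since the owner can look the hash up in their own collection.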

Collaboration Support

We will need to create command-line tool(s) and UI elements for the following operations:

Push/Pull fingerprints to/from remote repo
Pack/Unpack fingerprints (for offline-only mode)
Find local files by hash
Manage Remote Repos and their Credentials
Browse external matches.

stepan-anokhin commented 3 years ago

Notes on Possible Implementation.

There are two big parts of required implementation efforts:

A. Implement a remote repository
B. Organize remote fingerprints processing and repository access from client code to pull/push fingerprints

A. Implementing Remote Repository.

As we discussed at the last meeting, we will initially go for the Bare Database option, and then probably (under favorable circumstances) implement the Simple Repository variant (see the previous comment).

Implementation effort required for the Bare Database variant:

Develop database schema and contributor access privileges (task #256):
1. For each contributed fingerprint the schema must include the following information:
  1. Serial id (globally unique or unique per contributor to facilitate efficient pulling).
  2. Original video file hash (ideally the same as in local storage - sha256)
  3. Contributor unique identifier (if there are multiple contributors per database)
2. Ideally contributors must have privileges to perform CRUD operations on their own records and just select-privileges over other contributors' records.
  1. In case of multiple contributors per single database and single table approach we can provide just insert and select but not update privileges for all contributors.
  2. In case of single contributor per database approach we will need to create multiple DBs with CRUD permissions for one user and select permission for others.
Once we have decided how to organize the bare database storage, we need to implement a minimal toolset (a bunch of scripts) to support the bare database approach (task #257):
1. At the very minimum the CLI tools must support the following operations:
  1. Setup a fresh database, apply required schema
  2. Add a new user and grant all required permissions
2. This can be done as a Python script using boto3 behind the scenes (if we are going to deploy RDS).
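
As a rough sketch of the two CLI operations for the single-table variant, the helper below only generates SQL statements and does not execute them. It assumes a PostgreSQL database (one of the engines RDS offers); the table name, column names and types are hypothetical, and provisioning the RDS instance itself via boto3 is not shown.

```python
import re
from typing import List

TABLE = "fingerprints"  # hypothetical table name


def schema_statements() -> List[str]:
    """Statements for setting up a fresh shared database."""
    return [
        f"CREATE TABLE IF NOT EXISTS {TABLE} (\n"
        "    id BIGSERIAL PRIMARY KEY,      -- serial id used for incremental pulls\n"
        "    contributor TEXT NOT NULL,     -- contributor unique identifier\n"
        "    sha256 CHAR(64) NOT NULL,      -- original video file hash\n"
        "    fingerprint BYTEA NOT NULL,    -- assumed binary encoding\n"
        "    UNIQUE (contributor, sha256)\n"
        ")"
    ]


def add_contributor_statements(role: str) -> List[str]:
    """Statements that create a role with insert/select but no update/delete."""
    # Role names are interpolated into SQL, so accept simple identifiers only.
    if not re.fullmatch(r"[a-z_][a-z0-9_]*", role):
        raise ValueError(f"invalid role name: {role!r}")
    return [
        f"CREATE ROLE {role} LOGIN",  # password would be set out of band
        f"GRANT SELECT, INSERT ON {TABLE} TO {role}",
        # Inserting into a BIGSERIAL column needs access to its sequence.
        f"GRANT USAGE ON SEQUENCE {TABLE}_id_seq TO {role}",
    ]
```

Note that these grants do not stop a contributor from inserting rows under another contributor's identifier, which is the consistency gap already listed among the cons of this variant.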

@johnhbenetech My strong concern here is that this approach may easily turn into GIGO (if we are not very cautious).

As it is not clear if we'll go for Simple Repository option, we can just skip its details for now.

B. Organizing Remote Fingerprint Processing and Repository Access from Client Code.

To fully support remote fingerprint matching we need at least the following features:

Support basic remote repository management (tasks #251, #258)
1. Add/delete remote repository
2. List remote repositories
3. Show repository details.
Support online-workflow (assuming remote repository is network accessible) (tasks #252, #259):
1. Pull list of remote repository contributors.
2. Pull fingerprints from remote repository
3. Pull fingerprints of the particular contributor
Support remote fingerprint processing (tasks #254, #261):
1. Find matches for all remote fingerprints
2. Find matches for particular repository
3. Find matches for particular repository contributor.
Support collaboration actions (tasks #253, #260):
1. List remote fingerprints (all/by repository/by contributor)
2. List remote fingerprint matches (all/by repository/by contributors)
3. Query local storage and display contents:
  1. Find file by hash
Support offline-workflow (tasks #255, #262):
1. Package fingerprints from local database
2. Package fingerprints from CSV
3. Push fingerprints from package to remote repository
4. Pull fingerprints from repository and create fingerprint package
5. Populate local database from fingerprint package
6. Populate local CSV files from fingerprint package

These features must be available both as command-line tool(s) and as part of the Web Frontend. Note that only items (2) and partially (5) depend on the remote repository implementation.

Also the above feature list provides a rough idea of how the required effort could be split into separate tasks.

The command-line tool could be organized as a hierarchical script with subcommands.
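
One possible shape for such a script, sketched with the standard `argparse` module. The program name `fingerprints`, the subcommand names and their options are hypothetical and cover only a few of the features listed above; the real tool may use a different CLI library.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="fingerprints")
    groups = parser.add_subparsers(dest="group", required=True)

    # fingerprints repo add|list
    repo = groups.add_parser("repo", help="manage remote repositories")
    repo_commands = repo.add_subparsers(dest="command", required=True)
    repo_add = repo_commands.add_parser("add", help="register a remote repository")
    repo_add.add_argument("name")
    repo_add.add_argument("address")
    repo_commands.add_parser("list", help="list known repositories")

    # fingerprints pull --repo NAME [--contributor ID]
    pull = groups.add_parser("pull", help="pull fingerprints from a repository")
    pull.add_argument("--repo", required=True)
    pull.add_argument("--contributor", help="limit the pull to one contributor")

    # fingerprints push --repo NAME
    push = groups.add_parser("push", help="push local fingerprints")
    push.add_argument("--repo", required=True)

    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

For example, `fingerprints repo add central <address>` and `fingerprints pull --repo central` would each resolve to one leaf of the hierarchy, and further groups (packaging, matching) could be added the same way.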

@johnhbenetech please review the above list. For UI part I can draw some inspiration from existing mockups (like this one). But any detailed mockups/suggestions from you will be extremely helpful. For now I'll just outline required tasks.
