benetech / VideoDeduplication

GNU General Public License v3.0

Architect design for 'shared fingerprint database' connections #239

Closed johnhbenetech closed 3 years ago

johnhbenetech commented 3 years ago

This task is to begin tracking design decisions for implementation of the 'shared fingerprint' database and related interactions.

User stories for JusticeAI user:

Benetech requirements:

User stories for JusticeAI user receiving results:

Considerations:

stepan-anokhin commented 3 years ago

Contents:

Common Considerations

Decentralized Infrastructure

Whenever possible we shouldn't enforce a centralized architecture:

What does it imply in our case?

We Should Split Data Processing and External Repository Access

In this context, data processing means match detection between local and external fingerprints, and repository access means pushing and pulling fingerprints to and from external repositories. Assuming data processing is offloaded to the clients, we should support a workflow that separates data processing from repository access:

All Data Should Be Available Locally

Our customers use different workflows (e.g. with and without webui+database). In the webui+database workflow, all the source data and all processing results must be available in the local database if we want to display them via the frontend.

We Should NOT Require Internet Connection of Highly Secured Nodes

For security reasons we should not require highly secured nodes (those with access to sensitive data) to have internet access in order to push or pull fingerprints to or from external repositories:

We should support an offline-only workflow in which the application is not required to have internet access in order to push/pull fingerprints.

This could be done by packaging fingerprints:

This may work as follows:

We Should Use Hashes as Remote File IDs

We cannot use any other information to reliably link local files to remote repository entries:
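To illustrate hash-based file IDs, here is a minimal sketch of computing a SHA-256 digest for a local video file. The function name `file_sha256` and the chunk size are assumptions made for illustration; the only detail taken from this thread is that sha256 is the hash used in local storage.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks.

    Hypothetical helper: reading in chunks keeps memory use flat
    even for large video files.
    """
    digest = hashlib.sha256()
    with path.open("rb") as stream:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because the digest depends only on file contents, two collaborators holding the same file derive the same ID without exchanging paths or other metadata.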

Fake Fingerprints Problem

We should keep in mind that if clients can read all the fingerprints from remote repositories, they may abuse this information:

As we have just a handful of trusted customers (interested in long-term collaboration) we can simply accept this risk.

Organizing Fingerprint Repository

Synchronization Strategy

Synchronization between local data and remote repository could be achieved as follows:

A remote repository entry may look like this:

Contributor ID | Fingerprint | Hash | Serial ID (int)
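One way to picture synchronization against entries of this shape is an incremental pull keyed on the serial ID. The sketch below is an assumption, not the agreed design: it uses an in-memory SQLite database as a stand-in for the remote repository, and the table name, column names, and the `push`/`pull` helpers are all hypothetical.

```python
import sqlite3

# Hypothetical schema mirroring the entry fields above.
SCHEMA = """
CREATE TABLE fingerprints (
    serial_id   INTEGER PRIMARY KEY AUTOINCREMENT,
    contributor TEXT NOT NULL,
    sha256      TEXT NOT NULL,
    fingerprint BLOB NOT NULL,
    UNIQUE (contributor, sha256)
);
"""


def push(conn, contributor, entries):
    """Insert (sha256, fingerprint) pairs; the repository assigns serial IDs.

    Entries already pushed by the same contributor are skipped.
    """
    conn.executemany(
        "INSERT OR IGNORE INTO fingerprints (contributor, sha256, fingerprint) "
        "VALUES (?, ?, ?)",
        [(contributor, sha, fp) for sha, fp in entries],
    )
    conn.commit()


def pull(conn, last_serial_id):
    """Return entries with a serial ID above the last one this client saw."""
    cursor = conn.execute(
        "SELECT serial_id, contributor, sha256, fingerprint FROM fingerprints "
        "WHERE serial_id > ? ORDER BY serial_id",
        (last_serial_id,),
    )
    return cursor.fetchall()


# Demo: a client pulls everything once, then only what is new.
remote = sqlite3.connect(":memory:")
remote.executescript(SCHEMA)
push(remote, "org-a", [("aa" * 32, b"\x01\x02")])
push(remote, "org-b", [("bb" * 32, b"\x03\x04")])
first_batch = pull(remote, last_serial_id=0)
last_seen = first_batch[-1][0]
push(remote, "org-a", [("cc" * 32, b"\x05")])
second_batch = pull(remote, last_seen)
```

The client only needs to remember one integer per repository (the highest serial ID it has seen) to avoid re-downloading entries.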

Alternative Repository Implementations

There are different ways in which the remote repository could be implemented.

Smart Repository as Trusted Third-Party

This is a "pessimistic" approach to organizing a repository:

Pros:

Cons:

Simple Repository

This is a "realistic" approach:

Pros:

Bare Database

Here is an "optimistic" approach:

Pros:

Cons:

Organizing Data Processing

We need to update the local database schema to keep track of remote data sources.

There are at least two approaches:

Re-using existing types: reuse

Pros:

Cons:

Explicit distinction between local and remote files: explicit

Pros:

Cons:

Collaboration Support

We will need to create command-line tools and UI elements for the following operations:

stepan-anokhin commented 3 years ago

Notes on Possible Implementation.

There are two big parts of required implementation efforts:

A. Implementing Remote Repository.

As we discussed at the last meeting, we will initially go for the Bare Database option, and then probably (under favorable circumstances) implement the Simple Repository variant (see the previous comment).

Implementation effort required for the Bare Database variant:

  1. Develop the database schema and contributor access privileges (task #256):
    1. For each contributed fingerprint the schema must include the following information:
      1. Serial ID (globally unique, or unique per contributor, to facilitate efficient pulling).
      2. Original video file hash (ideally the same as in local storage: sha256).
      3. Contributor unique identifier (if there are multiple contributors per database).
    2. Ideally contributors must have privileges to perform CRUD operations on their own records and only select privileges over other contributors' records.
      1. With multiple contributors per database and a single-table approach, we can grant all contributors insert and select privileges, but not update privileges.
      2. With a single contributor per database, we will need to create multiple DBs, each with CRUD permissions for one user and select permission for the others.
  2. Once we have decided how to organize the bare database storage, we need to implement a minimal tool set (a bunch of scripts) to support the bare database approach (task #257):
    1. At the very minimum the CLI tools must support the following operations:
      1. Set up a fresh database and apply the required schema.
      2. Add a new user and grant all required permissions.
    2. This can be done as a Python script using boto3 behind the scenes (if we are going to deploy RDS).
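As a rough sketch of the "add a new user and grant all required permissions" step for the single-table variant, the helper below generates PostgreSQL statements that give a contributor insert and select privileges but no update or delete. Everything named here is an assumption: the function `contributor_setup_sql`, the `fingerprints` table, and the `serial_id` column are hypothetical, the database is assumed to be PostgreSQL, and `serial_id` is assumed to be a `SERIAL` column whose backing sequence uses the default `<table>_<column>_seq` name. Provisioning the RDS instance itself is not covered.

```python
import re
from typing import List

# Conservative whitelist for identifiers interpolated into SQL text.
_IDENTIFIER = re.compile(r"[a-z_][a-z0-9_]*")


def contributor_setup_sql(role: str, table: str = "fingerprints",
                          serial_column: str = "serial_id") -> List[str]:
    """Build statements that create a contributor role with insert/select only.

    Hypothetical helper; a password would be set separately.
    """
    for name in (role, table, serial_column):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe SQL identifier: {name!r}")
    return [
        f"CREATE ROLE {role} LOGIN;",
        # No UPDATE or DELETE: contributors can add and read, not rewrite.
        f"GRANT SELECT, INSERT ON TABLE {table} TO {role};",
        # Inserting into a SERIAL column calls nextval() on its sequence,
        # which requires USAGE on that sequence.
        f"GRANT USAGE ON SEQUENCE {table}_{serial_column}_seq TO {role};",
    ]
```

Table-level grants like these cannot express "CRUD on own records only"; that stricter rule would need something like one database per contributor, as described in the list above, or a row-level mechanism.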

@johnhbenetech My strong concern here is that this approach may easily turn into GIGO (garbage in, garbage out) if we are not very cautious.

As it is not clear whether we'll go for the Simple Repository option, we can skip its details for now.

B. Organizing Remote Fingerprint Processing and Repository Access from Client Code.

To fully support remote fingerprint matching we need at least the following features:

  1. Support basic remote repository management (tasks #251, #258)
    1. Add/delete remote repository
    2. List remote repositories
    3. Show repository details.
  2. Support the online workflow (assuming the remote repository is network accessible) (tasks #252, #259):
    1. Pull list of remote repository contributors.
    2. Pull fingerprints from remote repository
    3. Pull fingerprints of the particular contributor
  3. Support remote fingerprint processing (tasks #254, #261):
    1. Find matches for all remote fingerprints
    2. Find matches for particular repository
    3. Find matches for particular repository contributor.
  4. Support collaboration actions (tasks #253, #260):
    1. List remote fingerprints (all/by repository/by contributor)
    2. List remote fingerprint matches (all/by repository/by contributors)
    3. Query local storage and display contents:
      1. Find file by hash
  5. Support the offline workflow (tasks #255, #262):
    1. Package fingerprints from local database
    2. Package fingerprints from CSV
    3. Push fingerprints from package to remote repository
    4. Pull fingerprints from repository and create fingerprint package
    5. Populate local database from fingerprint package
    6. Populate local CSV files from fingerprint package

These features must be available both as command-line tools and as part of the web frontend. Note that only item (2) and, partially, item (5) depend on the remote repository implementation.

Also the above feature list provides a rough idea of how the required effort could be split into separate tasks.
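To make the offline workflow in item 5 more concrete, here is a minimal sketch of a fingerprint package that can be written on a secured node, carried to a networked one, and read back. The format (gzip-compressed JSON Lines with base64-encoded fingerprints) and the names `write_package` and `read_package` are assumptions chosen for illustration; no package format has been decided in this thread.

```python
import base64
import gzip
import json
from pathlib import Path
from typing import Iterable, Iterator, Tuple

Entry = Tuple[str, bytes]  # (sha256 hex digest, raw fingerprint bytes)


def write_package(path: Path, entries: Iterable[Entry]) -> None:
    """Write entries as one JSON record per line into a gzip file."""
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for sha256, fingerprint in entries:
            record = {
                "sha256": sha256,
                # JSON cannot hold raw bytes, so encode them as base64 text.
                "fingerprint": base64.b64encode(fingerprint).decode("ascii"),
            }
            out.write(json.dumps(record) + "\n")


def read_package(path: Path) -> Iterator[Entry]:
    """Yield (sha256, fingerprint) pairs back from a package file."""
    with gzip.open(path, "rt", encoding="utf-8") as src:
        for line in src:
            record = json.loads(line)
            yield record["sha256"], base64.b64decode(record["fingerprint"])
```

A line-oriented format lets both sides stream entries without loading the whole package into memory, and the same reader could feed either a local database or CSV files.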

The command-line tool could be organized as a hierarchical script with subcommands.
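A minimal sketch of such a hierarchy using Python's standard `argparse` module is shown here. The program name `fingerprints`, the subcommand names, and the argument names are hypothetical; they only loosely mirror the feature list above and are not an agreed interface.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a two-level command tree, e.g. `fingerprints repo add NAME URL`."""
    parser = argparse.ArgumentParser(prog="fingerprints")
    commands = parser.add_subparsers(dest="command", required=True)

    # `repo` groups repository-management actions.
    repo = commands.add_parser("repo", help="manage remote repositories")
    repo_actions = repo.add_subparsers(dest="action", required=True)
    add = repo_actions.add_parser("add", help="register a remote repository")
    add.add_argument("name")
    add.add_argument("url")
    repo_actions.add_parser("list", help="list known repositories")
    delete = repo_actions.add_parser("delete", help="forget a repository")
    delete.add_argument("name")

    # `pull` fetches fingerprints, optionally for a single contributor.
    pull = commands.add_parser("pull", help="pull remote fingerprints")
    pull.add_argument("repo")
    pull.add_argument("--contributor", default=None)

    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Each top-level group maps to one feature area, so new operations can be added as extra subparsers without changing existing commands.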

@johnhbenetech please review the above list. For the UI part I can draw some inspiration from existing mockups (like this one), but any detailed mockups or suggestions from you would be extremely helpful. For now I'll just outline the required tasks.