NCI-Agency / anet

Advisor Network
MIT License

Epic 2958: Data replication #2958

Open VassilIordanov opened 4 years ago

VassilIordanov commented 4 years ago

Currently ANET is a closed system: two instances of ANET cannot synchronize portions of their data stores.

The following behaviours are desired:

2 instances of ANET with unidirectional synchronization

As a user, I want to be able to use one instance of ANET to create or modify Reports/People/Organizations/etc. and, after a certain delay, see the same entities, with matching UUIDs, on another instance.

Example: ANET instance 1 is on a public network, open to mobile devices, while ANET instance 2 is on a private network, behind a corporate firewall. Users can use ANET instance 1 to draft new reports. After a certain period of time these reports should appear on ANET instance 2. ANET instance 2 can contain a complete dataset - people/positions/organizations/etc. ANET instance 1 should contain a minimal subset of that content, sufficient to draft reports or perform other needed tasks. ANET instance 1 should not contain information that would be valuable to an attacker in a data breach. As an example, if ANET instance 2 has a directory of employees, their personal information, such as names or email addresses, should not be stored in ANET instance 1; instead, their job position should be used to reference them in engagements.
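As a rough illustration of the "minimal subset" idea in this example, the sketch below strips personal fields from a person record before it would be copied to the public-facing instance. The field names (`uuid`, `name`, `emailAddress`, `positionUuid`) and the allowlist are assumptions made for illustration only; they are not taken from the ANET schema.

```python
# Hypothetical allowlist: only these fields may leave the private instance.
PUBLIC_PERSON_FIELDS = ("uuid", "positionUuid")


def redact_person(person: dict) -> dict:
    """Return a copy of a person record that keeps only allowlisted fields.

    An allowlist (rather than a blocklist) means a newly added personal
    field is withheld by default until someone explicitly permits it.
    """
    return {key: person[key] for key in PUBLIC_PERSON_FIELDS if key in person}


if __name__ == "__main__":
    full_record = {
        "uuid": "1a2b",
        "name": "Jane Doe",                 # personal data, must not be copied
        "emailAddress": "jane@example.org", # personal data, must not be copied
        "positionUuid": "9f8e",             # used to reference the person
    }
    print(redact_person(full_record))
```

With this shape, engagements drafted on instance 1 would refer to `positionUuid` only, and instance 2 would resolve the position back to a person.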

Example: ANET instance 1 is a production database, ANET instance 2 is a training database. A training event wants to use ANET instance 2 in such a way that participants can create data that will not pollute the production data, while still being able to receive and observe live updates of the production data.

A single instance of ANET with different versions of the same data entity

Currently data entities in ANET are identified by a UUID. For example, a Person entry has a UUID. In the current design, an entry with a UUID is unique: there can be only one entry for a specific UUID. This creates a "single truth" constraint that can be rather limiting. We want to be able to have 2 versions of an entity referring to the same UUID, and a management mechanism for them, as an enabler to enhance ANET functionality. This request is rather similar to the first request: both deal with potentially different versions of an entity with the same UUID, in the first case in 2 distributed instances and in the second case in the same instance.

Example: Currently, we can have only a single representation of a Location for a given UUID. This results in the following issue regarding editing rights for that Location: if any user of ANET can modify a Location, we end up in a situation where changes are made in an unmanaged or unwanted manner (e.g. users rename the wrong location). If, on the other hand, we restrict modifying or adding a Location to super users (the current behaviour), we end up in a situation where users who have useful information about an inaccuracy in ANET, but have no permissions, are not encouraged to correct or enter it, or worse, pick an inaccurate location because the intended one does not exist yet. The envisioned solution is that such users could propose a new version of a Location, which can subsequently go to a super user for approval. This could obviously apply to all ANET entity objects.

Example: A similar concept could be used to recreate historical snapshots of ANET.

Proposed Approach

Additional context

See related prior work here: https://github.com/NCI-Agency/anet/issues/552. There are also many academic and other references; let's discuss on Slack.

oayvazoglusim commented 4 years ago

For the different versions of the same data entity: another approach may be to use draft tables. Every entity that needs it may have a clone table named "entityname"_draft. Superusers/Admins may decide, through the admin page UI, whether an entity they are going to define is a draft or not. The draft tables may have additional columns such as approved, active, insert_date and last_update_date. Every change to a draft entity results in a new copy being inserted into the draft table. When a draft entity is updated, the "active" column of the old draft entry is set to false and its "last_update_date" is set to the time at which the update occurred. When a new entry is inserted into a draft table, its "active" column is set to true and its "insert_date" is set to the time at which the insertion occurred. Related entries of an entity should be held together by sharing the same UUID.

With this approach the ANET backend may also need only minimal modification. The only difference is that, if the "draft" flag is selected through the UI, the query is modified simply by adding a suffix to the "entity" name, as in "entity_draft". The only complex modification needed is for updating a draft entity, but this approach still seems more acceptable.

When a Superuser/Admin decides to approve a draft, the record in the draft entity table with "active=true" will be copied to the "entity" table. For example, when an entry is approved for a report, it will be copied from the report_draft table to the report table. After an entry is approved, its last entry in the draft table will be updated to "approved=true, active=false, last_update_date=copy time of transaction".
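The draft-table flow described above can be sketched roughly as follows. This is only an illustration using an in-memory SQLite database and a hypothetical `location` / `location_draft` pair; the column set follows the description above, but the table layout, function names, and use of SQLite are assumptions, not ANET's actual schema or backend.

```python
import sqlite3

# Hypothetical schema: a main table plus its "_draft" clone with extra columns.
SCHEMA = """
CREATE TABLE location (uuid TEXT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE location_draft (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    uuid TEXT NOT NULL,
    name TEXT NOT NULL,
    approved INTEGER NOT NULL DEFAULT 0,
    active INTEGER NOT NULL DEFAULT 1,
    insert_date TEXT NOT NULL,
    last_update_date TEXT
);
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def save_draft(conn: sqlite3.Connection, uuid: str, name: str, now: str) -> None:
    """Deactivate the current draft (if any) and insert a new active copy."""
    with conn:  # one transaction: both statements succeed or neither does
        conn.execute(
            "UPDATE location_draft SET active = 0, last_update_date = ? "
            "WHERE uuid = ? AND active = 1",
            (now, uuid),
        )
        conn.execute(
            "INSERT INTO location_draft (uuid, name, insert_date) VALUES (?, ?, ?)",
            (uuid, name, now),
        )


def approve_draft(conn: sqlite3.Connection, uuid: str, now: str) -> None:
    """Copy the active draft into the main table and mark it approved."""
    with conn:
        row = conn.execute(
            "SELECT id, name FROM location_draft WHERE uuid = ? AND active = 1",
            (uuid,),
        ).fetchone()
        if row is None:
            raise LookupError(f"no active draft for {uuid}")
        draft_id, name = row
        conn.execute(
            "INSERT OR REPLACE INTO location (uuid, name) VALUES (?, ?)",
            (uuid, name),
        )
        conn.execute(
            "UPDATE location_draft SET approved = 1, active = 0, "
            "last_update_date = ? WHERE id = ?",
            (now, draft_id),
        )
```

Note that this sketch sidesteps the foreign-key question entirely: the draft table has no references to other entities, which is exactly the part that still needs design work.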

This approach is still just an assumption and can be modified depending on dependencies that we have missed or have not thought about yet.

Advantages

- No need to modify the UUID mechanism.
- Needs minimal backend modifications.
- Easy to track all updates to entities through the draft tables.
- Since the versions will be in the draft tables instead of the main tables, the size of the main tables does not increase unnecessarily.

Disadvantages

- It can be hard to manage the foreign key relations of entities (this still needs to be thought through in detail).

oayvazoglusim commented 4 years ago

epic_2958_ADD_v0.2.docx scenarios_people_epic2958.pdf

ubeyde-rizaoglu commented 3 years ago

ANET - REPLICATION PROCESS.docx

A document showing the main plan for the replication process is attached.

ATILLAM67 commented 3 years ago

My question is: how does this work in a diode environment where there is a high side and a low side? This use case might need to be revisited for such an architecture.

ubeyde-rizaoglu commented 3 years ago

> My question is: how does this work in a diode environment where there is a high side and a low side? This use case might need to be revisited for such an architecture.

The solution is already designed with the diode environment in mind. Transferred data packets will be sent over UDP or SMTP, and the sender does not depend on knowing whether packets have arrived. @ATILLAM67 could you please specify which point of the design is unclear with respect to the diode environment?
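To make the "no dependency on knowing packet arrivals" point concrete, here is a minimal sketch of one-way sending over UDP. The packet layout (a SHA-256 digest line followed by a JSON body carrying a sequence number) and all function names are assumptions for illustration; they are not taken from the attached design document. The sequence number is an added assumption that would let the receiving side notice gaps on its own, since it cannot ask the sender to retransmit.

```python
import hashlib
import json
import socket


def encode_packet(seq: int, entity: dict) -> bytes:
    """Build a self-contained datagram: hex digest, newline, JSON body."""
    body = json.dumps({"seq": seq, "entity": entity}, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(body).hexdigest().encode("ascii")
    return digest + b"\n" + body


def decode_packet(datagram: bytes) -> tuple:
    """Verify the digest and return (seq, entity); raise ValueError if corrupt."""
    digest, _, body = datagram.partition(b"\n")
    if hashlib.sha256(body).hexdigest().encode("ascii") != digest:
        raise ValueError("corrupt packet")
    message = json.loads(body)
    return message["seq"], message["entity"]


def send_one_way(packets: list, host: str, port: int) -> None:
    """Fire-and-forget: send each datagram and never wait for a reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for packet in packets:
            sock.sendto(packet, (host, port))


def missing_sequences(received: list) -> list:
    """Receiver-side gap detection using only what has arrived."""
    if not received:
        return []
    seen = set(received)
    return [n for n in range(min(seen), max(seen) + 1) if n not in seen]
```

Because the receiver can only detect a gap and not request a resend, a real design would still need some out-of-band way to recover lost packets, for example periodic full re-sends.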