complexdatacollective / Server

A tool for storing, analyzing, and exporting Network Canvas interview data.
http://networkcanvas.com/
GNU General Public License v3.0

Entity Resolution v2. #292

Closed wwqrd closed 3 years ago

wwqrd commented 3 years ago

Entity resolution allows for unified sessions to combine nodes across different networks where they are considered to be the same node. Egos are also incorporated into the node list, as an ego from one network may match a node in the network of a different participant.

Aims

Outline of the proposed system and protocol

System diagram

Resolver v2

Communication example

  1. Resolver() sends pair to User Script (RESOLVE)
  2. User Script sends result to Resolver() (MATCH/REJECT/MAYBE)
     a. If MATCH, Resolver() will update result, and send the next pair to the script (Return to 1.)
     b. If REJECT, Resolver() will update result, and send the next pair to the script (Return to 1.)
     c. If MAYBE, Resolver() will forward request to UI
  3. User manually resolves the two nodes and sends result to Resolver() (MATCH/REJECT)
     a. If MATCH, see 2a.
     b. If REJECT, see 2b.
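
The exchange above can be sketched as a single round trip per pair. This is only an illustration of the message flow, not the code on the branch: `Result`, `resolve_pair`, `example_script`, and the `name`/`age` attributes are hypothetical names chosen for the example, and the script and UI are modelled as plain callables rather than separate processes.

```python
from enum import Enum
from typing import Callable, Dict, Tuple


class Result(Enum):
    MATCH = "MATCH"
    REJECT = "REJECT"
    MAYBE = "MAYBE"


Node = Dict[str, str]
Pair = Tuple[Node, Node]
Responder = Callable[[Pair], Result]


def resolve_pair(pair: Pair, user_script: Responder, ui: Responder) -> Result:
    """One RESOLVE round trip: ask the script first, escalate MAYBE to the UI."""
    result = user_script(pair)          # steps 1 and 2
    if result is Result.MAYBE:          # step 2c: forward the request to the UI
        result = ui(pair)               # step 3: the human answers
        if result is Result.MAYBE:
            raise ValueError("the UI must answer MATCH or REJECT")
    return result                       # caller records it and sends the next pair


def example_script(pair: Pair) -> Result:
    """Hypothetical user script: same name and age match, same name only is unsure."""
    a, b = pair
    if a["name"] != b["name"]:
        return Result.REJECT
    return Result.MATCH if a.get("age") == b.get("age") else Result.MAYBE
```

For instance, `resolve_pair(({"name": "Ann", "age": "30"}, {"name": "Ann", "age": "41"}), example_script, ui)` would call `ui`, because the script alone cannot decide.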

Worked resolution example

A, B, C, D, E, F - nodes
1:  A -> B = MATCH // A|B similarity greater than matchThreshold, so matched
2:  (A|B) -> C = MAYBE(C) // A|B|C similarity between thresholds so human resolution required. human resolves with C attributes.
3:  (A|B|C) -> D = REJECT // A|B|C|D similarity below threshold, so rejected
4:  (A|B|C) -> E = REJECT // A|B|C|E similarity below threshold, so rejected
5:  (A|B|C) -> F = REJECT // A|B|C|F similarity below threshold, so rejected
6:  D -> E = MATCH // D|E similarity greater than matchThreshold, so matched
7:  (D|E) -> F = REJECT // D|E|F similarity below threshold, so rejected
- STARTS OVER -
8:  (A|B|C) -> (D|E) = MAYBE(A|B|C|D|E) // A|B|C|D|E similarity between thresholds so human resolution required. human resolves with combination of all attributes.
9:  (A|B|C|D|E) -> F = REJECT // A|B|C|D|E|F similarity below threshold, so rejected
- STARTS OVER -
10: (A|B|C|D|E) -> F = REJECT // A|B|C|D|E|F similarity below threshold, so rejected
- FINISH -
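
The ordering in the trace above is consistent with a loop that compares each cluster against every later cluster, merges in place on a match, and repeats whole passes until one pass produces no merge. The sketch below is inferred from the trace only and is not taken from the branch; `resolve_all` and `decide` are hypothetical names, and `decide` stands in for the combined script and human decision (MAYBE already resolved to yes or no).

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple

Cluster = FrozenSet[str]
Decide = Callable[[Cluster, Cluster], bool]


def resolve_all(
    nodes: Iterable[str], decide: Decide
) -> Tuple[List[Cluster], List[Tuple[Cluster, Cluster, bool]]]:
    """Merge clusters pass by pass until a full pass makes no merge."""
    clusters = [frozenset([n]) for n in nodes]
    log = []
    changed = True
    while changed:                      # each iteration is one pass ("STARTS OVER")
        changed = False
        i = 0
        while i < len(clusters):
            j = i + 1
            while j < len(clusters):
                left, right = clusters[i], clusters[j]
                merged = decide(left, right)
                log.append((left, right, merged))
                if merged:
                    clusters[i] = left | right   # keep comparing the grown cluster
                    del clusters[j]
                    changed = True
                else:
                    j += 1
            i += 1
    return clusters, log


# Decisions taken from the worked example: these pairs match, everything else is rejected.
MATCHES = {
    (frozenset("A"), frozenset("B")),
    (frozenset("AB"), frozenset("C")),
    (frozenset("D"), frozenset("E")),
    (frozenset("ABC"), frozenset("DE")),
}

clusters, log = resolve_all("ABCDEF", lambda l, r: (l, r) in MATCHES)
```

With these decisions the loop performs the same ten comparisons as the trace, in the same order, and ends with the clusters `A|B|C|D|E` and `F`.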
wwqrd commented 3 years ago

There is a work-in-progress branch at feature/entity-resolution-2.

Current state is:

jthrilly commented 3 years ago

Further discussion on this:

The above approach has one significant limitation, which is that it limits user scripts to dyad-level comparisons. This precludes approaches that would use network structure, or take into account the whole network distribution of attributes.

To address this, I came up with the following tweaked version of the above:

1. Server sends entire dataframe to the python script
2. Script returns a stream of results with the following:
  - 2.1 Send A and B for user intervention. Applies to any pair of nodes whose similarity falls between the script's low and high thresholds.
  - 2.2 Merge A and B, to create AB. Applies to any pair of nodes whose similarity is above the script's high threshold.
  - 2.3 Finished. At the end, the script sends a 'finished' result. All nodes that are not the subject of a message are presumed unmerged.
3. User resolves any 2.1 result visually, producing a further stream of results:
  - 3.1 Not a match. Nothing happens in this case, but we store this metadata so that this match is not presented again.
  - 3.2 Merge A and B, to create AB.
4. Return to step 1 with the latest dataframe 
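
A minimal sketch of this loop follows, under several assumptions that are not part of the proposal: the dataframe is modelled as a list of dicts with an `id` key, `similarity` is a toy attribute-overlap score, a merged row simply combines both rows under a joined id, and `script`, `merge`, and `run` are hypothetical names. A real script would be free to use network structure, because it sees every row at once.

```python
from typing import Callable, Dict, Iterator, List, Set, Tuple

Row = Dict[str, str]
Message = Tuple[str, str, str]   # (kind, id_a, id_b); kind is merge, review or finished


def similarity(a: Row, b: Row) -> float:
    """Toy score: share of non-id attributes of `a` that `b` has with equal values."""
    keys = [k for k in a if k != "id"]
    return sum(a[k] == b.get(k) for k in keys) / len(keys) if keys else 0.0


def script(rows: List[Row], low: float, high: float) -> Iterator[Message]:
    """Step 2: stream results for the whole dataframe, then signal 'finished'."""
    for i, a in enumerate(rows):
        for b in rows[i + 1:]:
            score = similarity(a, b)
            if score > high:
                yield ("merge", a["id"], b["id"])     # 2.2
            elif score >= low:
                yield ("review", a["id"], b["id"])    # 2.1
    yield ("finished", "", "")                        # 2.3


def merge(rows: List[Row], id_a: str, id_b: str) -> List[Row]:
    """Replace rows A and B with one row AB (values from A win; an assumption)."""
    by_id = {r["id"]: r for r in rows}
    merged = {**by_id[id_b], **by_id[id_a], "id": f"{id_a}+{id_b}"}
    return [r for r in rows if r["id"] not in (id_a, id_b)] + [merged]


def run(rows: List[Row], ask_user: Callable[[str, str], bool],
        low: float = 0.5, high: float = 0.9) -> List[Row]:
    rejected: Set[frozenset] = set()      # 3.1: remembered so it is not shown again
    while True:
        merged_any = False
        for kind, id_a, id_b in script(rows, low, high):   # step 1
            if kind == "finished":
                break
            pair = frozenset((id_a, id_b))
            if pair in rejected:
                continue
            if kind == "review" and not ask_user(id_a, id_b):   # step 3
                rejected.add(pair)
                continue
            rows = merge(rows, id_a, id_b)                  # 2.2 or 3.2
            merged_any = True
            break                                           # step 4: restart
        if not merged_any:
            return rows
```

Restarting after every merge keeps the script's view of the data current at the cost of repeated passes; batching several merges per pass would be a reasonable alternative.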

We decided upon reflection to stick with the v1 implementation of this feature for now, but the above approach would (theoretically) address all outstanding issues we have identified.