Entity resolution - Githubissues

wwqrd commented 4 years ago

This is a work in progress of entity resolution, to make current progress more visible and to enable collaboration!

Working features:

Can direct output to a script via stdin
Can read script output (from stdout)
Can parse matches and show an entity diff
Can save a condensed version of those choices (resolutions)
Can export a file with those choices applied
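
As an illustration of the stdout-reading and match-parsing steps above, here is a minimal sketch. It assumes a hypothetical output format of one JSON object per line with `nodes` and `probability` fields; the real script format may differ, and `parseResolverOutput` is an illustrative name, not a function from the Server codebase.

```javascript
// Hypothetical format (an assumption, not the documented one):
//   {"nodes": ["a", "b"], "probability": 0.87}
// one object per line on the script's stdout.
function parseResolverOutput(stdoutText) {
  const matches = [];
  const errors = [];

  stdoutText.split(/\r?\n/).forEach((line, index) => {
    const trimmed = line.trim();
    if (trimmed === '') { return; } // skip blank lines, including a trailing newline

    try {
      const parsed = JSON.parse(trimmed);
      const validPair = parsed !== null
        && Array.isArray(parsed.nodes)
        && parsed.nodes.length === 2
        && typeof parsed.probability === 'number';
      if (!validPair) { throw new Error('unexpected shape'); }
      matches.push({ nodes: parsed.nodes, probability: parsed.probability });
    } catch (err) {
      // Collect the problem instead of throwing, so the caller can report it.
      errors.push({ line: index + 1, message: err.message });
    }
  });

  return { matches, errors };
}
```

Returning malformed lines alongside the valid matches, rather than throwing on the first bad line, lets the app decide whether an empty or partly broken response should stop the resolution.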

Missing features:

[ ] ~In-app dialogs for errors (using window. currently)~
[x] Thresholds are not accounted for (all matches are shown) - This is handled by the script
[x] Delete resolutions feature missing
[x] Handle malformed or empty response correctly
[x] Cannot navigate through matches/resolutions
[x] Auto-resolve implicit matches? If A=B and C=D, then A=D also implies B=D and A=C
[x] session counts are missing (esp new session count)
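
The implicit-match item above (A=B, C=D and A=D together imply B=D and A=C) amounts to grouping nodes by the transitive closure of their matches. One common way to do that is a union-find structure. The sketch below is only an illustration of that technique; `groupResolvedNodes` and the pair format are assumptions, not the resolver's actual implementation.

```javascript
// Groups node ids so that every id linked by a chain of matches
// ends up in the same group. `pairs` is an array of [idA, idB].
function groupResolvedNodes(pairs) {
  const parent = new Map();

  const find = (id) => {
    if (!parent.has(id)) { parent.set(id, id); }
    let root = id;
    while (parent.get(root) !== root) { root = parent.get(root); }
    // Path compression: point every visited id straight at the root.
    let current = id;
    while (parent.get(current) !== root) {
      const next = parent.get(current);
      parent.set(current, root);
      current = next;
    }
    return root;
  };

  pairs.forEach(([a, b]) => {
    const rootA = find(a);
    const rootB = find(b);
    if (rootA !== rootB) { parent.set(rootB, rootA); }
  });

  const groups = new Map();
  Array.from(parent.keys()).forEach((id) => {
    const root = find(id);
    if (!groups.has(root)) { groups.set(root, []); }
    groups.get(root).push(id);
  });
  return Array.from(groups.values());
}
```

With the pairs `[['A', 'B'], ['C', 'D'], ['A', 'D']]` all four ids fall into a single group, so B=D and A=C never need to be resolved by hand.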

Known Issues:

[x] does not automatically export after resolution
[x] If sessions get deleted resolutions are invalid, resolutions should probably also be deleted?
[ ] ~promptIds and stageId are discarded from nodes, could/should they be concatenated?~
[ ] ~Concatenated egoIds in export files (they are concatenated under the hood but this info is discarded because the formatter can't interpret that yet)~ Egos need to be cast as a node type which should be selected in the configuration - attributes from the codebook will be matched by name.
[x] Export screen does not update when a new resolution is created (must refresh page/app)
[x] If a second resolution is attempted, the resolver stream exits immediately

wwqrd commented 4 years ago

Script will handle threshold, but app needs to pass the configuration to the script

wwqrd commented 4 years ago

WIP feedback:

I think we should remove the text "variable" from the first cell. I know it's supposed to be a heading for the column, but I read it as a heading for the row and was confused. I don't think we need it.

the first row is the part that really confused me. these are node IDs, which the researcher might never have seen before (and certainly won't recognize the nodes by). The intent of that row is to provide a shortcut for "use all in this column". I think it would improve things to (1) show a node 'preview' in the header area, using the same logic as Network Canvas (so, render a node in the correct color, with the correct label), and (2) change the labels for the first row selects to "Use all values from this node".

then, I think "not a match" should be given higher visual priority. right now it is covered by the "skip" button, which is at the bottom in the button group. I think perhaps we could put it on the first row as a third select option that greys out the rest of the table when selected. If you don't like that idea, how about making the button row fixed, so that the button is always visible? Either way I think we should change the text to "Not a match".

the "resolve" button is a primary action, and so should be in the primary color.

Mockup:

rebeccamadsen commented 4 years ago

What's the best way to test this? What kind of script path and arguments should I be using?

wwqrd commented 4 years ago

@rebeccamadsen This has a couple more outstanding things to do on it now - I'll write up some detailed instructions when it's ready to go.

wwqrd commented 4 years ago

@rebeccamadsen for the script path I think the best one to use would be Simple.py (https://github.com/codaco/entity-resolution-sample), no arguments, full system path. It will assign random probabilities to node pairs.

Hopefully the rest is relatively self explanatory - but if not let's work on improving the UI so that it is!

Some known issues/questions:

csv/json options don't 100% match up, for example entity resolution casts ego as a node type, so "include ego data" is redundant
generally I think the code could do with some refactoring, but maybe that's a second iteration? Anything obvious please let me know

rebeccamadsen commented 4 years ago

Is the Python file a local file, or do I use the URL? Do I need to quote the filename if there are spaces in the name? How long should it take to run this? I'm thinking I don't have the filename right or something, since I've been waiting for a while.

jthrilly commented 4 years ago

Some questions:

Why does scrolling feel weird in the comparison table? It's super laggy. react-virtualized?
My process exited with no feedback after 2 of 59. 🤷 trying to run the process again gave "something went wrong" in a red box.
When I encountered some of the bugs below, there was no visible feedback that something had gone wrong, so error handling and messaging are important areas to focus on.

Some items:

[x] CHANGE: Update the entity resolution sample to either work with the development protocol, or develop a small example protocol for testing with it.
[x] CHANGE: use the error and info dialogs not the popup style notification or the "something went wrong" red box.
[x] BUG: remove any baked-in assumptions arising from only testing with the development protocol. For example color is not a required property of an entity definition (EntityDiff.js:L31)
[x] CHANGE: after encountering the bug above, the app UI was completely unresponsive. the console showed [ResolverService] Killing process.
[x] BUG: stop ego cast field resetting when process is unsuccessful.
[x] CHANGE: use native select element for ego node type.
[x] CHANGE: add a "browse" button for script path
[x] FEATURE: Richer error messages for all common use-cases. EACCES, etc. First error I got was because my python script wasn't marked as +x.
[x] BUG: layout variables rendered as [object Object] in comparison table
[x] BUG: after showing matching rows, the next button isn't clickable after clicking "not a match", even after selecting "use all", until "not a match" is selected again.
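
For the "richer error messages" item above, a minimal sketch of mapping process-spawn error codes to readable text might look like the following. `ENOENT` and `EACCES` are standard Node.js system error codes; the function name and the message wording are assumptions for illustration only.

```javascript
// Turns an error emitted when starting the resolver script into a
// message a researcher can act on. Wording is illustrative.
function describeSpawnError(error, scriptPath) {
  const code = error && error.code;

  switch (code) {
    case 'ENOENT':
      return `Could not find the interpreter or script ("${scriptPath}"). Check that the path is correct.`;
    case 'EACCES':
      return `Permission denied when running "${scriptPath}". The script may need to be marked as executable (for example, chmod +x).`;
    default:
      return `The resolver script could not be started${code ? ` (${code})` : ''}.`;
  }
}
```

The returned string could then be passed to the error dialog instead of the generic "something went wrong" message.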

rebeccamadsen commented 4 years ago

Gave this a go on windows today, and got a couple of errors. Let me know what I should try to help figure out these (so helpful) errors.

This is when I had the script in a folder that contains a space: [screenshot: space in filename]

So I moved the script to a different folder and got this instead: [screenshot: spawn unknown]
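
It is not clear from the thread what caused these errors. One common source of trouble with spaces in paths is building a single command string and letting a shell split it. The sketch below shows the alternative of passing the script path as its own argument, which Node's `child_process.spawn` accepts as an array. `buildResolverCommand` is a hypothetical helper, not part of the app.

```javascript
// Keeps the interpreter, the script path and the script arguments as
// separate values, so a path containing spaces needs no manual quoting.
function buildResolverCommand(interpreter, scriptPath, scriptArgs = []) {
  return {
    command: interpreter,
    args: [scriptPath, ...scriptArgs],
    options: { shell: false }, // no shell, so arguments are not re-split on spaces
  };
}

// Usage (not executed here):
// const { spawn } = require('child_process');
// const { command, args, options } = buildResolverCommand('python', 'C:\\My Scripts\\Simple.py');
// const child = spawn(command, args, options);
```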

rebeccamadsen commented 3 years ago

A minor question: could it remember that I use "python" instead of "python3" when I start a new resolution? And possibly my resolver script path? If I cancel out of the export screen and come back to export later, it doesn't remember my interpreter.

At first I thought it surprising that if I save a resolution, and leave the export screen and come back, I can't change "Node Type", even if I am starting a new resolution. After some time, I realized these were cumulative, not alternate export options. Not really a request for anything, just a comment on my own ignorance in using a new-to-me feature. I'm not sure if there is any way to make that more clear, or if the documentation will convey that for us.

If it doesn't find any matches, can we export anyway? Right now I have to untick the "Enable entity resolution" option to continue with an export that had no resolutions. But maybe the idea is to let them more easily change settings so they can get matches?

Ego did not export at all? I think? At least the ego file was empty? Or maybe I misunderstand something else here!

rebeccamadsen commented 3 years ago

Also, exporting without using entity resolution gives me errors -- graphml or csv.

jthrilly commented 3 years ago

> Ego did not export at all? I think? At least the ego file was empty? Or maybe I misunderstand something else here!

This one is pretty simple: ego gets merged with the node list as it is considered to be a possible alter in other networks. I bet this will trip up the new network exporter, but we will have to see.

jthrilly commented 3 years ago

> Also, exporting without using entity resolution gives me errors -- graphml or csv.

Probably best to fix this by updating the network exporters at this point. Bit confusing that it errors, though. What is the error you get?

Reason this is strange is that although the format for data has changed, it shouldn't be changing within the entity resolution. So data coming out of entity resolution should be just as incompatible with the old network exporters as data straight from the database.

wwqrd commented 3 years ago

> Reason this is strange is that although the format for data has changed, it shouldn't be changing within the entity resolution. So data coming out of entity resolution should be just as incompatible with the old network exporters as data straight from the database.

The resolver feature sits between where the export manager fetches the network and where it generates the files.

const exportPromise = makeTempDir()
    // ...
    .then(() => {
        if (!enableEntityResolution) {
            // Old implementation
            return this.protocolManager.getProtocolSessions(protocol._id, null, null)
                .then(sessions => sessions.map(formatSessionAsNetwork))
                .then(networks => (useEgoData ? insertEgoInNetworks(networks) : networks))
                .then(networks => (exportNetworkUnion ? [unionOfNetworks(networks)] : networks));
        }

        // resolver implementation
        return this.resolverManager.getNetwork(
            protocol,
            networkOpts,
        );
    })
    .then((networks) => {
        // ... exportFiles()
    });

That means internally it has something like an implementation of:

    return this.protocolManager.getProtocolSessions(protocol._id, null, null)
        .then(sessions => sessions.map(formatSessionAsNetwork))
        .then(networks => (useEgoData ? insertEgoInNetworks(networks) : networks))
        .then(networks => (exportNetworkUnion ? [unionOfNetworks(networks)] : networks));

but simplified: for instance, it always merges the networks, and it implements ego type casting.

The reason it does that is because this.resolverManager.getNetwork() is also used outside of the export process in the resolve endpoint: because resolutions are iterative, we also resolve the network before passing the nodes to the resolver for further resolutions. Effectively an export without file generation.
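
To make the "always merge, always cast ego" behaviour concrete, here is a rough sketch under assumed data shapes. It treats each session network as `{ ego, nodes, edges }`, which is a simplification for illustration. `castEgoAsNode` and `mergeSessionNetworks` are hypothetical names and do not mirror the real `resolverManager.getNetwork()` internals.

```javascript
// Gives ego the node type chosen in the configuration so it can be
// compared with (and resolved against) ordinary nodes.
function castEgoAsNode(ego, egoCastType) {
  return { ...ego, type: egoCastType };
}

// Always produces one merged network, with each session's ego added
// to the node list. Assumed shape: { ego, nodes: [], edges: [] }.
function mergeSessionNetworks(networks, egoCastType) {
  return networks.reduce((union, network) => ({
    nodes: [
      ...union.nodes,
      castEgoAsNode(network.ego, egoCastType),
      ...network.nodes,
    ],
    edges: [...union.edges, ...network.edges],
  }), { nodes: [], edges: [] });
}
```

Because the result is a single node list, the same merged network can be fed either to the resolver script for another round of matching or on to file generation.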

> Also, exporting without using entity resolution gives me errors -- graphml or csv.

So the reason export doesn't work? The resolution side of things has been updated to support the new sessions format, but the original export code has been untouched.

I'm not sure how much wrangling of the networks the new exporter does internally, but I'm hopeful that it will actually simplify things!

wwqrd commented 3 years ago

> A minor question: could it remember that I use "python" instead of "python3" when I start a new resolution? And possibly my resolver script path? If I cancel out of the export screen and come back to export later, it doesn't remember my interpreter.

It defaults to the settings for the last resolution, so if there are no resolutions yet it will just reset each time. I think that would be a good improvement though, and it should still maintain the effect of remembering the last resolution. Maybe it should save it in the protocol metadata?

> At first I thought it surprising that if I save a resolution, and leave the export screen and come back, I can't change "Node Type", even if I am starting a new resolution. After some time, I realized these were cumulative, not alternate export options. Not really a request for anything, just a comment on my own ignorance in using a new-to-me feature. I'm not sure if there is any way to make that more clear, or if the documentation will convey that for us.

Yes, that's correct. It could perhaps be changed to apply a different ego cast type per resolution, but that could get even more complicated. Not sure where the full messaging should live (docs?), but I will add a prompt to the UI.

> If it doesn't find any matches, can we export anyway? Right now I have to untick the "Enable entity resolution" option to continue with an export that had no resolutions. But maybe the idea is to let them more easily change settings so they can get matches?

Thinking on this, we will add instructions on what to do for now, but we could potentially add an export button to this panel.

> Ego did not export at all? I think? At least the ego file was empty? Or maybe I misunderstand something else here!

@jthrilly is correct on this one.

wwqrd commented 3 years ago

This issue has been superseded by https://github.com/complexdatacollective/Server/issues/292

complexdatacollective / Server

Entity resolution #275