complexdatacollective / Server

A tool for storing, analyzing, and exporting Network Canvas interview data.
http://networkcanvas.com/
GNU General Public License v3.0

Entity resolution #275

Closed wwqrd closed 3 years ago

wwqrd commented 4 years ago

This is a work in progress of entity resolution, to make current progress more visible and to enable collaboration!

Working features:

Missing features:

Known Issues:

wwqrd commented 4 years ago

Script will handle threshold, but app needs to pass the configuration to the script

wwqrd commented 4 years ago

WIP feedback:

  • I think we should remove the text "variable" from the first cell. I know it's supposed to be a heading for the column, but I read it as a heading for the row and was confused. I don't think we need it.
  • The first row is the part that really confused me. These are node IDs, which the researcher might never have seen before (and certainly won't recognize the nodes by). The intent of that row is to provide a shortcut for "use all in this column". I think it would improve things to (1) show a node 'preview' in the header area, using the same logic as Network Canvas (so, render a node in the correct color, with the correct label), and (2) change the labels for the first row selects to "Use all values from this node".
  • Then, I think "not a match" should be given higher visual priority. Right now it is covered by the "skip" button, which is at the bottom in the button group. Perhaps we could put it on the first row as a third select option that greys out the rest of the table when selected. If you don't like that idea, how about making the button row fixed, so that the button is always visible? Either way, I think we should change the text to "Not a match".
  • The "resolve" button is a primary action, and so should be in the primary color.

Mockup:

[Image: Screenshot 2020-06-02 at 12 41 54]

rebeccamadsen commented 4 years ago

What's the best way to test this? What kind of script path and arguments should I be using?

wwqrd commented 4 years ago

@rebeccamadsen This has a couple more outstanding things to do on it now - I'll write up some detailed instructions when it's ready to go.

wwqrd commented 4 years ago

@rebeccamadsen for the script path I think the best one to use would be Simple.py (https://github.com/codaco/entity-resolution-sample), no arguments, full system path. It will assign random probabilities to node pairs.

Hopefully the rest is relatively self-explanatory - but if not, let's work on improving the UI so that it is!

Some known issues/questions:

rebeccamadsen commented 4 years ago

Is the Python file a local file, or do I use the URL? Do I need to quote the filename if there are spaces in the name? How long should it take to run this? I'm thinking I don't have the filename right or something, since I've been waiting for a while.

jthrilly commented 4 years ago

Some questions:

Some items:

rebeccamadsen commented 4 years ago

Gave this a go on Windows today, and got a couple of errors. Let me know what I should try to help figure out these (so helpful) errors.

This is when I had the script in a folder that contains a space: [Image: space in filename]

So I moved the script to a different folder and got this instead: [Image: spawn unknown]
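
The thread does not identify the cause of either error. As an assumption rather than a diagnosis, one common source of trouble with spaces is the way the script path reaches the child process. A minimal Node.js sketch: pass the path as its own element of the argument array to `child_process.spawnSync`, so no shell splits it at the space. The helper name `runResolverScript` and the temporary demo folder are hypothetical and are not taken from the Server codebase.

```javascript
const { spawnSync } = require('child_process');
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: run `interpreter scriptPath ...args` without a shell.
// The path is a separate array element, so spaces in it need no quoting.
function runResolverScript(interpreter, scriptPath, args = []) {
  const result = spawnSync(interpreter, [scriptPath, ...args], { encoding: 'utf8' });
  if (result.error) {
    // For example ENOENT when the interpreter cannot be found on the PATH.
    throw new Error(`Could not start "${interpreter}": ${result.error.message}`);
  }
  return result;
}

// Demo: a throwaway script inside a folder whose name contains a space.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'resolver demo '));
const demoScript = path.join(demoDir, 'echo.js');
fs.writeFileSync(demoScript, 'process.stdout.write("ok");');

// process.execPath (the running Node binary) stands in for python/python3 here.
const demoResult = runResolverScript(process.execPath, demoScript);
console.log(demoResult.status, demoResult.stdout);
```

Wrapping the start-up failure in a readable message is one way to make a bare "spawn" error easier to act on, for instance when `python3` is not installed under that name on Windows.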

rebeccamadsen commented 3 years ago

A minor question: could it remember that I use "python" instead of "python3" when I start a new resolution? And possibly my resolver script path? If I cancel out of the export screen and come back to export later, it doesn't remember my interpreter.

At first I thought it surprising that if I save a resolution, and leave the export screen and come back, I can't change "Node Type", even if I am starting a new resolution. After some time, I realized these were cumulative, not alternate export options. Not really a request for anything, just a comment on my own ignorance in using a new-to-me feature. I'm not sure if there is any way to make that more clear, or if the documentation will convey that for us.

If it doesn't find any matches, can we export anyway? Right now I have to untick the "Enable entity resolution" option to continue with an export that had no resolutions. But maybe the idea is to let them more easily change settings so they can get matches?

Ego did not export at all? I think? At least the ego file was empty? Or maybe I misunderstand something else here!

rebeccamadsen commented 3 years ago

Also, exporting without using entity resolution gives me errors -- graphml or csv.

jthrilly commented 3 years ago

> Ego did not export at all? I think? At least the ego file was empty? Or maybe I misunderstand something else here!

This one is pretty simple: ego gets merged with the node list as it is considered to be a possible alter in other networks. I bet this will trip up the new network exporter, but we will have to see.
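
A minimal sketch of that merge, to show why the ego file can end up empty. The network shape (`ego`, `nodes`, `edges`), the `_uid` field and the name `castEgoAsNode` are assumptions made for illustration and may not match the real session format.

```javascript
// Hypothetical network shape: { ego: {...}, nodes: [...], edges: [...] }.
// `egoCastType` is the node type the ego should be treated as.
function castEgoAsNode(network, egoCastType) {
  const { ego, nodes, ...rest } = network;
  const egoNode = { ...ego, type: egoCastType };
  // The ego joins the node list, so no separate ego record is left to export.
  return { ...rest, nodes: [...nodes, egoNode] };
}

const exampleNetwork = {
  ego: { _uid: 'ego-1', attributes: { name: 'Sam' } },
  nodes: [{ _uid: 'n-1', type: 'person', attributes: { name: 'Alex' } }],
  edges: [],
};

const castNetwork = castEgoAsNode(exampleNetwork, 'person');
console.log(castNetwork.nodes.map(node => node._uid));
```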

jthrilly commented 3 years ago

> Also, exporting without using entity resolution gives me errors -- graphml or csv.

Probably best to fix this by updating the network exporters at this point. Bit confusing that it errors, though. What is the error you get?

Reason this is strange is that although the format for data has changed, it shouldn't be changing within the entity resolution. So data coming out of entity resolution should be just as incompatible with the old network exporters as data straight from the database.

wwqrd commented 3 years ago

> Reason this is strange is that although the format for data has changed, it shouldn't be changing within the entity resolution. So data coming out of entity resolution should be just as incompatible with the old network exporters as data straight from the database.

The resolver feature sits between where the export manager fetches the network and where it generates the files.

const exportPromise = makeTempDir()
    // ...
    .then(() => {
        if (!enableEntityResolution) {
            // Old implementation
            return this.protocolManager.getProtocolSessions(protocol._id, null, null)
                .then(sessions => sessions.map(formatSessionAsNetwork))
                .then(networks => (useEgoData ? insertEgoInNetworks(networks) : networks))
                .then(networks => (exportNetworkUnion ? [unionOfNetworks(networks)] : networks));
        }

        // resolver implementation
        return this.resolverManager.getNetwork(
            protocol,
            networkOpts,
        );
    })
    .then((networks) => {
        // ... exportFiles()
    });

That means internally it has something like an implementation of:

    return this.protocolManager.getProtocolSessions(protocol._id, null, null)
        .then(sessions => sessions.map(formatSessionAsNetwork))
        .then(networks => (useEgoData ? insertEgoInNetworks(networks) : networks))
        .then(networks => (exportNetworkUnion ? [unionOfNetworks(networks)] : networks));

but simplified: for instance, it always merges the networks, and it implements ego type casting.

The reason it does that is because this.resolverManager.getNetwork() is also used outside of the export process in the resolve endpoint: because resolutions are iterative, we also resolve the network before passing the nodes to the resolver for further resolutions. Effectively an export without file generation.
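
To illustrate the "export without file generation" idea, here is a rough sketch under stated assumptions. The function names (`unionNetworksSketch`, `applyResolution`, `getNetworkSketch`) and the resolution shape (a list of `transforms`, each replacing several node ids with one merged node) are hypothetical and are not the actual `resolverManager` internals.

```javascript
// Hypothetical stand-in for the always-on merge: concatenate all session networks.
function unionNetworksSketch(networks) {
  return networks.reduce(
    (union, network) => ({
      nodes: [...union.nodes, ...network.nodes],
      edges: [...union.edges, ...network.edges],
    }),
    { nodes: [], edges: [] },
  );
}

// Hypothetical resolution shape: { transforms: [{ nodeIds, mergedNode }] }.
// Each transform swaps the listed nodes for one merged node.
// Edges are left untouched here; a real implementation would need to rewire them.
function applyResolution(network, resolution) {
  return resolution.transforms.reduce(
    (current, { nodeIds, mergedNode }) => ({
      ...current,
      nodes: [...current.nodes.filter(node => !nodeIds.includes(node.id)), mergedNode],
    }),
    network,
  );
}

// Same entry point for both callers: export writes files from the result,
// while the resolve endpoint hands the resulting nodes to the resolver script.
function getNetworkSketch(sessionNetworks, resolutions) {
  return resolutions.reduce(applyResolution, unionNetworksSketch(sessionNetworks));
}

const sessionNetworks = [
  { nodes: [{ id: 'a', name: 'Alex' }], edges: [] },
  { nodes: [{ id: 'b', name: 'Alexander' }, { id: 'c', name: 'Kim' }], edges: [] },
];
const previousResolutions = [
  { transforms: [{ nodeIds: ['a', 'b'], mergedNode: { id: 'ab', name: 'Alex' } }] },
];

const resolvedNetwork = getNetworkSketch(sessionNetworks, previousResolutions);
console.log(resolvedNetwork.nodes.map(node => node.id));
```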

> Also, exporting without using entity resolution gives me errors -- graphml or csv.

So the reason export doesn't work? The resolution side of things has been updated to support the new sessions format, but the original export code has been untouched.

I'm not sure how much wrangling of the networks the new exporter does internally, but I'm hopeful that it will actually simplify things!

wwqrd commented 3 years ago

> A minor question: could it remember that I use "python" instead of "python3" when I start a new resolution? And possibly my resolver script path? If I cancel out of the export screen and come back to export later, it doesn't remember my interpreter.

It defaults to the settings for the last resolution, so if there are no saved resolutions it just resets each time. I think remembering them would be a good improvement though, and it should still maintain the effect of remembering the last resolution. Maybe it should save them in the protocol metadata?
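
One way that fallback order could look, as a sketch only. The option names (`interpreterPath`, `resolverPath`), the `options` field on a resolution and the idea of a per-protocol `savedOptions` object are all assumptions for illustration, not existing Server fields.

```javascript
// Hypothetical built-in defaults, used when nothing has been saved.
const FALLBACK_OPTIONS = { interpreterPath: 'python3', resolverPath: '' };

// Priority, lowest to highest: built-in defaults, options saved with the
// protocol, options of the most recent resolution.
function getInitialResolverOptions(resolutions, savedOptions = {}) {
  const lastResolution = resolutions[resolutions.length - 1];
  const lastOptions = lastResolution ? lastResolution.options : {};
  return { ...FALLBACK_OPTIONS, ...savedOptions, ...lastOptions };
}

// No resolutions yet: the protocol-level value replaces the built-in default.
const firstRun = getInitialResolverOptions([], { interpreterPath: 'python' });

// With a saved resolution: its options win over the protocol-level values.
const laterRun = getInitialResolverOptions(
  [{ options: { interpreterPath: 'python3', resolverPath: '/scripts/Simple.py' } }],
  { interpreterPath: 'python' },
);
console.log(firstRun, laterRun);
```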

> At first I thought it surprising that if I save a resolution, and leave the export screen and come back, I can't change "Node Type", even if I am starting a new resolution. After some time, I realized these were cumulative, not alternate export options. Not really a request for anything, just a comment on my own ignorance in using a new-to-me feature. I'm not sure if there is any way to make that more clear, or if the documentation will convey that for us.

Yes, that's correct. It could perhaps be changed to apply a different ego cast type per resolution, but that could get even more complicated. Not sure where the full messaging should live (docs?), but I will add a prompt to the UI.

> If it doesn't find any matches, can we export anyway? Right now I have to untick the "Enable entity resolution" option to continue with an export that had no resolutions. But maybe the idea is to let them more easily change settings so they can get matches?

Thinking about this, we will add instructions on what to do for now, but we could potentially add an export button to this panel.

> Ego did not export at all? I think? At least the ego file was empty? Or maybe I misunderstand something else here!

@jthrilly is correct on this one.

wwqrd commented 3 years ago

This issue has been superseded by https://github.com/complexdatacollective/Server/issues/292