SSHOC / marketplace-curation

Project to manage scripts and auxiliary data, via Python library and Jupyter notebooks, for the curation of the SSH Open Marketplace

Curating MP actors #5

Closed dpancic closed 7 months ago

dpancic commented 2 years ago

In GitLab by @laureD19 on Nov 21, 2022, 11:55

A few experiments have already been conducted to curate MP actors. In particular, methods to identify and merge duplicates (see e.g. #13) and to identify multi-value actors to disambiguate (see #4) are set up.

These methods are currently summarised in an actor notebook. As @aureon249 and I are trying to re-open the actor curation activities, we've noticed several points that could be checked or improved in the notebook. @cesareconcordia, could you have a look at the following, please, and let us know what you think?

  1. would it be possible to automate the merging steps for the more than 2800 duplicates found? Currently the notebook follows the merge POST and curators need to add actor IDs manually to perform each merge. Could this step be automated, meaning that the merge would be performed automatically based on the list of exact matches, without the comparison step currently implemented? This would be needed at least for this initial, large merge operation (and would spare Martin and me from performing 1 400 merge POSTs manually! :))
  2. because we discovered that it is not possible to "delete" actors (cf. #be143), actors not attached to any items cannot be deleted but need to be merged with existing ones. So section 3 of the current notebook and its associated methods become a bit useless.
  3. for the multi-value actors to disambiguate, would it be possible to get an output creating one line per actor ID, instead of having an item list?
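
For point 3, here is a minimal sketch of the reshaping being asked for, assuming the current output is a list of items that each carry the actors attached to them. The field names (`item_id`, `actors`, `id`, `name`) and the sample rows are hypothetical and not taken from the notebook:

```python
from collections import defaultdict


def one_row_per_actor(items):
    """Invert an item-centred list into one row per actor ID.

    `items` is assumed (hypothetically) to look like
    [{"item_id": ..., "actors": [{"id": ..., "name": ...}, ...]}, ...].
    """
    names = {}
    attached = defaultdict(list)
    for item in items:
        for actor in item["actors"]:
            names.setdefault(actor["id"], actor["name"])
            attached[actor["id"]].append(item["item_id"])
    # One output row per actor ID, sorted for a stable order.
    return [
        {"actor_id": actor_id, "name": names[actor_id], "item_ids": attached[actor_id]}
        for actor_id in sorted(names)
    ]


# Made-up sample data, for illustration only.
sample_items = [
    {"item_id": "i1", "actors": [{"id": 10, "name": "A; B"}]},
    {"item_id": "i2", "actors": [{"id": 10, "name": "A; B"}, {"id": 11, "name": "C"}]},
]
actor_rows = one_row_per_actor(sample_items)
```

Each row could then be written as one line of a sheet, with the item IDs kept in a single cell for reference.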

Once these points are implemented and allow manual correction, a more regular workflow will be set up, following what was foreseen in #10.

notify also @KlausIllmayer @vronk

dpancic commented 1 year ago

In GitLab by @cesareconcordia on Dec 7, 2022, 12:50

The notebook 4.3ActorsCuration-Duplicates will implement the workflow to curate duplicated actors. The first release treats actors with the same name and website as duplicates. Have a look at it and let me know what you think. In some cases an actor has more than one duplicate: how could these be merged?

laureD19 commented 1 year ago

Thx a lot, @cesareconcordia for 4.3 ActorsCuration-Duplicates.

A few comments after we had a look at it with Martin @aureon249 . IDs refer to production data.

  1. A few cases seem to present a problem where the IDs to merge are actually one and the same (only three cases identified among the 716 IDs: 2587, 2075 and 3020). It might be good to add a control check to exclude this possibility.

  2. Regarding your question about merging more than two actors, it is actually possible using the same API POST: /api/actors/{id}/merge?with={id}&with={id}&with={id}.... We tested it on the stage instance, merging up to 5 IDs, and it worked without any problem.

  3. When it comes to other use cases than the name+website comparison, here is our suggestion with @aureon249 :

    • Different actors with the exact same name that were once attached to the same item(s) => could be merged without individual investigation. Example: actors 2411 & 564
    • Different actors with the exact same name that were never attached to any items => could be merged without individual investigation. Example: 1665 and 3210.
    • Different actors with the exact same name that were never attached to the same items => comparison step and further investigation needed before deciding if merging or not. Example: 4689 and 9111
  4. A general comment re. the comparison of actor duplicates: it would be great to add a view of the items an actor is attached to in the "comparison and show differences" step.
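
To illustrate points 1 and 2 together, here is a rough sketch of how the merge path with repeated `with` parameters could be built while skipping the case where an ID would be merged with itself. The helper name `build_merge_path` is hypothetical, the choice of which actor is the merge target is an assumption, and actually sending the POST (base URL, authentication) is left out:

```python
from urllib.parse import urlencode


def build_merge_path(target_id, other_ids):
    """Build the merge path for one target actor and one or more duplicates.

    Hypothetical helper: it only builds the path, it does not call the API.
    """
    # Control check: drop the target itself and repeated IDs, keeping order.
    unique_others = []
    for actor_id in other_ids:
        if actor_id != target_id and actor_id not in unique_others:
            unique_others.append(actor_id)
    if not unique_others:
        raise ValueError(f"nothing to merge into actor {target_id}")
    # A list of pairs lets the same key ("with") appear several times.
    query = urlencode([("with", actor_id) for actor_id in unique_others])
    return f"/api/actors/{target_id}/merge?{query}"


# Example with the IDs mentioned above; 2411 as target is an arbitrary choice.
path = build_merge_path(2411, [564, 2411, 564])
```

With this check, a pair such as 2587 and 2587 raises an error instead of producing a merge call.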

notify also @KlausIllmayer, @vronk and @kreetrapper

cesareconcordia commented 1 year ago

Hello @laureD19 @aureon249 ,

  1. done, added a control check in the library function to prevent this
  2. implementation done, I'm going to test it on the stage instance, will let you know
  3. implementation in progress for the proposals a) (actors with the same name that were once attached to the same item(s)) and b) (actors with the same name that were never attached to any items)
  4. investigating

cesareconcordia commented 1 year ago

Hi, I've committed the new release of notebook 4.3 - Actors curation: duplicates to the dev branch. It has been tested with no errors on the stage dataset. Have a look at it and let me know.

laureD19 commented 1 year ago

hi :) thanks a lot, Cesare!! A few observations after I tested it on stage and prod data:

cesareconcordia commented 1 year ago

hello Laure, all, I've uploaded a new version of the Notebook 4.3 (dev branch): 1) the bug spotted by Laure seems to be fixed, please check, 2) there is a section that finds and merges actors duplicated by 'name'. Let me know if there are problems.

cesareconcordia commented 1 year ago

A new version of notebook 4.3 has been uploaded, overall behaviour:

There are some special cases: for instance, the IDs 7812, 149 and 1954 refer to three actors with the same name. Two of them have one item in common, but the third has no items in common with the others. This case is added to the dataframe of actors not having items in common. Let me know if it is correct and feel free to ask any questions (and let me know if it works :)
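
As a sketch of the rule described here (not the notebook's actual code), a group of same-name actors could be classified as follows. The function name, the input shape and the item IDs in the example are assumptions made for illustration:

```python
def classify_group(items_by_actor):
    """Classify one group of actors that share the exact same name.

    `items_by_actor` maps actor ID -> set of item IDs the actor is or was
    attached to (hypothetical shape). Returns "merge" or "investigate".
    """
    # No actor in the group was ever attached to an item: merge directly.
    if all(not items for items in items_by_actor.values()):
        return "merge"
    for actor_id, items in items_by_actor.items():
        # Items of all the other actors in the group.
        others = set().union(
            *(v for k, v in items_by_actor.items() if k != actor_id)
        )
        # This actor shares no item with the rest: needs a manual look.
        if not (items & others):
            return "investigate"
    # Every actor shares at least one item with another one: merge directly.
    return "merge"


# The three-actor case above, with made-up item IDs.
special_case = {7812: {"item-x"}, 149: {"item-x", "item-y"}, 1954: {"item-z"}}
result = classify_group(special_case)
```

Under this rule a single actor without shared items is enough to send the whole group to the "no items in common" list, which matches the behaviour described for 7812, 149 and 1954.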

laureD19 commented 1 year ago

Thx @cesareconcordia ! I've tested this new version on stage and prod data. A few comments based on my tests:

I think we are ready to proceed on prod. We meet this Friday with @aureon249 , and if I don't hear back from some of you by then, we would probably run 4.3 on production data.

laureD19 commented 1 year ago

cesareconcordia commented 1 year ago

There is a new notebook that deletes actors not attached to any item. After a test on 'stage', it has been executed on the production instance: 892 actors were identified, 379 of them were successfully deleted, and 513 actors cannot be deleted at the moment since they are connected to deleted items or affiliated to existing actors. Maybe the API entry for removing actors needs a 'force=true' option.

KlausIllmayer commented 1 year ago

@cesareconcordia created the script to delete actors that are not connected to items, it has this workflow:

  1. it gets all actors from the MP using api/actors
  2. it creates a dataframe with all the actors where item list returned by api/actors/{id}/items=true is empty
  3. it deletes, using force=true, every actor in this dataframe (workflow information by Cesare)

As force=true for deleting actors is currently not implemented (see https://github.com/SSHOC/sshoc-marketplace-backend/issues/388), it is only possible to delete actors that were never connected to any items. Even if an actor was only connected to deleted items, it will not be possible to delete this actor without the force parameter. Additionally, an actor that is an affiliation of another actor cannot be deleted either. Thus, not all actors listed at step 2 of the workflow can be deleted.

We found out that there are actors without a connection to any items that have affiliations to actors that are also without a connection to any items. Instead of creating an algorithm to order the deletion of actors so that these use cases are covered (which would also lack adequate API endpoints), we decided to simply run the script twice so that it covers such situations. The result of the script was already reported by Cesare in this issue.
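
To show why a second run helps, here is a small in-memory sketch of the blocking rules described above. It makes no API calls. The data shape (`items` meaning every item an actor was ever attached to, including deleted ones, and `affiliations` as a set of actor IDs) and the function names are assumptions for illustration only:

```python
def deletable_now(actors):
    """Return the IDs of actors that could be deleted without a force option.

    `actors` maps actor ID -> {"items": set of item IDs ever attached,
                               "affiliations": set of actor IDs}.
    """
    # An actor listed as an affiliation by any remaining actor is blocked.
    referenced = set()
    for data in actors.values():
        referenced |= data["affiliations"]
    return {
        actor_id
        for actor_id, data in actors.items()
        if not data["items"] and actor_id not in referenced
    }


def run_passes(actors, passes=2):
    """Simulate running the deletion script `passes` times in a row."""
    remaining = dict(actors)
    deleted_per_pass = []
    for _ in range(passes):
        to_delete = deletable_now(remaining)
        for actor_id in to_delete:
            del remaining[actor_id]
        deleted_per_pass.append(sorted(to_delete))
    return deleted_per_pass, remaining


# Made-up example: actor 1 has actor 2 as affiliation, neither has items.
example = {
    1: {"items": set(), "affiliations": {2}},
    2: {"items": set(), "affiliations": set()},
    3: {"items": {"i1"}, "affiliations": set()},
}
deleted, left = run_passes(example)
```

In this example the first pass can only remove actor 1, because actor 2 is still referenced as an affiliation; the second pass then removes actor 2. Longer affiliation chains would need more than two passes.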

What we need to discuss, @laureD19: if the force option is implemented, do we also want to forcefully delete actors that do not have any connection to an item but act as affiliations for other actors? I think we need to find a clearer rule for what to do with such actors.

cesareconcordia commented 1 year ago

In the dev branch there is a notebook that reads the possibly duplicated actors from the duplicated actors gsheet and executes the merge. It has not been tested on production, so be sure to make a backup of the dataset before using it. Let me know if there are problems.

laureD19 commented 1 year ago

166 actors merged this morning thanks to Cesare's notebook and our manual checks in the gsheet with Martin!

81 duplicates remain that we need to re-inspect and manually curate before proceeding, perhaps for some of them, with a merge.

laureD19 commented 1 year ago

69 actors merged running this notebook after the second check performed with Martin here (column e).

From 7613 to 7535 actors.

Next steps in actors curation:

  1. manual curation of the 11 duplicates with N in column e
  2. manual curation of special characters
  3. Reopening multi value actors curation based on:

laureD19 commented 7 months ago

Closing this issue to break it down into more manageable ones.