Closed dpancic closed 7 months ago
In GitLab by @cesareconcordia on Dec 7, 2022, 12:50
The notebook 4.3ActorsCuration-Duplicates will implement the workflow to curate duplicated actors. The first release treats actors with the same name and website as duplicates. Have a look at it and let me know what you think. In some cases an actor has more than one duplicate: how could these be merged?
Thx a lot, @cesareconcordia for 4.3 ActorsCuration-Duplicates.
A few comments after we had a look at it with Martin @aureon249 . IDs refer to production data.
A few cases seem to present a problem where the IDs to merge are actually the same one (only three cases identified among the 716 IDs: 2587, 2075 and 3020). It might be good to add a control check to exclude this possibility.
Regarding your question about merging more than two actors, this is possible using the same API call: POST /api/actors/{id}/merge?with={id}&with={id}&with={id}.... We tested it on the stage instance, merging up to 5 IDs, and it worked without any problem.
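To illustrate both points (the control check against merging an ID with itself, and the repeated `with` parameter), here is a minimal sketch. The function name `build_merge_path` is hypothetical and not taken from the notebook; it only builds the path and leaves the actual POST and authentication to the caller.

```python
from urllib.parse import urlencode


def build_merge_path(target_id, duplicate_ids):
    """Return the merge path for target_id, or None if nothing is left to merge."""
    # Control check: drop the target itself and repeated ids, keeping the order.
    others = []
    for actor_id in duplicate_ids:
        if actor_id != target_id and actor_id not in others:
            others.append(actor_id)
    if not others:
        return None
    # One "with" parameter per actor to merge into the target.
    query = urlencode([("with", actor_id) for actor_id in others])
    return f"/api/actors/{target_id}/merge?{query}"


print(build_merge_path(10, [11, 12, 10, 11]))  # ids 10, 11, 12 are made up
print(build_merge_path(2587, [2587]))  # a self-merge case is skipped
```

Returning `None` for a self-merge lets the calling loop skip the request instead of sending it.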
When it comes to other use cases than the name+website comparison, here is our suggestion with @aureon249 :
A general comment on the comparison of actor duplicates: it would be great to show the items each actor is attached to in the "comparison and show differences" step.
notify also @KlausIllmayer, @vronk and @kreetrapper
Hello @laureD19 @aureon249 ,
hi, I've committed the new release of notebook 4.3 - Actors curation: duplicates in the dev branch. It has been tested with no errors on the stage dataset. Have a look at it and let me know.
hi :) thanks a lot, Cesare!! A few observations after I tested it on stage and prod data:
If I didn't misunderstand something, I believe there is a bug with the "Different actors with the exact same name that were never attached to the same items" method (= df_actors_with_different_items), as it gives me results - for ex. actors ids [842, 2720] - that are attached to the same item when I check via the API (https://marketplace-api.sshopencloud.eu/api/actors/842?items=true and https://marketplace-api.sshopencloud.eu/api/actors/2720?items=true). These cases should be included in the third df (df_actors_with_same_items) so that we can merge them without manual checks.
In general, I think same name + same website is a good enough condition to (directly) merge duplicates identified on this basis, and that the approach checking if actors have ever been attached to the same items could be kept to deal with the duplicates identified based on identical name only. For ex. with prod data, there are 715 duplicates name+website, but 2837 duplicates based on name only. So I think it would make sense to have a flow like the following in the notebook: 1. name+website duplicates; 2. merge names+website duplicates; 3. name duplicates; 4. check if these name duplicates are attached to any item/ same items/ not same items
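Steps 1 to 3 of that flow could be sketched roughly as follows. This uses plain Python dictionaries rather than the notebook's dataframes; the function name `find_duplicates`, the sample actors, and the choice to skip actors with a missing name or website are all assumptions made for illustration, and the merge itself is only simulated.

```python
from collections import defaultdict


def find_duplicates(actors, keys):
    """Group actor ids that share exactly the same values for all keys."""
    groups = defaultdict(list)
    for actor in actors:
        values = tuple(actor.get(key) for key in keys)
        # Assumption: a missing value never counts as a match.
        if any(value in (None, "") for value in values):
            continue
        groups[values].append(actor["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


# Made-up sample data.
actors = [
    {"id": 1, "name": "Ada", "website": "https://a.example"},
    {"id": 2, "name": "Ada", "website": "https://a.example"},
    {"id": 3, "name": "Ada", "website": None},
    {"id": 4, "name": "Bob", "website": "https://b.example"},
]

# Step 1: name + website duplicates.
by_name_and_website = find_duplicates(actors, ["name", "website"])

# Step 2 (simulated): keep the first id of each group, drop the others.
merged_away = {i for ids in by_name_and_website for i in ids[1:]}
remaining = [a for a in actors if a["id"] not in merged_away]

# Step 3: name-only duplicates among the remaining actors.
by_name = find_duplicates(remaining, ["name"])

print(by_name_and_website)
print(by_name)
```

Running the name-only pass after the merge keeps already merged actors out of the second, larger list that needs the item check.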
hello Laure, all, I've uploaded a new version of Notebook 4.3 (dev branch): 1) the bug spotted by Laure seems fixed, please check; 2) there is a section that finds and merges actors duplicated by 'name'. Let me know if there are problems.
A new version of notebook 4.3 has been uploaded, overall behaviour:
There are some special cases: for instance, the ids 7812, 149 and 1954 refer to three actors with the same name. Two of them have one item in common, but the third has no items in common with the others. This case is added to the dataframe of actors not having items in common. Let me know if this is correct, feel free to ask any questions, and let me know if it works :)
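One way to express the rule for such groups is sketched below. It assumes that a group only counts as "same items" when every pair of actors in it shares at least one item, which matches the special case described above but is not confirmed as the notebook's exact logic. The function name `classify_group` and the sample data are hypothetical; the returned labels echo the dataframe names mentioned earlier in this thread.

```python
from itertools import combinations


def classify_group(items_by_actor):
    """items_by_actor maps an actor id to the set of item ids it is attached to."""
    pairs = combinations(items_by_actor.values(), 2)
    # Assumption: every pair must share at least one item.
    if all(first & second for first, second in pairs):
        return "same_items"
    return "different_items"


# Made-up group of three actors with the same name:
# "a" and "b" share item-2, "c" shares nothing with either.
partial_overlap = {
    "a": {"item-1", "item-2"},
    "b": {"item-2"},
    "c": {"item-9"},
}
full_overlap = {"a": {"item-1"}, "b": {"item-1", "item-2"}}

print(classify_group(partial_overlap))
print(classify_group(full_overlap))
```

Under this rule a group with partial overlap is sent to manual checks as a whole, rather than merging the two overlapping actors automatically.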
Thx @cesareconcordia ! I've tested this new version on stage and prod data. A few comments based on my tests:
I think we are ready to proceed on prod. We meet this Friday with @aureon249 , and if I don't hear back from some of you by then, we would probably run 4.3 on production data.
There is a new notebook that deletes actors not attached to any item. After a test on 'stage', it has been executed on the production instance: 892 actors were identified and 379 of them were successfully deleted. The other 513 actors cannot be deleted at the moment, since they are connected to deleted items or affiliated to existing actors. Maybe the API endpoint for removing actors needs a 'force=true' option.
@cesareconcordia created the script to delete actors that are not connected to items. It has this workflow:
1. Get all actors from api/actors.
2. List the actors for which the items returned by api/actors/{id}?items=true is empty.
3. Delete the actors listed at step 2.
As the force=true parameter for deleting actors is currently not implemented (see https://github.com/SSHOC/sshoc-marketplace-backend/issues/388), it is only possible to delete actors that were never connected to any items. Even if an actor was only connected to deleted items, it will not be possible to delete this actor without the force parameter. Additionally, an actor that is an affiliation of another actor can't be deleted either. Thus, not all actors listed at step 2 of the workflow can be deleted.
We found out that there are actors without connection to any items that have affiliations to actors that are also without connection to any items. Instead of creating an algorithm to order the deletion of actors so that these use cases are covered (which also lacks adequate API endpoints), we decided to simply run the script twice so that it covers such situations. The result of the script was already reported by Cesare in this issue.
What we need to discuss, @laureD19: if the force option is implemented, do we also want to forcefully delete actors who have no connection to any item but act as affiliations for other actors? I think we need to find a clearer rule for what to do with such actors.
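The idea of running the deletion twice instead of ordering it can be sketched as below. This is not the actual script: `delete_unattached` and the two callables passed to it are hypothetical, and the API is replaced by a small in-memory stand-in in which an actor cannot be deleted while another existing actor still lists it as an affiliation.

```python
def delete_unattached(actor_ids, has_items, delete, passes=2):
    """Try to delete actors without items; retry the refused ones on later passes."""
    remaining = list(actor_ids)
    deleted = []
    for _ in range(passes):
        refused = []
        for actor_id in remaining:
            if has_items(actor_id):
                continue  # attached to an item, not a candidate
            if delete(actor_id):
                deleted.append(actor_id)
            else:
                refused.append(actor_id)
        remaining = refused
    return deleted, remaining


# In-memory stand-in for the API (made-up data).
items = {1: [], 2: [], 3: ["item-a"]}
affiliations = {1: {2}, 2: set(), 3: set()}  # actor 1 lists actor 2 as affiliation
existing = {1, 2, 3}


def has_items(actor_id):
    return bool(items[actor_id])


def delete(actor_id):
    # Refuse while another existing actor still uses this one as an affiliation.
    if any(actor_id in affiliations[o] for o in existing if o != actor_id):
        return False
    existing.discard(actor_id)
    return True


deleted, blocked = delete_unattached([2, 1, 3], has_items, delete)
print(deleted, blocked)
```

In the first pass actor 2 is refused because actor 1 still exists; once actor 1 is gone, the second pass removes actor 2. Longer affiliation chains would need more passes, which the `passes` argument allows.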
In the dev branch there is a notebook that reads the possibly duplicated actors from the duplicated actors gsheet and executes the merge. It has not been tested on production, so be sure to make a backup of the dataset before using it. Let me know if there are problems.
166 actors merged this morning thanks to Cesare's notebook and our manual checks in the gsheet with Martin!
81 duplicates remain that we need to re-inspect and manually curate before proceeding with a merge, perhaps only for some of them.
69 actors merged running this notebook after the second check performed with Martin here (column e).
From 7613 to 7535 actors.
Next steps in actors curation:
closing this issue to break it down into more manageable ones.
In GitLab by @laureD19 on Nov 21, 2022, 11:55
A few experiments have already been conducted to curate MP actors; in particular, methods to identify duplicates and merge them (see for example #13) and to identify multi-values to disambiguate (see #4) are set up.
These methods are currently summarised in an actor notebook. As we (@aureon249 and I) are trying to re-open the actors curation activities, we've noticed several points that could be checked or improved in the notebook. @cesareconcordia could you have a look at the following, please, and let us know what you think?
Once these points are implemented and allow manual correction, a more regular workflow will be set up, following what was foreseen in #10.
notify also @KlausIllmayer @vronk