Implement bulk actions #3

Open dpancic opened 3 years ago

In GitLab by @KlausIllmayer on Jul 1, 2021, 18:22

We need bulk actions when a lot of items are involved like approving an ingest. There are - as I see it - three ways how to handle this (and in principle, all of this three ways could operate in parallel):

Define an API endpoint where a bulk operation could be started, e.g. approve all items with status ingested from source x (disadvantage: take it or leave it - granularity could be a problem)
Have in frontend a list with checkboxes to select all the items that are involved + a button where every checkbox could be activated in bulk: after (de)selecting run the bulk action (disdavantage: pagination is a problem as you can only check the items from one page)
Use the Python notebooks to run such an operation (disadvantage: maybe necessary to write custom Python code, no full integration in the frontend)

Open for discussion to find a solution: @vronk @laureD19 @tparkola @egray523 @vronk @stefanprobst @cesareconcordia

In GitLab by @vronk on Nov 3, 2021, 13:40

mentioned in issue sshoc-marketplace-backend#128

In GitLab by @vronk on Nov 3, 2021, 13:40

marked this issue as related to sshoc-marketplace-backend#128

In GitLab by @vronk on Nov 3, 2021, 13:43

We agree to try to implement this via notebooks first. This seems more easily doable with respect to available development resources, even though it is less efficient in the runtime, because notebook has to process all the items sequentially, sending corresponding requests per item. Backend would have probably means to process the item-set more quickly server-side, however it would require implementation work on the backend, for which we currently don't have the capacity.

In GitLab by @vronk on Feb 2, 2022, 15:12

Notebooks seem mostly a solution, we keep this open as low-priority if we see that there are operations which would need server-side processing.

In GitLab by @vronk on Feb 2, 2022, 15:12

unassigned @tparkola and @tparkola

In GitLab by @laureD19 on Jun 22, 2022, 11:57

marked this issue as related to sshoc-marketplace#84

In GitLab by @KlausIllmayer on Sep 14, 2022, 10:43

The workflow would look like this:

Log in as moderator (System moderator should hopefully work, we need to check)
Get all ingested items to a source (the source should be an input parameter)
Go through all of these ingested items and approve them: call the GET endpoint, take the JSON response, and run against the PUT endpoint (by doing this as a moderator it will be approved afterwards)
Give some statistics at the end, how many items where approved

Notify @laureD19 @kreetrapper @aureon249 @cesareconcordia

In GitLab by @KlausIllmayer on Sep 14, 2022, 10:46

What we need to check: if the GET JSON response could be used 1:1 for the PUT endpoint JSON. We had issues in the past about this, they should be solved but I never checked it for all of the possible values. We could create on stage an item that has all values filled out and test it with this one, if after running the script it is 1:1 as the ingested version.

In GitLab by @KlausIllmayer on Sep 14, 2022, 12:34

The API call to get all ingested items from one specific source: GET /api/item-search?d.status=ingested&f.source=INSERT_LABEL_OF_SOURCE - you need to be logged in as a moderator to get results. You can try it out on stage where we have some suggested items from the source SSK Zotero Resources. Get the bearer token for a moderator and call on stage GET /api/item-search?d.status=suggested&f.source=SSK Zotero Resources. It should return some items (as long as no one approves this suggested items).

In GitLab by @laureD19 on Oct 26, 2022, 15:42

moved from sshoc-marketplace#89

We discussed, that we need not only a bulk approval but also a bulk reject workflow. After the ingestion pipeline (re-)harvested data from a source, these items get the status ingested and are not published. Moderators may now decide either to approve all of these items or they decide to reject all of them. Moderators do have a sample look at the ingested items and if they decide that everything was fine with the ingestion, all items of this source coming from an ingestion pipeline are approved and will become the new published version of the items. But moderators may also find problems in the items coming from the ingestion pipeline, e.g., a mapping is not valid anymore due to changes at the source, then it will be necessary to reject all items coming from this source of the ingestion pipeline.

Here is an example workflow, with the option to approve or the option to reject the items, including the API endpoints to use:

Manual preparation by moderators: Moderators get the information, that the ingestion pipeline ran against a source. There are now items in the moderation queue with the status ingested. Moderators look at these items. If there are many of them, this will be based on random samples. Especially items that were already changed on SSHOMP and also changed at source should be covered (but it could be complicated to identify such items, TODO: we should collect what kind of hints are given, if such merges happened, maybe we get a logfile from ingestion pipeline that can tell us, where to look). After inspection, moderators decide either to approve all of these items (it could be, that some special cases where already approved manually) or to reject them. TLDR: Moderators look into the ingested items from a source and decide to run the bulk action either for approving or for rejecting these items
Input parameters for the script prepared by moderators: on the one hand this will be the label of the source that was covered by the ingestion pipeline. The easiest way for moderators to get this id in the frontend is to use the facet Sources in the moderation dashboard. Choose the status facet Ingested and from the sources facet choose the source that should be handled. Copy the name of the source from the facet (unfortunately it is not possible to mark the label and CTRL+C it, you need to type it down), this name is also the label of the source that is used as input parameter. The second parameter is simple a tag that makes clear that the items in the bulk action are approved or rejected. TLDR: two parameters to be prepared and added into the script by moderators, one is "label of source" (string field, called for describing the workflow {param_source_label}) the other is either "approve items" or "reject items" (boolean field or maybe better to make it clearer a string field that must be either "approve" or "reject", called for describing the workflow {param_applied_action})
Script signs into SSHOMP as moderator: POST /api/auth/sign-in
Script checks if source is valid: GET /api/sources and look into the result if the {param_source_label} can be found in one of the labels (if not, give an error message)
Script collects as moderator all ingested items to the source param_source_label: GET /api/item-search?d.status=ingested&f.source={param_source_label} (be aware to url-encode the {param_source_label} as it can contain spaces or other special characters); print the statistics (number of found items)
Script collects detail information of all ingested items: we like to get the information if an ingested item created a conflict-at-source and build a table with short information of all affected items, for this do a GET /api/{category}s/{persistentId}/versions/{id} (the category that you get in item-search is singular but it must be here plural therefore the additional s; I think we have a method that maps to the correct category in the API call, that would be safer) of all items found at step 5, the table should consist of field persistentId, field category,field label, field lastInfoUpdate, the review link to item in the frontend: {frontend-url}/{category}/{persistentId}/version/{id}/review and if it exists the value of the property conflict-at-source (in the json it can be found in {"properties"}[]{"type"}{"code"="conflict-at-source"} - the value is on the same level as "type" and identified as "value") - sorted by first label and second persistentId
Script processes every item from the table:
- If {param_applied_action} is approve script approves the item: to do this, you revert as moderator the ingested version which makes it the published one with PUT /api/{category}s/{persistentId}/versions/{id}/revert (beware that category is plural therefore the additional s, see also comment in step 6) check the http-return-code, it must be "200" if everything went okay, otherwise give an error message
- If {param_applied_action} is reject script rejects the item: to do this, delete as moderator the ingested version with DELETE /api/{category}s/{persistentId}/versions/{id} (beware that category is plural therefore the additional s, see also comment in step 6)

@cesareconcordia I hope the workflow is now more clearer and I hopefully covered all necessary steps, if not, please comment in this issue. There are also examples on the stage and on the development instance of marketplace.

Things to check together with @laureD19:

We should check on examples, what happens if an item at the source changed and the item also changed in the SSHOMP, e.g. the description. Will the ingested item prefer the change at the source or the change in the SSHOMP? Or will it be a combination of the differences?
Connected to this: if an item does have a conflict-on-source-property that has the value true I guess a moderator needs to check it manually and also needs then to remove the conflict-on-source-property (have a look at https://github.com/SSHOC/sshoc-marketplace-backend/issues/85#issuecomment-1352878754 for more information on this property)

Adding a point which needs to be handled: if there are two re-ingests of the same source, we will have of one item two different versions. This can be seen in the table, as there will be two entries with the same persistentId and label. In such cases, we need an agreement how to handle such a situation. Going through the proposed workflow alone could lead to irritating results (depending on the sort algorithms it may approve a version of the first ingest or of the second ingest). Most probably, the script will also run in an error, as the approval of a version will reject all other versions, that are then not available anymore. I guess the best solution is in such cases to only take the newest version of an item, handle this version and if it is an approve, the other versions will disappear (if it is a reject, the other versions won't disappear, so I guess we like to reject also the other versions) Opinions on this @laureD19 @cesareconcordia ?

@KlausIllmayer : workflow seems clear, thanks. Will talk about its implementation during the next EB call

initial test for bulk rejection of ingested items available here

could you have a look @cesareconcordia and @KlausIllmayer and tell me what should be improved?

review and create smaller tasks

SSHOC / marketplace-curation

Implement bulk actions #3