ErwinKomen / RU-passim


Template for bulk data import #741

Open shariboodts opened 7 months ago

shariboodts commented 7 months ago

As we noticed with Elisa's request, data import can still be more complicated than anticipated. It would be good to have a clear pipeline for this, including:

  1. An (Excel?) template for manuscript description categories and authority file description categories
  2. A pipeline to check:
    • Whether the manuscript already exists in PASSIM (if so, only manual checking or completion is possible)
    • Whether the reference codes (Gryson/Clavis) for the manifestations match an existing authority file (if so, automatic linkage is possible).
  3. A pipeline to create new authority files and link them to the manifestations of manuscript descriptions.

This would allow users to import their own datasets, with a simple security measure: import is only possible when you have editor status and a project label connected to that status, so you have to be in touch with a PASSIM admin to set this up. It would mean, however, that smaller bulk imports no longer need help from ICT developers.
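
The two checks under point 2 could look roughly like the sketch below. It is only an illustration: the `Shelfmark` class, the function names and the authority file ids are made up here, and the real lookup would of course go through the PASSIM database instead of in-memory sets and dicts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shelfmark:
    """Hypothetical key for a manuscript: library, city and identifier."""
    library: str
    city: str
    identifier: str


def manuscript_exists(shelfmark, existing):
    """Return True when the shelfmark is already known (no automatic import then)."""
    return shelfmark in existing


def link_reference_codes(codes, authority_files):
    """Split Gryson/Clavis codes into those that match an authority file and those that do not.

    `authority_files` maps a reference code to a (hypothetical) authority file id.
    """
    linked = {code: authority_files[code] for code in codes if code in authority_files}
    unmatched = [code for code in codes if code not in authority_files]
    return linked, unmatched
```

Codes that end up in `unmatched` would be candidates for point 3 (creating new authority files), while a manuscript that already exists would be left to manual checking or completion.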

ErwinKomen commented 4 months ago

This issue divides into two parts: (1) Manuscripts, (2) Authority Files

Part 1: Manuscript import

Lessons learned

Previous manuscript data imports have taught the following lessons, some of which translate directly into a test:

  1. An editor might name the tab pages of a manuscript Excel differently than expected.
    1. Add test: each Excel may only have two tab pages, and their names must be "Manuscript" and "Sermons"
    2. Todo: provide a template with just this
  2. A user may enter the wrong data type in the 'order' column in the Sermons tab page (e.g. "2bis", which is a string, instead of "2", which is a number)
    1. Add test: validate the data type in each field in the two tab pages.
    2. If a cell is found to have a faulty data type, then provide a readable error message for it
    3. Make sure to validate all cells before presenting the error report to the user
  3. A user may enter signs like '[' and ']' into a text field (like incipit, explicit). These are not in the original text anyway, and they cause havoc in the processing of those fields. This is a kind of semantic error (the type of the field is correct, but the content is not).
    1. Add test: specifically test whether incipit and explicit would be acceptable
    2. Then have an error report deliver readable information about this to the user
  4. The 'stringified JSON' data provided by the user in the Gryson/Clavis field may be semantically wrong, e.g: ["AU Fau 12, 30+AU Ps 113, 1, 3-5+AU Jo 11, 4.16-41+AU Jo 26, 11.2-20, 11.41-12+AU Jo 45, 9.25-39"] might be the wrong way to add data that should actually be: ["AU Fau 12, 30", "AU Ps 113, 1, 3-5", "AU Jo 11, 4.16-41", "AU Jo 26, 11.2-20, 11.41-12", "AU Jo 45, 9.25-39"]
    1. The 'machine' cannot distinguish this intelligently, so the joined string may end up as one single Gryson/Clavis mark
    2. Unless there is a consensus about the format of a Gryson/Clavis signature:
      1. that it should satisfy a particular format (e.g. letters + spaces, followed by a number and that is it)
      2. that a signature has a maximum length
  5. A user may enter wrong library information (e.g. Bodleian Library from the city Oxford in the country Italy)
    1. This is a semantic error. I don't see how this can be prevented automatically.
    2. It should not be possible to enter this anyway, since [Bodleian Library, Oxford, XXX] already exists. See next error.
  6. A user may try to enter a manuscript whose shelfmark is already in Passim
    1. I would say that such information should just not be accepted. No user/editor may add (or overwrite) a manuscript whose shelfmark [library - city - identifier] already exists.
    2. Todo: add this to error report
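
A minimal sketch of the cell-level tests from lessons 2, 3 and 4, assuming the Sermons rows have already been read from the Excel into plain Python dicts. The column name `gryson/clavis`, the maximum length of 30 and the rule that a '+' inside one signature is suspicious are assumptions made for this example; there is no agreed signature format yet.

```python
import json


def check_order(value):
    """The 'order' cell must be a whole number, so a string like "2bis" is rejected."""
    if isinstance(value, bool) or not isinstance(value, int):
        return f"order must be a whole number, got {value!r}"
    return None


def check_no_brackets(field, value):
    """Incipit and explicit may not contain '[' or ']'."""
    if any(ch in value for ch in "[]"):
        return f"{field} contains '[' or ']': {value!r}"
    return None


def check_signatures(value, max_length=30):
    """The Gryson/Clavis cell must be a JSON list of short strings (assumed rules)."""
    try:
        items = json.loads(value)
    except json.JSONDecodeError:
        return ["gryson/clavis is not valid JSON"]
    if not isinstance(items, list):
        return ["gryson/clavis must be a JSON list"]
    problems = []
    for item in items:
        if not isinstance(item, str):
            problems.append(f"signature is not a string: {item!r}")
        elif "+" in item or len(item) > max_length:
            problems.append(f"suspicious signature (joined with '+' or too long): {item!r}")
    return problems


def validate_sermons(rows):
    """Check every row and return all messages, instead of stopping at the first one."""
    errors = []
    for number, row in enumerate(rows, start=2):  # row 1 is assumed to be the header
        found = [check_order(row.get("order"))]
        for field in ("incipit", "explicit"):
            found.append(check_no_brackets(field, row.get(field) or ""))
        found.extend(check_signatures(row.get("gryson/clavis") or "[]"))
        errors.extend(f"Sermons row {number}: {msg}" for msg in found if msg)
    return errors
```

Because `validate_sermons()` walks through all rows before returning, the user gets the complete error report in one go and can fix the Excel in a single round.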

Curation process

An editor (with upload-rights?) should be able to 'submit' a manuscript (defined in an Excel) for import in this way:

  1. Phase 1: The editor attempts to import a manuscript, and the first step would be that an import report is created for that particular (uploaded) manuscript.
    1. If there are errors, an error report is shown to the user, allowing him/her to correct the mistakes in the Excel
    2. If the import report identifies no errors, then a green button may appear, allowing the editor-user to continue to the next phase.
  2. Phase 2: when a manuscript Excel passes the test, the green button "submit to moderator" submits the import request to someone in Passim with moderator rights.
    1. The moderator will have a special section "Import requests to approve"
    2. When a user has pressed the "submit to moderator" button:
      1. the import process for that manuscript is 'locked': the user is not able to make any more changes (e.g. upload a new version of the manuscript)
      2. users who are in the passim_moderator group will receive the request on the home page (or the MyPassim page?) in the form of a line that says, for example: "There are 5 manuscript import requests pending". The moderator can press on that line/button and enter a listview with manuscripts to be approved. Each manuscript import details page ...
        1. ... contains a download-Excel button for the moderator
        2. ... contains a field where the moderator can add notes for the request
        3. ... contains buttons 'reject' and 'approve'.
    3. The user/editor who made the manuscript import request will have something on his/her homepage (or MyPassim page) where the handling of the import requests can be followed
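
The locking behaviour in this curation process can be sketched as follows. This is a plain-Python illustration, not the Django model: the class `ImportRequest` and the status values `draft`, `verified` and `submitted` are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class ImportRequest:
    """Hypothetical import request that follows the two phases described above."""
    owner: str
    excel_name: str
    status: str = "draft"  # assumed values: draft, verified, submitted

    @property
    def locked(self):
        """Once submitted to the moderator, the request can no longer be changed."""
        return self.status == "submitted"

    def upload(self, excel_name):
        """Replace the Excel; a new upload has to be verified again."""
        if self.locked:
            raise PermissionError("import request is locked after submission")
        self.excel_name = excel_name
        self.status = "draft"

    def verify(self, errors):
        """Phase 1: only an error-free report unlocks the green submit button."""
        if self.locked:
            raise PermissionError("import request is locked after submission")
        self.status = "draft" if errors else "verified"
        return self.status == "verified"

    def submit(self):
        """Phase 2: hand the verified request over to the moderators."""
        if self.status != "verified":
            raise ValueError("only a verified import can be submitted")
        self.status = "submitted"


def pending_message(requests):
    """Line shown to moderators on the home page (or MyPassim page)."""
    count = sum(1 for request in requests if request.status == "submitted")
    return f"There are {count} manuscript import requests pending"
```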
ErwinKomen commented 4 months ago

Implementation of curated manuscript import phase 1

  1. Phase 1:
    1. Create a model and detail/listviews for an Excel manuscript import item: ImportSet
      1. Make sure the Excel is kept in a logical place
    2. Facilitate a manuscript import, e.g. via MyPassim > Imports
      1. Add view for My Import Requests in MyPassim: MyPassimEdit > get_related_objects()
      2. Allow clicking through, re-ordering etc in MyPassim
      3. When a user changes the import manuscript, all else must reset
    3. Add Orange button for verification
      1. Functionality:
        1. Correct number and naming of worksheets
        2. Correct content of worksheet Manuscript
        3. Manuscript should at least have: shelf mark, country, city and library
        4. Correct content of worksheet Sermons
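
What the verification button checks at workbook level might be sketched like this. It assumes the Excel has already been loaded into a dict that maps each worksheet name to its content, with the Manuscript worksheet as a field-to-value mapping. The field names used as keys are assumptions, not the actual column headers.

```python
REQUIRED_SHEETS = {"Manuscript", "Sermons"}
# Assumed keys for the minimal information a manuscript must have
REQUIRED_FIELDS = ("shelf mark", "country", "city", "library")


def verify_structure(workbook):
    """Return a list of readable error messages; an empty list means the structure is fine."""
    errors = []
    names = set(workbook)
    if names != REQUIRED_SHEETS:
        errors.append(
            "Expected exactly the worksheets 'Manuscript' and 'Sermons', found: "
            + ", ".join(sorted(names))
        )
    # Only look at the required fields when the Manuscript worksheet is present
    if "Manuscript" in workbook:
        manuscript = workbook["Manuscript"]
        for name in REQUIRED_FIELDS:
            if not str(manuscript.get(name) or "").strip():
                errors.append(f"Manuscript is missing a value for '{name}'")
    return errors
```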


ErwinKomen commented 4 months ago

Implementation of curated manuscript import phase 2

  1. Phase 2:
    1. When Excel file is without errors: show Green button for submit
      1. Note: submit means - a moderator will have a look at it and possibly approve or reject it
    2. Create list view, edit view, details view for ImportReview
    3. Create download view so that moderator can download the Excel associated with an ImportSet
      1. Created ImportSetDownload - done
      2. Get this working as a POST download... - ok, working now
    4. Status changes in ImportSet and ImportReview
      1. When ImportSet is set to "sub" (submitted) - ImportReview must be set to "chg" (changed)
      2. When ImportReview receives a verdict
        1. Either "rej" or "acc": no longer show the 'verdict' buttons (or show them grayed)
        2. When ImportReview is set to "rej" - ImportSet must be set to "rej"
        3. When ImportReview is set to "acc" - ImportSet must be set to "acc" + import must take place
    5. Add link to ImportSet object from ImportReview details view - done
    6. Add link to imported manuscript in the view of ImportSet - done
    7. Get ImportReview on the MyPassim site of anyone who is moderator
      1. Need to have field order in ImportReview after all, to show it correctly in MyPassim
      2. Okay, working
    8. Strategy / system: who is the moderator?
      1. Need to keep the moderator 'open' (None), until a moderator wants to 'look' and edit an ImportReview item
        1. When ImportReview is created (i.e. submitter presses submit), moderator may not be filled in yet
        2. For all moderators, the "Excel import reviews" should show 'all ImportReview' items, whether they belong to a moderator or not
        3. Any moderator who does "reject" or "accept" becomes the moderator for that review

Okay, should be working now. TODO: final double check...

Remaining issues

  1. The project assignment should be visible to the user, and that assignment should be kept when the manuscript is imported via do_import()
  2. Right now there can be errors related to processing bibrefs. But those errors are not shown to the user, nor have they been assigned to either 'Warning' or 'Error' (the latter category means: import manuscript is not possible).
  3. When an Import is rejected: the submitter should get to know the ImportReview details (moderator, notes of review, date)
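
For the second issue, one possible way to carry the 'Warning' / 'Error' distinction into the report is sketched here. The `Issue` class and both function names are hypothetical, and which bibref problems belong to which category still has to be decided.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    severity: str  # "Warning" or "Error"
    message: str


def can_import(issues):
    """Import is only possible when there is no issue of category 'Error'."""
    return not any(issue.severity == "Error" for issue in issues)


def report_lines(issues):
    """Readable lines for the user, errors first."""
    ordered = sorted(issues, key=lambda issue: issue.severity != "Error")
    return [f"{issue.severity}: {issue.message}" for issue in ordered]
```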

Implementation

  1. Project assignment
    1. Added model ImportSetProject to provide the link between import and project
    2. Default saving of ImportSet: this attaches the default_projects to the ImportSet
    3. Adaptation:
      1. Extended form with projlist + only allow projects for which user has edit rights - works
      2. Added relevant code to after_save() to process the m2m - works
    4. Import review also gets to see the projects to which an import will be assigned
    5. When the review accepts the import, the project assignment should be taken up via do_import()
      1. Made changes in custom_add() of SermonDescr and of Manuscript, so that projects can be transferred via kwargs and applied accordingly
  2. Bibref error processing
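
The project assignment steps can be illustrated roughly as below. The function `custom_add_sketch()` is only a stand-in that shows projects travelling via kwargs; it does not reproduce the real signature of custom_add() in SermonDescr or Manuscript, and `allowed_projects()` is a hypothetical helper.

```python
def allowed_projects(requested, editable):
    """Keep only the projects for which the user has edit rights, preserving order."""
    return [project for project in requested if project in editable]


def custom_add_sketch(data, **kwargs):
    """Stand-in for a custom_add()-style function that receives projects via kwargs."""
    item = dict(data)
    item["projects"] = list(kwargs.get("projects", []))
    return item


# Example: the user asks for two projects, but may only edit one of them
projects = allowed_projects(["Passim", "Other"], {"Passim"})
manuscript = custom_add_sketch({"shelfmark": "XXX"}, projects=projects)
```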
ErwinKomen commented 4 months ago

Part 2: Authority File import

no time

ErwinKomen commented 3 months ago

Create Excel template, to be downloadable

  1. Created Excel file
  2. Added Excel file to special folder
  3. Added download view ImportSetManuTempDownload for this file
  4. Added a button in MyPassim to download this template
shariboodts commented 3 months ago

TODO: copy description to import manual (task for myself after launch)