ErwinKomen / RU-passim

Data - CPPM - Action list for import #710

Open shariboodts opened 9 months ago

shariboodts commented 9 months ago

Actions:

A. Import of authority files (first step):

Every text (authority file) has an entry. In each entry, the key "seeker_signature" provides, under key "id", the identifier of the PASSIM authority file, as well as "sermongold_id" and "equalgold_id".

Key "mapped": false means this authority file is seemingly not present in PASSIM.

  1. Import these items as new authority files in PASSIM, remembering to connect to the relevant manuscript from CPPM-data ("manuscripts_uniform.json").
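As a minimal sketch of the selection in step A, the snippet below picks out the entries that still need to be created. It assumes that each entry is a JSON object with "mapped" at the top level next to "seeker_signature"; the real layout of the texts file may differ, and the sample data is invented for illustration.

```python
import json


def unmapped_entries(entries):
    """Return the entries whose "mapped" flag is explicitly false."""
    return [entry for entry in entries if entry.get("mapped") is False]


# Hypothetical sample; the key placement is an assumption, not the real file.
sample = json.loads("""
[
  {"mapped": false, "seeker_signature": {"id": null}},
  {"mapped": true,  "seeker_signature": {"id": 12}}
]
""")

to_import = unmapped_entries(sample)
```

The same filter could be applied to the entries of "manuscripts_uniform.json" in step B, under the same assumption about where "mapped" sits.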

B. Import of manuscript data (second step):

On the basis of "manuscripts_uniform.json" (contains list of entries, each entry equals a manuscript)

key "mapped": false means this manuscript is not present in PASSIM

  1. The PASSIM mapping provides the Library_ID as used in PASSIM and the name of the PASSIM institution, consisting of City + Library
  2. Import these items as new manuscripts in PASSIM, using established connection to Library

C. Connections between the texts: CPPM-data have a lot of links between authority files (third step).

To import these, use key "equality set" (a list of numbers; each number listed in the file should be prefixed with "CPPM I"). Connections between texts are specified according to the fields for link types in PASSIM (e.g. unspecified, uses, etc.).
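A small sketch of how the numbers in "equality set" could be turned into full CPPM signatures. The exact spelling of the signature (here "CPPM I" followed by a space and the number) and the sample numbers are assumptions for illustration only.

```python
def cppm_codes(equality_set):
    """Prefix each bare number with "CPPM I" (separator is an assumption)."""
    return [f"CPPM I {number}" for number in equality_set]


# Hypothetical entry; the numbers are invented.
entry = {"equality set": [1021, 1022]}
codes = cppm_codes(entry["equality set"])
```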

D. Import of manuscript data (fourth step):

Key "mapped": true means this manuscript is potentially present in PASSIM

  1. In this case PASSIM mapping might contain several suggestions (manuscript is present several times in PASSIM)
  2. If mapping contains only one suggestion, and manuscript is green in PASSIM and all items in the manuscript in CPPM-data are mapped: true, do not import.
  3. If mapping contains only one suggestion, and manuscript is orange or red, import the items under key "msitems". First check whether the msitems in CPPM are also present in the PASSIM manuscript description. If yes, check whether the items in PASSIM have data in the fields location in manuscript, connection to AF, incipit, explicit, attribution. If they do, do not import into the corresponding fields, but put all CPPM-data in the Notes field; if they do not, add the CPPM-data to the corresponding fields where available.
  4. If mapping contains more than one suggestion, take no action on these items (will be processed by Rafael).
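The rules of step D can be sketched as two small decision functions. The function names, argument names and return strings are hypothetical; cases the list does not describe (for example a green manuscript whose items are not all mapped, or an msitem that is absent from PASSIM) are returned as "not covered" rather than guessed.

```python
def manuscript_action(suggestions, colour, all_items_mapped):
    """Decide what to do with a CPPM manuscript that has "mapped": true."""
    if len(suggestions) > 1:
        # Rule 4: several candidates, left for manual processing.
        return "leave for manual processing"
    if len(suggestions) == 1:
        if colour == "green" and all_items_mapped:
            return "do not import"      # Rule 2
        if colour in ("orange", "red"):
            return "import msitems"     # Rule 3
    return "not covered"


def msitem_action(present_in_passim, passim_fields_filled):
    """Decide where the CPPM data of one msitem goes (rule 3)."""
    if not present_in_passim:
        return "not covered"
    if passim_fields_filled:
        return "put CPPM data in Notes"
    return "fill corresponding fields"
```

Whether "fields filled" should be judged per field or for the item as a whole is left open here; the sketch takes a single boolean so that either reading can be plugged in.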

EXECUTE UP TO THIS STAGE

E. Import of authority files (fifth step):

Key "mapped": true means there appears to be a match for authority file in PASSIM (problem here is what to do about contradictory information concerning connections between the CPPM-items)

shariboodts commented 9 months ago

Overview of preprocessing action undertaken by Gleb:

First of all, I have not taken into account two parts of the data:

For the rest, I proceed as follows.

Manuscripts:

I started by normalising all the shelfmarks so that one manuscript is always referred to in the same way. At this stage, I’ve also parsed the dates so that their syntax conforms to PASSIM usage. Then, using AI, I found, for all the “mappable” manuscripts, one or multiple PASSIM candidates. For those without a PASSIM counterpart, I’ve mapped them to one of the libraries in PASSIM.

As a result, the attached “manuscripts_uniform.json” contains ready-to-use import suggestions for all the manuscripts in CPPM, as well as manifestations provided by these codices: locations and attributions, if provided.

Texts:

Working with texts, I’ve also somewhat limited my scope. I’ve decided to focus on the information about intra-CPPM textual connections (equality, fontes, usus) and editions. Therefore, the broader bibliography is not parsed so far.

All the connections between two CPPM entries described in “No” (notes) or “Ti” (title) were considered indications of equality. Connections described in “Fo” (fontes) and “Us” (usus) were expressed using PASSIM “uses” and “used by” syntax. “Us-Fo” (usus-fontes) were rendered as “relates”. (Note Shari: I would like to run some tests on this, to see if the principle of equality for these entries holds true)
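The field-to-link-type rule described above could be written down as a lookup table. This is only a sketch: it assumes that "Fo" maps to "uses" and "Us" to "used by" (reading the two pairs in the order given), and the label "equals" for equality is a placeholder rather than a confirmed PASSIM value.

```python
# CPPM field code -> PASSIM link type (assumed pairing, see note above).
LINK_TYPE_BY_FIELD = {
    "No": "equals",      # notes: treated as equality
    "Ti": "equals",      # title: treated as equality
    "Fo": "uses",        # fontes
    "Us": "used by",     # usus
    "Us-Fo": "relates",  # usus-fontes
}


def link_type(field_code):
    """Return the link type for a CPPM field code, or None if unknown."""
    return LINK_TYPE_BY_FIELD.get(field_code)
```

Keeping the rule in one table makes it easy to adjust if the tests on the equality principle lead to a different treatment of "No" and "Ti".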

At this stage, I had to disregard CPL references, since, unfortunately, they are regularly used as bibliographical references. For this reason, it is often impossible to say, without reading the entry, whether a relationship is actually being described or a reference is merely being given.

For every single CPPM entry, I’ve tried to find a sermon gold with the same code. For most entries, there already were relevant PASSIM records. The ids of these records, as well as the ids of the corresponding AFs, are provided.

This data is in “cppm_texts.json”.

Problems:

Unfortunately, in many cases well-structured information is not marked-up. This is, above all, the case with sources. Sometimes, references to medieval works are marked-up, which makes it possible to extract them, but in most cases everything except a CPPM number (if available) is given as plain text, which is regrettable.

HÜWA already reflects a lot of connections described in CPPM, which means that it’ll be complicated to spot redundancies. However, according to my observations, HÜWA often fails to describe some of the intra-CPPM relationships.

For CPPM entries without PASSIM sermon golds, it is possible to run an AI-driven deduplication procedure over the entire corpus of incipits and titles (like I did for manuscripts), but I decided not to do this immediately.

Attached: manuscripts_uniform.json, cppm_texts.json

ThijsRU commented 5 months ago

Phases 1 and 2 are online.