cwrc / islandora-etl

Islandora ETL (Extract / Transform / Load)
GNU General Public License v3.0

titles handling #14

Open ilovan opened 7 months ago

ilovan commented 7 months ago
  1. Simplest scenario: only one titleInfo per level (main or related item), with no qualifying attributes and no subTitle element (by my calculation, 94,123 titles): //*[local-name()="relatedItem"]/*[local-name()="titleInfo" and not(@*)]/*[local-name()="title" and not(@*) and not(following-sibling::*[local-name()='subTitle'])] Check the length: if the title is over 252 characters, truncate it and put the full text in the corresponding full_title field.

  2. Slightly more complicated: only one titleInfo per level (main or related item), with no qualifying attributes and a subTitle child (by my calculation, 3,748 titles): //*[local-name()="titleInfo" and not(@*)]/*[local-name()="title" and not(@*) and following-sibling::*[local-name()='subTitle' and text()]] Concatenate the title and subtitle, separated by a dot, then calculate the total length and truncate / add the full title field as described in item 1.

  3. Even more complicated: multiple titleInfo elements per level. A total of 5 objects in CWRC contain a titleInfo element that is preceded by another titleInfo sibling but doesn’t have a type attribute (//*[local-name()='titleInfo' and preceding-sibling::*[local-name()="titleInfo"] and not(@type)]). These, along with the titles that are typed ‘alternative’ (//*[local-name()='titleInfo' and @type='alternative']), should go in the corresponding alternative title field. About 4 alternative titles have subtitles as well, so those should be concatenated like all the other title/subtitle pairs. There is no need to test the number of characters, since this is not the main title field and can exceed 253. There are also 1560 instances of @type='abbreviated', which should also be mapped to an alternative title field.

  4. titleInfo with nonSort children (2,595 objects): concatenate the nonSort content with the title content. There is no need to adjust capitalization: in the title values I have seen, it is consistent with the conventions of the title's language. Count the length and truncate if needed.

  5. 620 descendants of titleInfo are enclosed in TEI elements - @ilovan to add a "Display title" field with full HTML formatting and provide mappings for TEI elements.
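
The rules in items 1 to 4 can be sketched roughly as follows. This is a minimal Python illustration, not the ETL's actual implementation: the function names, the output field names (`title`, `full_title`, `alternative_titles`), the single space after nonSort, the ". " separator, and truncating to exactly 252 characters are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
MAX_TITLE_LEN = 252  # limit quoted in item 1; the real field limit is an assumption


def child_text(parent, name):
    """Return the stripped text of the first MODS child called `name`, or ''."""
    node = parent.find(f"{{{MODS_NS}}}{name}")
    # itertext() flattens nested markup (such as the TEI elements in item 5)
    return "".join(node.itertext()).strip() if node is not None else ""


def build_title(title_info):
    """Join nonSort, title and subTitle into one string."""
    non_sort = child_text(title_info, "nonSort")
    title = child_text(title_info, "title")
    sub_title = child_text(title_info, "subTitle")
    text = f"{non_sort} {title}".strip()  # assumption: one space after nonSort
    if sub_title:
        text = f"{text}. {sub_title}"     # assumption: ". " as the separator
    return text


def map_titles(parent):
    """Map the titleInfo children of a mods or relatedItem element to fields."""
    fields = {"title": None, "full_title": None, "alternative_titles": []}
    seen_title_info = False
    for title_info in parent.findall(f"{{{MODS_NS}}}titleInfo"):
        text = build_title(title_info)
        kind = title_info.get("type")
        # Item 3: typed alternative/abbreviated, or untyped with a preceding sibling
        is_alternative = kind in ("alternative", "abbreviated") or (
            kind is None and seen_title_info
        )
        seen_title_info = True
        if is_alternative:
            fields["alternative_titles"].append(text)  # no length check needed
        elif kind is None:
            if len(text) > MAX_TITLE_LEN:
                fields["full_title"] = text            # keep the untruncated value
                text = text[:MAX_TITLE_LEN]
            fields["title"] = text
        # Other type values are not covered by the list above and are skipped.
    return fields
```

Calling `map_titles` on a parsed `mods` element (or on each `relatedItem`) returns a dictionary with the main title, the full title when truncation happened, and a list of alternative titles. Display titles with HTML formatting (item 5) are outside the scope of this sketch.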

To Dos:

Spreadsheet with mappings and objects inventory: https://docs.google.com/spreadsheets/d/1S-TYcNnv3g8EQPUwqbJDVO5xpDwIHVTL/edit#gid=2097076917

jefferya commented 6 months ago

A basic implementation is in place; however, a more thorough look into titles is needed. I'm finding places where the above is not strictly true: for example, using not(@*) removes titleInfo elements that have a valueURI.
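
One possible way around the not(@*) problem is to ignore attributes that do not qualify the title instead of rejecting every attribute. A minimal Python sketch follows; the set of ignorable attributes and the helper names are assumptions, not something settled in this thread.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
# Assumption: authority-linking attributes do not change how a title is mapped.
IGNORABLE_ATTRS = {"valueURI", "authority", "authorityURI"}


def is_plain(element):
    """True when the element has no attributes other than the ignorable ones."""
    return all(name in IGNORABLE_ATTRS for name in element.attrib)


def plain_title_infos(parent):
    """Return the titleInfo children that count as having no qualifying attributes."""
    return [
        title_info
        for title_info in parent.findall(f"{{{MODS_NS}}}titleInfo")
        if is_plain(title_info)
    ]
```

In XPath 2.0 or later, the same idea could probably be written as not(@*[not(local-name() = ('valueURI', 'authority', 'authorityURI'))]) in place of not(@*).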

Another open question is what to do when an item has multiple mods:relatedItem elements -- I'm not sure what the end result should be:

(: Find items that have two or more mods:relatedItem elements carrying a titleInfo :)
declare namespace mods = "http://www.loc.gov/mods/v3";
for $item in
  /metadata[count(resource_metadata/(mods:mods|mods:modsCollection/mods:mods)/mods:relatedItem[mods:titleInfo])>=2]
let $id := $item/@pid/data()
order by $id
return $item
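
To review the items this query flags, it may help to list the title of each related item side by side. The Python sketch below assumes the same `metadata` / `resource_metadata` wrapper the query uses; the function name and the returned list shape are hypothetical, and it does not decide what the final mapping should be.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://www.loc.gov/mods/v3"}
# Same two locations the query checks for mods records
MODS_PATHS = ("resource_metadata/m:mods", "resource_metadata/m:modsCollection/m:mods")


def related_item_titles(metadata):
    """Return one title string per relatedItem that carries a titleInfo."""
    titles = []
    for path in MODS_PATHS:
        for mods in metadata.findall(path, NS):
            for related in mods.findall("m:relatedItem", NS):
                if related.find("m:titleInfo", NS) is None:
                    continue  # mirrors the [mods:titleInfo] predicate
                title = related.find("m:titleInfo/m:title", NS)
                text = "".join(title.itertext()).strip() if title is not None else ""
                titles.append(text)
    return titles


def has_multiple_related_titles(metadata):
    """True for the items the query selects (two or more titled relatedItems)."""
    return len(related_item_titles(metadata)) >= 2
```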