DigitalMitford / DM_SiteIndex

a repository for development of our prosopography "site index" file
https://digitalmitford.github.io/DM_SiteIndex/
GNU Affero General Public License v3.0

Editing Site Index Add files, reformatting occupation coding #6

Open ebeshero opened 5 years ago

ebeshero commented 5 years ago

The task

Following a round of planning last summer and some major Digital Mitford project schema updates, we are formalizing a change to the encoding of <occupation> elements in the Digital Mitford Site Index. The site index is published live to support the project at https://digitalmitford.org/si.xml, and its structure is fairly simple and regular: a master list of lists with formal @xml:ids for each named entity on the project and, for the <person> entries, birth and death dates/locations, roleNames, and more. With a group of you working together this week at Digital Mitford Coding School, I am hoping we can update all of our files containing proposed new entries for the site index (called SI_Add files), stored in a couple of directories in this repo, and possibly also the whole site index itself, to meet the new project rules for <occupation> (and fix anything else that does not match project schema standards).

Orientation

A method for proceeding quickly

In my efforts to update SI files, I found the easiest way to proceed was this:

1) First change the format, using regex find and replace or XSLT (pick your tool), by storing any text node, or at least those text nodes that contain no white space after you apply normalize-space(), inside a @type attribute. Many of these will be invalid against our project schema, but it's just a starting point.

2) Then watch what happens when the project schema validates. It will flag everything that's wrong.

3) Find a way to update these so you fix multiple of a kind at once. Where I see an invalid entry such as <occupation type="sculptor"/>, for example, I use find & replace with regex to search out all of them at once (there will likely be others) and replace the capturing groups appropriately, maybe something like this (warning: I didn't test this one--just writing from memory):

Find: (\stype=)("sculptor") (remember: \s is a regex white space)
Replace: \1"artist" subtype=\2

Inspect your results carefully to make sure you don't do anything ill-formed, keeping in mind that attribute values must be surrounded by quotation marks.

4) Keep going with systematically resolving the errors in a file until it becomes valid with the schema.
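For anyone who would rather script steps 1 and 3 than run them by hand in an editor, here is a minimal Python sketch of the same two regex passes, followed by a well-formedness check. It is only an illustration, not the project's tooling: the function names and the sample <person> entry are hypothetical, it assumes bare <occupation> elements with no attributes, and it says nothing about validity against the project schema, which still has to be checked in step 2.

```python
import re
import xml.etree.ElementTree as ET

# Step 1 (assumed input shape): <occupation>sculptor</occupation>.
# Only single-token text is moved into @type; text with internal white space
# or child markup is left untouched for a human to sort out.
TEXT_TO_TYPE = re.compile(r'<occupation>\s*([^\s<>"]+)\s*</occupation>')


def text_to_type(xml_text: str) -> str:
    """Store a one-word occupation text node inside a @type attribute."""
    return TEXT_TO_TYPE.sub(r'<occupation type="\1"/>', xml_text)


def demote_type(xml_text: str, old_type: str, new_type: str) -> str:
    """Step 3: turn type="old" into type="new" subtype="old" everywhere."""
    pattern = re.compile(r'(\stype=)("' + re.escape(old_type) + r'")')
    return pattern.sub(
        lambda m: f'{m.group(1)}"{new_type}" subtype={m.group(2)}', xml_text
    )


def is_well_formed(xml_text: str) -> bool:
    """Check well-formedness only; this does NOT validate against a schema."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample entry, invented for this illustration.
sample = (
    '<person xml:id="Doe_J"><persName>Jane Doe</persName>'
    '<occupation>sculptor</occupation>'
    '<occupation>portrait painter</occupation></person>'
)

updated = demote_type(text_to_type(sample), "sculptor", "artist")
print(updated)
print("well-formed:", is_well_formed(updated))
```

On this sample the one-word entry becomes <occupation type="artist" subtype="sculptor"/>, while the two-word "portrait painter" entry is left alone for a manual decision. Running the well-formedness check after each pass catches slips such as a missing quotation mark around an attribute value.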

ebeshero commented 5 years ago

To begin, let's start you off on the files in the si_Add_Staged directory of this repo

Each of you should work on your own file to avoid merge conflicts of a hideous nature! :-) To begin, can I find a volunteer to take on one of these files in the si_Add_Staged directory?

frabbitry commented 5 years ago

I can claim si-Add_fictionalpeople_OVcorrections.xml

NADGIT commented 5 years ago

I am claiming si-Add_DRAMA_LMW_NEW&REV.xml

BMT45 commented 5 years ago

I am claiming si-Add_LMW_peopleCorrections_HtoZ.xml

frabbitry commented 5 years ago

@ebeshero Hey I just went through mine and fixed the occupation tags to the best of my ability. However, the document didn't validate because 1) there were some issues related to hashtags before id attributes and I didn't know if I was to fix those as well or how to fix those and 2) some of the occupation titles that are given as subtypes don't belong to one of the types. Still, I pushed the document and commented underneath the subtypes I was unsure about.

NADGIT commented 5 years ago

I completed mine. Like frabbitry, I had trouble with the hashtag issues, and some of my occupation tags didn't seem to correlate well. I've put comments with "Fact Check" in them for later review.

ebeshero commented 5 years ago

@frabbitry and @NADGIT These are messy old project files so we can expect more things to be wrong with them. Making decisions about how to file those types and subtypes and how to reinterpret them is part of the process...but so is fixing the simple stuff like hashtag errors. Please do fix anything you see like that. I'll take a quick look at the files and see if we can do a round 2!
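The thread does not say exactly what the hashtag errors look like. One common form is an @xml:id value that begins with "#". That is not allowed, because an xml:id must be a plain name, whereas pointer attributes such as @ref are the ones that carry the "#". Assuming that is the problem here, a small Python sketch of the cleanup might look like the following. The function name and the sample entry are hypothetical.

```python
import re

# Assumption: the errors are xml:id values with one or more leading "#".
# Pointer attributes such as ref="#Doe_J" are deliberately left unchanged.
HASHED_ID = re.compile(r'(\sxml:id=")#+([^"]*")')


def strip_id_hashes(xml_text: str) -> str:
    """Remove leading '#' characters from xml:id attribute values."""
    return HASHED_ID.sub(r"\1\2", xml_text)


# Hypothetical sample entry, invented for this illustration.
sample = '<person xml:id="#Doe_J"><persName ref="#Doe_J">Jane Doe</persName></person>'
print(strip_id_hashes(sample))
```

If the files have the opposite problem (a pointer attribute that is missing its "#"), this pattern will not help, and those cases should be fixed separately after checking what the schema flags.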