Editing Site Index Add files, reformatting occupation coding

ebeshero commented 5 years ago

The task

Following a round of planning last summer and some major Digital Mitford project schema updates, we are formalizing a change to the encoding of <occupation> elements in the Digital Mitford Site Index. The site index is published live to support the project at https://digitalmitford.org/si.xml, and its structure is fairly simple and regular: a master list of lists with formal @xml:ids for each named entity on the project and, for the <person> entries, birth and death dates/locations, roleNames, and more. With a group of you working together this week at Digital Mitford Coding School, I am hoping we can update all of our files containing proposed new entries for the site index (called SI_Add files), stored in a couple of directories in this repo, and possibly also the whole site index itself, to meet the new project rules for <occupation> (and to fix anything else that does not match project schema standards).

Orientation

Open the posted site index file on your computer to get a sense of it and explore it with XPath (we'll be doing that with the coding school group this week, too). You will see that the file is red and invalid according to our project schema, because of the encoding of <occupation> elements.
Take a look at //occupation and you will see a range of encodings from different stages of this project over the past few years.
- The old way was to code an <occupation> element with a text node of all kinds of unregulated stuff while we figured out a typology (a set of formal, standard categorical terms). We want standardized data here so we can drill through our site index and find out how many members of it were involved in bookselling, in trade, in the government, etc. The old way wasn't standardized, and mostly looks like this: <occupation>...text or elements...</occupation>
For the Mitford editing team inside <div type="Mitford_Team">, the <occupation> elements are formatted differently from those in the rest of the site index. I updated these already--they are to include school affiliations to help us output that on our staff page on the website: <https://digitalmitford.org/staff.html> (By the way, for those of you students helping out with the codebase this week, we'll be adding you all as student assistants to the project--we'll just have you write in a listing to add to the site index following our template here in this repo.)
The new way to encode occupation requires that the element hold an @type attribute that accepts a short list of options. It may optionally contain a @subtype attribute. @subtype attributes are keyed to types.
- For the most part, for everyone except the editing team, the <occupation/> element should be empty. (It is allowed to contain text, but the most important information in it is stored in the @type and @subtype attributes.)
- Here are a couple of examples:
```
<occupation type="military"/>
<occupation type="government" subtype="politician"/>
```
Legal values for the attributes are supported by dropdown lists in the project schema, which should be associated with all the SI Add files in this repo. Those values are also listed in our documentation on Occupation Types and Subtypes--you'll want to look at these to get a sense of what subtypes go with what types.
- Other documentation for the project is available here in case you need it.
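
To get a quick count of how many `<occupation>` elements still use the old encoding, something like the following Python sketch may help alongside XPath in your editor. It is only an illustration: the `SAMPLE` data and the function name `survey_occupations` are made up for this example, and the code matches elements by local name on the assumption that the real file declares a namespace.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature stand-in for the site index (not real project data).
SAMPLE = """<listPerson>
  <person><occupation>bookseller and printer</occupation></person>
  <person><occupation>sculptor</occupation></person>
  <person><occupation type="military"/></person>
  <person><occupation type="government" subtype="politician"/></person>
</listPerson>"""


def survey_occupations(xml_text):
    """Count <occupation> elements by encoding style (old vs. new)."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for el in root.iter():
        # Compare on the local name so the check works with or without a namespace.
        if el.tag.rsplit("}", 1)[-1] != "occupation":
            continue
        if "type" not in el.attrib:
            counts["untyped"] += 1            # old way: no @type at all
        elif "subtype" in el.attrib:
            counts["type_and_subtype"] += 1   # new way, with optional @subtype
        else:
            counts["type_only"] += 1          # new way, @type alone
    return counts


print(dict(survey_occupations(SAMPLE)))
```

This only tells you which elements lack a @type; it does not check whether the @type and @subtype values are legal. That job still belongs to the project schema.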

A method for proceeding quickly

In my efforts to update SI files I found the easiest way to proceed was thus:

1) Deal first with changing the format, by regex find and replace or XSLT (pick your tool), storing any text node, or at least those text nodes that contain no white space when you use normalize-space(), inside a @type attribute. Many of these will be invalid against our project schema, but it's just a starting point.

2) Then watch what happens when the project schema validates. It will flag everything that's wrong.

3) Find a way to update these so you fix multiple of a kind at once. Where I see an invalid entry such as <occupation type="sculptor"/>, for example, I use find & replace with regex to search out all of them at once (there will likely be others), and replace the capturing groups appropriately, maybe something like this (warning: I didn't test this one--just writing from memory):

Find: `(\stype=)("sculptor")` (remember: \s is a regex white space)
Replace: `\1"artist" subtype=\2`

Inspect your results carefully to make sure you don't do anything ill-formed, keeping in mind that attribute values must be surrounded by quotation marks.

4) Keep going with systematically resolving the errors in a file until it becomes valid with the schema.
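
If you would rather script steps 1 and 3 than run them by hand in an editor, here is a minimal Python sketch of the same idea. The function names and `SAMPLE` data are hypothetical, and the "sculptor" to "artist" mapping is taken from the untested example above, so check it against the schema's legal values before relying on it.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample entries (not real project data).
SAMPLE = """<listPerson>
  <person><occupation>sculptor</occupation></person>
  <person><occupation> sculptor </occupation></person>
  <person><occupation>bookseller and printer</occupation></person>
</listPerson>"""

# Step 1: an attribute-less <occupation> whose text is one whitespace-free token.
SINGLE_TOKEN = re.compile(r'<occupation>\s*([^\s<>"&]+)\s*</occupation>')


def text_to_type(xml_text):
    """Move single-token text nodes into @type on an empty element."""
    return SINGLE_TOKEN.sub(r'<occupation type="\1"/>', xml_text)


def retype_as_subtype(xml_text, old_type, new_type):
    """Step 3: turn type="old" into type="new" subtype="old", all at once."""
    pattern = r'(\stype=)("' + re.escape(old_type) + r'")'
    return re.sub(
        pattern,
        lambda m: f'{m.group(1)}"{new_type}" subtype={m.group(2)}',
        xml_text,
    )


converted = retype_as_subtype(text_to_type(SAMPLE), "sculptor", "artist")

# Parsing raises ParseError if a replacement left the file ill-formed.
ET.fromstring(converted)
print(converted)
```

Multi-word text nodes such as the third sample entry are left alone for a human decision. Note that the step 3 pattern, like the Find expression above, matches `type="sculptor"` on any element, not only on `<occupation>`, so review the changes before committing. The parse check confirms well-formedness only; validity still has to be checked against the project schema.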

ebeshero commented 5 years ago

To begin, let's start you off on the files in the si_Add_Staged directory of this repo.

This file is completed already (so you can look at it as a model to see how these should come out): si-Add_peoplecorrections_AtoH_LMW.xml

Each of you should work on your own file to avoid merge conflicts of a hideous nature! :-) To begin, can I find a volunteer to take on one of these files in the si_Add_Staged directory?

~si-Add_DRAMA_LMW_NEW&REV.xml~
~si-Add_LMW_peopleCorrections_HtoZ.xml~
~si-Add_fictionalpeople_OVcorrections.xml~

frabbitry commented 5 years ago

I can claim si-Add_fictionalpeople_OVcorrections.xml

NADGIT commented 5 years ago

I am claiming si-Add_DRAMA_LMW_NEW&REV.xml

BMT45 commented 5 years ago

I am claiming si-Add_LMW_peopleCorrections_HtoZ.xml

frabbitry commented 5 years ago

@ebeshero Hey I just went through mine and fixed the occupation tags to the best of my ability. However, the document didn't validate because 1) there were some issues related to hashtags before id attributes and I didn't know if I was to fix those as well or how to fix those and 2) some of the occupation titles that are given as subtypes don't belong to one of the types. Still, I pushed the document and commented underneath the subtypes I was unsure about.

NADGIT commented 5 years ago

I completed mine. Like frabbitry, I had trouble with the hashtag issues, and some of my occupation tags didn't seem to correlate well. I've put comments with "Fact Check" in them for later review.

ebeshero commented 5 years ago

@frabbitry and @NADGIT These are messy old project files so we can expect more things to be wrong with them. Making decisions about how to file those types and subtypes and how to reinterpret them is part of the process...but so is fixing the simple stuff like hashtag errors. Please do fix anything you see like that. I'll take a quick look at the files and see if we can do a round 2!
