Open ebeshero opened 5 years ago
To begin, let's start you off on the files in the si_Add_Staged directory of this repo
Each of you should work on your own file to avoid merge conflicts of a hideous nature! :-) To begin, can I find a volunteer to take on one of these files in the si_Add_Staged directory?
I can claim si-Add_fictionalpeople_OVcorrections.xml
I am claiming si-Add_DRAMA_LMW_NEW&REV.xml
I am claiming si-Add_LMW_peopleCorrections_HtoZ.xml
@ebeshero Hey I just went through mine and fixed the occupation tags to the best of my ability. However, the document didn't validate because 1) there were some issues related to hashtags before id attributes and I didn't know if I was to fix those as well or how to fix those and 2) some of the occupation titles that are given as subtypes don't belong to one of the types. Still, I pushed the document and commented underneath the subtypes I was unsure about.
I completed mine. Like frabbitry, I ran into the hashtag problems, and some of my occupation tags didn't seem to correlate well. I've put comments with "Fact Check" in them for later review.
@frabbitry and @NADGIT These are messy old project files so we can expect more things to be wrong with them. Making decisions about how to file those types and subtypes and how to reinterpret them is part of the process...but so is fixing the simple stuff like hashtag errors. Please do fix anything you see like that. I'll take a quick look at the files and see if we can do a round 2!
The task
Following a round of planning last summer and some major Digital Mitford project schema updates, we are formalizing a change to the encoding of `<occupation>` elements in the Digital Mitford Site Index. The site index is published live to support the project here: https://digitalmitford.org/si.xml and its structure is fairly simple and regular: a master list of lists with formal `@xml:id`s for each named entity on the project, and for the `<person>` entries, birth and death dates/locations, roleNames and more. With a group of you working together this week at Digital Mitford Coding School, I am hoping we update all of our files containing proposed new entries for the site index (called SI_Add files) stored in a couple of directories in this repo, and possibly also the whole site index itself, to meet the new project rules for `<occupation>` (and anywhere else where they are not matching project schema standards).
Orientation
Open the posted site index file on your computer to get a sense of it and explore it with XPath (we'll be doing that with the coding school group this week, too). You will see that the file is red and invalid according to our project schema, because of the encoding of `<occupation>` elements. Take a look at `//occupation` and you will see a range of encodings from different stages of this project over the past few years. Early on, we used the `<occupation>` element with a text node of all kinds of unregulated stuff while we figured out a typology (a set of formal, standard categorical terms). We want standardized data here so we can drill through our site index and find out how many members of it were involved in bookselling, in trade, in the government, etc. The old way wasn't standardized, and mostly looks like this: `<occupation>...text or elements...</occupation>`
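To get a quick feel for how many `<occupation>` elements are in the old versus the new form, a tally along these lines may help. It is only a sketch: it assumes the site index sits in the TEI namespace, and the sample entries and their `xml:id` values (`ex1` to `ex3`) are made up for illustration rather than taken from the real file.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: the site index elements are in the TEI namespace.
TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical sample standing in for si.xml; the ids and entries are invented.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <listPerson>
    <person xml:id="ex1"><occupation>sculptor</occupation></person>
    <person xml:id="ex2"><occupation type="artist" subtype="sculptor"/></person>
    <person xml:id="ex3"><occupation>bookseller and printer</occupation></person>
  </listPerson>
</TEI>"""


def classify(occ):
    """Label one <occupation> element by the style of its encoding."""
    if occ.get("type") is not None:
        return "typed"
    # len(occ) counts child elements; occ.text is the leading text node.
    if len(occ) or (occ.text or "").strip():
        return "untyped with content"
    return "untyped and empty"


def tally(xml_text):
    """Count <occupation> elements by encoding style (like //occupation)."""
    root = ET.fromstring(xml_text)
    return Counter(classify(o) for o in root.iter(TEI + "occupation"))


counts = tally(SAMPLE)
for label, n in sorted(counts.items()):
    print(label, n)
```

To try it on a local copy of the site index, read the file into a string and pass it to `tally()` in place of `SAMPLE`.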
For the Mitford editing team inside `<div type="Mitford_Team">`, the `<occupation>` elements are formatted differently than in the rest of the site index. I updated these already--they are to include their school affiliations to help us output that on our staff page on the website: <https://digitalmitford.org/staff.html> (By the way, for those of you students helping out with the codebase this week, we'll be adding you all as student assistants to the project--we'll just have you write in a listing to add to the site index following our template here in this repo.)
The new way to encode occupation requires that the element hold an `@type` attribute that accepts a short list of options. It may optionally contain a `@subtype` attribute. `@subtype` attributes are keyed to types. The `<occupation/>` element should be empty. (It is allowed to contain text, but the most important information in it is stored in the `@type` and `@subtype` attributes.) Legal values for the attributes are supported by dropdown lists in the project schema, which should be associated with all the SI Add files in this repo. Those values are also listed in our documentation on Occupation Types and Subtypes--you'll want to look at these to get a sense of what subtypes go with what types.
A method for proceeding quickly
In my efforts to update SI files I found the easiest way to proceed was thus:
1) Deal first with changing the format by regex find and replace, or XSLT (pick your tool), storing any text node, or at least those text nodes that contain no white spaces when you use `normalize-space()`, inside a `@type` attribute. Many of these will be invalid against our project schema, but it's just a starting point.
2) Then watch what happens when the project schema validates. It will flag everything that's wrong.
3) Find a way to update these so you fix multiple of a kind at once. Where I see an invalid entry such as `<occupation type="sculptor"/>`, for example, I use find & replace with regex to search out all of them at once (there will likely be others), and replace the capturing groups appropriately, maybe something like this (warning: I didn't test this one--just writing from memory):
Find: `(\stype=)("sculptor")` (remember: `\s` is a regex white space)
Replace: `\1"artist" subtype=\2`
Inspect your results carefully to make sure you don't do anything ill-formed, keeping in mind that attribute values must be surrounded by quotation marks.
4) Keep going with systematically resolving the errors in a file until it becomes valid with the schema.
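For anyone who would rather script steps 1 and 3 than run them in an editor, here is a rough Python sketch of the same two regex passes. The function names `text_to_type` and `retype` are made up for this example, and the sample string is invented. Like the find and replace above, the second pass matches `type="..."` on any element, not just `<occupation>`, and it would produce a duplicate attribute if a `subtype` were already present, so inspect the results before committing.

```python
import re

# Step 1: a single-word text node becomes the @type value and the element
# is emptied. Text nodes with internal white space are left alone.
BARE_TEXT = re.compile(r'<occupation>\s*([^\s<>"]+)\s*</occupation>')


def text_to_type(xml_text):
    """Turn <occupation>word</occupation> into <occupation type="word"/>."""
    return BARE_TEXT.sub(r'<occupation type="\1"/>', xml_text)


def retype(xml_text, old_type, new_type):
    """Step 3: move an invalid @type value into @subtype under a new @type.

    Mirrors Find: (\\stype=)("sculptor")  Replace: \\1"artist" subtype=\\2
    """
    pattern = re.compile(r'(\stype=)("' + re.escape(old_type) + r'")')
    return pattern.sub(
        lambda m: f'{m.group(1)}"{new_type}" subtype={m.group(2)}',
        xml_text,
    )


# Invented sample: one single-word entry and one multi-word entry.
sample = (
    "<occupation>sculptor</occupation>"
    "<occupation>bookseller and printer</occupation>"
)
step1 = text_to_type(sample)
step3 = retype(step1, "sculptor", "artist")
print(step1)
print(step3)
```

The multi-word entry is untouched by both passes, which leaves it for a human decision about which type and subtype it belongs to.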