The tagging saga continues.... Dashes and Quotation Marks

RJP43 commented 8 years ago

I am taking a closer look at a few aspects of the articles that we need to standardize, and I am tossing around a few different ways to code these aspects here: @spadafour @ebeshero We will need to discuss how to edit these in past article transcriptions, and possibly write schematron rules that fire on these in future transcriptions.

Let's first discuss dashes:

There are plenty of hyphenated words that use the regular hyphen dash &minus; (−). I was thinking we could incorporate this into our XSLT (the one that makes the reading view of the articles) so that it transforms what we have been typing (the - from the keyboard) into the correct unicode (&minus;), and/or we could require that future transcriptions use &minus; in place of just hitting the - on the keyboard. Either way, this needs to be done to avoid future issues where the browser doesn't recognize the character. We have seen this happen a few times already: when Matt was transforming an old article's markup for CDV XSLT Exercise 2, his browser showed an unknown character for every - that had been input. My concern is that when we put the unicode in the XML transcription, oXygen throws the following error: F [ISO Schematron] The entity "minus" was referenced, but not declared. I think this is because the TEI doesn't allow for these unicode options right in the text, and we have to investigate the use of characters/punctuation further. But basically we could mark them in some way and then, as part of the XSLT, grab those markers and transform them into the unicode for the HTML output. Here are a few examples (from article 1888-08-19) of when we would be using the regular hyphen (dash): four-story, devil-chaser, Paper-Box
Then we have the extended dash, or em dash, which we frequently see used in the Barkley text for censored words (orgNames, streets, and persNames). The unicode for this is &mdash;, which gets us this: —. Again we have the same issue as before: TEI doesn't just allow for the unicode. Another concern is that we don't have a way to type this longer dash on the keyboard besides multiple regular dashes (---), so again we need to investigate further how to represent these in the text. Do we care to make the distinction? Would we want to just put one regular hyphen, or keep the three hyphens and then just have the XSLT create the unicode for the hyphen (as discussed above) when making our HTML reading view, or do we find a way in TEI to distinguish these two different types of dashes? This is an example (from article 1888-08-19, where we incorporated the Barkley text into an <app><rdg> version-ing setup) of when we would be using the em dash (long dash): <app><rdg wit="#CT021">34 to 38 East Randolph</rdg><rdg wit="#WSGC23">on R---</rdg></app> street. We also see this in other parts of the articles when speech gets cut off mid-sentence, like this: Can't you make it five? She just dotes on children. If she won't take him I'll be No. 2 and run for the chance. Can't you induce him to call here? We are tailoresses here, but when we appear upon the street we are ---
There is one other type of dash that we might consider in case it pops up (I have not seen it yet, but if we come up with a TEI system we might include it anyway): the en dash, &ndash; (–).
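To illustrate the entity error described above: XML itself predefines only five named entities (amp, lt, gt, apos, quot), so a parser with no DTD declaring `minus` rejects `&minus;`, while numeric character references need no declaration. Below is a minimal sketch using Python's standard-library parser rather than oXygen. The `<p>` snippets and the `parses` helper are made up for illustration and are not from the project files. Note also that U+2212 is the Unicode MINUS SIGN, whereas the keyboard - is U+002D HYPHEN-MINUS.

```python
import xml.etree.ElementTree as ET

def parses(xml_text):
    """Return the parsed element's text, or None if the XML is not well-formed."""
    try:
        return ET.fromstring(xml_text).text
    except ET.ParseError:
        return None

# Named HTML entity: not declared in plain XML, so parsing fails.
named = parses("<p>four&minus;story</p>")

# Numeric character references work without any declaration.
minus = parses("<p>four&#x2212;story</p>")   # U+2212 MINUS SIGN
mdash = parses("<p>on R&#x2014;</p>")        # U+2014 EM DASH
ndash = parses("<p>1888&#x2013;1889</p>")    # U+2013 EN DASH (hypothetical range)

print(named is None, minus, mdash, ndash)
```

If this holds for our files too, typing the numeric reference (or the literal character) in the transcription might sidestep the "referenced, but not declared" error, but that would need checking against our schema.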

My next concern is the formatting of our quotation marks. There are several uses of quotation marks in different contexts in this project that we might mark in different ways. My reason for standardizing how we represent these is so that we can use curly quotes (single and double) and apostrophes appropriately, according to David's (@djbpitt) project suggestions from last semester.

My biggest concern is the use of quotation marks around dialogue. For this I was thinking we should use the TEI element <q> in place of the pseudo-markup (quotation marks). In fact, instead of having the <said> elements we could just use <q> elements and move the attributes previously on <said> onto the <q>. I have tested this, and <q> does accept those same attributes. This would mean changing the <said> tags in past articles and editing our SVGs and XSLTs accordingly (which it seems we will be editing a lot either way). Alternatively, it would be chunky markup, but we could keep the <said> with all of its attributes and have the <q> element sit inside, just replacing the pseudo-markup (quotation marks). There might be a TEI reason against this, so we would want to verify we aren't breaking TEI rules (@ebeshero) if we decide to do it like that. This is what we currently have: <said who="#employee" ana="male">"All but him binds packages; he glues."</said>
Options to change to:
<said who="#employee" ana="male"><q>All but him binds packages; he glues.</q></said> OR
<q who="#employee" ana="male">All but him binds packages; he glues.</q> OR
We add a @rend / @rendition attribute on either the <said> or the <q> that points to the use of quotes; we would need to figure that out more using this section of the TEI
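As a rough idea of what converting past articles to the second option could involve, here is a minimal Python sketch that renames <said> to <q>, keeps the attributes, and strips the wrapping straight quotes. It is not our actual tooling: the `said_to_q` helper is hypothetical, the sample has no TEI namespace (our real files would), and it only handles a <said> with no child elements.

```python
import xml.etree.ElementTree as ET

def said_to_q(root):
    """Rename every <said> to <q>, keep its attributes, drop wrapping straight quotes."""
    for node in list(root.iter("said")):
        node.tag = "q"
        text = node.text or ""
        # Simple case only: no child elements, a straight quote at both ends.
        if len(node) == 0 and len(text) >= 2 and text[0] == '"' and text[-1] == '"':
            node.text = text[1:-1]
    return root

sample = '<p><said who="#employee" ana="male">"All but him binds packages; he glues."</said></p>'
converted = said_to_q(ET.fromstring(sample))
print(ET.tostring(converted, encoding="unicode"))
```

The first option (a <q> nested inside <said>) would need a slightly different rewrite, and mixed content inside the dialogue would need the closing quote removed from a child's tail rather than from the element text.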
Another instance where quotes are used is for specific phrases or emphasized words. For example, in this paragraph we see two points where words are set off in quotes: Chicago, Aug. 13. - TO THE EDITOR: One who reads your articles with more than passing interest, and who deeply sympathizes with the cause of honest labor, has sufficient romance in his "make-up" to perform his part in assisting the young lady of brains referred to, and if honesty of purpose, good bringing up, etc., accompany the brains, the lady can find at the head of an honest, temperate, working-man's home a peace and comfort not found in "wearing out her young life" in pursuit of a mere existence. And we see this frequently, not just in cases where the person writing to the editor might be quoting a past article's wording. For example: Nothing short of a Philadelphia lawyer, a Chicago health officer, a proprietor or a "devil-chaser" that hits the spot once in a thousand times could, without a guide, explore the labyrinth that is known as H. Schultz & Co.'s paper-box manufactory... Because we see these multiple uses, I have been looking into two options in the TEI, <emph> versus <hi>. I think we could get away with using just one or the other, and since we cannot be sure the intention of the quotes is emphasis, <hi> seems more logical. I would like @ebeshero's input on this though. @spadafour you can read more about the difference between the two here to weigh in as well. It seems either choice uses the @rend / @rendition attribute to declare how the emphasis or highlighting is marked, and we may need clarification on the difference between the two (@rend vs. @rendition).
Sometimes we get a similar marking of specific words in single quotes instead of double quotes, and we should probably figure out a way to separate those out as well so they can be styled and transformed accordingly. Here is an example; this one sits inside a set of dialogue quotes, which I have replaced with the q markup discussed above: <p><q>Are not the 'white slave' articles in THE TIMES somewhat sensational?</q></p>
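One way the single/double distinction could be carried through to the reading view is to record it on <hi> with @rend and let the transformation choose the curly quotes. The sketch below shows that logic in Python rather than XSLT. The @rend values "single" and "double" and the `render` helper are assumptions for illustration, not values we have agreed on.

```python
import xml.etree.ElementTree as ET

# Hypothetical @rend values mapped to opening and closing curly quotes.
QUOTES = {
    "single": ("\u2018", "\u2019"),
    "double": ("\u201c", "\u201d"),
}

def render(elem):
    """Flatten an element to plain text, wrapping <hi rend="..."> in curly quotes."""
    parts = [elem.text or ""]
    for child in elem:
        inner = render(child)
        if child.tag == "hi" and child.get("rend") in QUOTES:
            open_q, close_q = QUOTES[child.get("rend")]
            inner = open_q + inner + close_q
        parts.append(inner)
        parts.append(child.tail or "")  # text that follows the child element
    return "".join(parts)

sample = ('<p>Are not the <hi rend="single">white slave</hi> '
          'articles in THE TIMES somewhat sensational?</p>')
rendered = render(ET.fromstring(sample))
print(rendered)
```

An XSLT template matching hi with a given @rend could do the same thing; the point is only that the quote characters would no longer need to live in the transcription.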

RJP43 commented 8 years ago

To differentiate between single and double quotes, we can put an @rend on the <hi> element (@nlottig94's suggestion).

Just remove all quotation marks, and in the XSLT that creates the HTML, re-insert the curly quotes around the contents whenever we have a <said> (@spadafour's suggestion).
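The second suggestion could look roughly like this, sketched in Python instead of XSLT. The `said_to_html` helper, the span wrapper, and its class name are hypothetical, and the sketch assumes a <said> with text only (no child elements).

```python
import xml.etree.ElementTree as ET
from html import escape

def said_to_html(said):
    """Render one text-only <said> as an HTML span wrapped in curly double quotes."""
    text = escape(said.text or "")
    return '<span class="said">\u201c' + text + '\u201d</span>'

said = ET.fromstring(
    '<said who="#employee" ana="male">All but him binds packages; he glues.</said>'
)
html_out = said_to_html(said)
print(html_out)
```

With this approach the transcription carries no quotation marks at all, and the choice of quote style lives in one place in the transformation.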

RJP43 commented 8 years ago

We decided to remove all quotation marks and replace them with <said> elements for dialogue and otherwise with <hi>, as discussed above.

As for dashes, we will just use the - and transform that in the XSLT or CSS so that it renders as the HTML unicode. The same goes for the "—", which appears in the text either already as the unicode or as a sequence of 2-3 -.
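A minimal sketch of the dash handling described above, written in Python to show the logic the XSLT would need. The `normalise_dashes` helper and its `hyphen` parameter are hypothetical. Treating runs longer than three hyphens as an em dash too is an assumption. Single hyphens are left as typed by default because we have not settled which code point they should become.

```python
import re

EM_DASH = "\u2014"

def normalise_dashes(text, hyphen="-"):
    """Turn runs of two or more keyboard hyphens into one em dash.

    Single hyphens are replaced with `hyphen`, which defaults to the
    keyboard hyphen itself (so they stay unchanged).
    """
    def swap(match):
        return EM_DASH if len(match.group()) >= 2 else hyphen
    return re.sub(r"-+", swap, text)

print(normalise_dashes("on R--- street"))
print(normalise_dashes("four-story, devil-chaser, Paper-Box"))
```

Text that already contains the unicode em dash passes through untouched, so both forms found in the transcriptions end up the same in the output.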

RJP43 / CitySlaveGirls

The tagging saga continues.... Dashes and Quotation Marks #48