SASDigitalHumanitiesTraining / TextEncoding

Text Encoding for Ancient and Modern Literature, Languages and History

Discussion of text encoding exercise #2

Closed gabrielbodard closed 3 years ago

gabrielbodard commented 3 years ago

Please use this thread to discuss the text encoding exercise (your markup of the page of the Dunciad), thinking in particular, but not exclusively, about the following questions:

  1. Is Markdown or HTML better suited to encoding this example? (Is there any difference?)
  2. What might we want to be able to encode in this text that Markdown/HTML doesn’t allow?
  3. What features might you want to add that are only possible in the digital medium?
  4. Do you think you have to sacrifice display for semantics? (Or vice versa?)
  5. Who is the imagined audience of your web page? Does that affect the decisions you made?
sergiobassocina commented 3 years ago
  1. Markdown is a markup language with a simple plain-text syntax, designed so that it can be converted to HTML and many other formats using a tool of the same name; intuitively, choosing between Markdown and HTML depends on how comfortable the user feels with HTML. Markdown has one apparent advantage: it can be converted to HTML, so it sounds more flexible. However, a legion of converters is available online to translate HTML into other markup languages, so this advantage is meaningless. On top of that, Markdown has a substantial minus: it is a form of abandonware.
sergiobassocina commented 3 years ago
  2. We might want to encode spaces, tabs or margins, which HTML does not preserve. We might want to encode the font, whereas HTML doesn't support all font styles. HTML doesn't recognize line breaks or paragraph breaks (unless you code them in).
sergiobassocina commented 3 years ago

I see questions 4 & 5 as closely related. Encoding depends on why I am encoding: for what use and for what (potential) audience. Whether to "sacrifice display for semantics" depends on who is going to use my encoding, and for what. Encoding every possible feature is a utopia; or rather, thanks to online team encoding, a process based on numerous encoders with different needs and therefore different approaches might capture a broader spectrum of encoded features, which in turn might cater to a broader (potential) range of users.

hannah-sonbol commented 3 years ago

On nos. 3 & 5: In Egyptology, the online dictionary Thesaurus Linguae Aegyptiae includes words as well as translated texts, and the two are interlinked. The words are categorized by grammatical function (verb, substantive, adverb, adjective, etc.), which lets the academic analyse the words on the one hand (for example: what is the most used preposition in ancient Egypt?) and the encoded texts on the other (which words are used most or least in a text?). There are many possibilities to go further.

So it really depends on the question you put to it and who should read it (academic, non-academic).

FlannelBanshee commented 3 years ago

I think so far this is like Chris or Gabby said during the introduction: this is a writing tool, not a web publishing tool. There is little to no semantic markup from what I can tell (though at this point I am just a babe in the woods), so it's more like representational stuff. I do think it's interesting because it's not proprietary, like so many basic styling things are, even though we don't tend to think about it day to day: Word/Pages, like they mentioned, and probably Adobe things like PDFs as well. Markdown is sort of free of those shackles, or can be, if it's "flavorless". Right? Or am I not understanding?

Orpheus22 commented 3 years ago

With respect to 1, I see a number of differences. HTML's greater (yet still limited) set of tags gives it some advantages. I've found myself wanting to encode numerous features of this text which I couldn't distinguish through Markdown, but HTML can capture: for example, the titles of the various works mentioned in the Remarks could be distinguished at a semantic level by the cite tag. So I'd have some preference for incorporating more HTML into the transcription.

Question 2 involves a number of further items, beyond the titles from the Remarks which I've already mentioned: the page number at the top at the least (if not the catchword at the bottom); the historical personae identified there as well; the Latin texts appearing in the motto from the Remarks and in the quotation from the Aeneid in the Imitations; indentation in the lines of verse; further indentation to separate the verse numbers from the verse itself. I'm not sure if all of these cannot be captured through HTML, but I couldn't think of any way to treat them distinctively in Markdown.

Orpheus22 commented 3 years ago

Question 3: links! Perhaps this response is too obvious, but I feel it must be said that a major advantage of a digital edition is its potential as a hypertext.

And with respect to question 4, by employing Markdown alone, it seems inevitable that sacrifices must be made. But I don't see that those same sacrifices would have to be made by combining XML with CSS.

gabrielbodard commented 3 years ago

@Orpheus22: re greater expressivity of HTML, it is also worth noting that in most implementations of Markdown (perhaps not all) you can also embed HTML tags (including CSS) to capture any semantics or rendering features that the Markdown syntax can't handle by itself. If that's the case, then essentially there is by definition nothing that Markdown can't do that HTML can, if you see what I mean…

cmohge1 commented 3 years ago

@sergiobassocina many thanks for your reflections on the questions. Perhaps you could say the original implementation of Markdown is considered abandonware because it had no formal specification, but flavours of Markdown such as GitHub Flavored Markdown (which is used here) are well supported and maintained. The initial idea of Markdown was to create a syntax for plain text that could be easily converted to HTML, which is obviously not abandonware. Nor would I call the current CommonMark (https://commonmark.org/) standard abandonware. Markdown can still be shared in several contexts. In any case, the question to consider is whether you will use Markdown as a quick transcription that will be transformed into another form. Ultimately it depends on the aims of the project.

Orpheus22 commented 3 years ago

> @Orpheus22: re greater expressivity of HTML, it is also worth noting that in most implementations of Markdown (perhaps not all) you can also embed HTML tags (including CSS) to capture any semantics or rendering features that the Markdown syntax can't handle by itself. If that's the case, then essentially there is by definition nothing that Markdown can't do that HTML can, if you see what I mean…

Thanks for the response, @gabrielbodard, I take your point. I guess I'm conceiving Markdown as a resource taken on its own, as if from the point of view of someone (perhaps my students?) who might be able to pick up Markdown quickly, but would not yet have HTML in their command. But I do see how Markdown can be a welcome appendix to, or abbreviation of, the HTML tagset.

jbrown60 commented 3 years ago
  3. I think the digital format is well suited to representing different variants and editorial interventions. This can be helpful for people interested in searching for all cases of the word "scarce", say, whether it is written with long 's' or not.
sergiobassocina commented 3 years ago

I have some problems on how/if we should/would mark up the page layout - the 2 columns of the “Remarks” section, for example. I tried but the HTML preview output is messy. I followed the rule from https://guides.github.com/pdfs/markdown-cheatsheet-online.pdf “You can create tables by assembling a list of words and dividing them with hyphens - (for the first row), and then separating each column with a pipe | :” the preview is not the one I expected. Any ideas?

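In addition to that, I launch two proposals:

  • if we were scholars specialized in fonts, it would be interesting to encode the long s, variant of s. I doubt there is any Unicode for this obsolete character. I found one but with a horizontal bar:

Unicode: 383 (decimal), U+017F
UTF-8: 197 191 (decimal), C5 BF (hex)

  • if we were scholars obsessed with marginalia or readers’ practices, we might be interested in marking up the two pen scratches of the image, probably tagging a detail of the image itself? Markdown can have folders of images attached to the main text.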
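
The values quoted for the long s can be checked with a few lines of Python. This is only a verification sketch using the standard `unicodedata` module; it also prints the name that Unicode assigns to the code point.

```python
import unicodedata

long_s = "\u017f"

print(unicodedata.name(long_s))        # official Unicode character name
print(ord(long_s), hex(ord(long_s)))   # code point in decimal and hexadecimal
utf8 = long_s.encode("utf-8")
print(list(utf8), utf8.hex(" "))       # UTF-8 bytes in decimal and hexadecimal
```

Whether the glyph shows a horizontal bar depends on the font used to display it, not on the code point.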
aghague commented 3 years ago

@sergiobassocina

> I have some problems on how/if we should/would mark up the page layout - the 2 columns of the “Remarks” section, for example. I tried but the HTML preview output is messy. I followed the rule from https://guides.github.com/pdfs/markdown-cheatsheet-online.pdf “You can create tables by assembling a list of words and dividing them with hyphens - (for the first row), and then separating each column with a pipe | :” the preview is not the one I expected. Any ideas?

Do we need to preserve the columns? My understanding is that the focus of a markdown version of the text is on maximising the accessibility/usability of text content and internal structure, not on preserving layout. My current thinking is that we should preserve the text and its logical structure and ignore the columns, and perhaps insert a comment in the code to signal that the original layout is different. If I was working on this as part of one of my projects or for a classroom activity, I'd probably display an image of the original page alongside the encoded text.

> In addition to that, I launch two proposals:

>   • if we were scholars specialized in fonts, it would be interesting to encode the long s, variant of s. I doubt there is any Unicode for this obsolete character. I found one but with a horizontal bar:

> Unicode 383 U+017F UTF-8 197 191 C5 BF

I thought about this, too, and I share your interest in preserving as many of the features of the text as possible. The more I thought about it, though, the more I became convinced that, in terms of its impact on textual analysis, the exact letterform used is pretty much irrelevant. Once again, my current thinking favours inserting a comment in the code to mark the discrepancy and, perhaps, adding an editorial note to the encoded version signalling and justifying the choice. Historians might feel differently, though - I hope we get to hear what they think.

>   • if we were scholars obsessed with marginalia or readers’ practices, we might be interested in marking up the two pen scratches of the image, probably tagging a detail of the image itself? Markdown can have folders of images attached to the main text.

Related to this, I was wondering how much of the information pertaining to the textual source we should preserve and how. I'm thinking here of the two watermarks regarding the University of California & the Google digitisation, the timestamp, the URL and the Public Domain notice.

sergiobassocina commented 3 years ago

@aghague "My understanding is that the focus of a markdown version of the text is on maximising the accessibility/usability of text content and internal structure, not on preserving layout." Does layout convey meaning? Maybe for some users, yes it does. Let's think of the "Remarks" section: although it is framed in a different layout, it is authorial, not written by an editor. However, in other texts, the remarks are usually written by an editor. Let's imagine that we are supposed to encode it, and that in one year's time a researcher would scan it (as happens, for example, with the Thesaurus Linguae Graecae). Will we make it possible for him/her to distinguish a word occurrence in the text from a word occurrence in the paratext? We might want to distinguish the different levels of the text with a tag (which one? e.g. heading vs text vs paratext?). Maybe Markdown cannot.
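
One way to picture the text/paratext distinction is a sketch in which each level of the page is wrapped in a labelled HTML container, so that word occurrences can be counted per level. The class names `verse` and `remarks`, the `SectionWordCounter` helper and the sample sentences are all assumptions made for illustration; they are not the encoding used by anyone in this thread.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class SectionWordCounter(HTMLParser):
    """Count words separately for each <div class="..."> section."""

    def __init__(self):
        super().__init__()
        self.stack = []    # class names of the currently open <div> elements
        self.counts = {}   # section name -> Counter of lower-cased words

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.stack.append(dict(attrs).get("class") or "unlabelled")

    def handle_endtag(self, tag):
        if tag == "div" and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if not self.stack:
            return  # ignore text outside any labelled section
        words = re.findall(r"[a-z]+", data.lower())
        self.counts.setdefault(self.stack[-1], Counter()).update(words)


# Invented sample page: one sentence of "text", one of "paratext".
page = """
<div class="verse"><p>The goddess smiled upon the dunce.</p></div>
<div class="remarks"><p>The dunce here is a placeholder name.</p></div>
"""

counter = SectionWordCounter()
counter.feed(page)
print(counter.counts["verse"]["dunce"], counter.counts["remarks"]["dunce"])
```

Plain Markdown has no equivalent of the `class` attribute, which is why this sketch relies on HTML containers; in implementations that allow embedded HTML, the same containers could sit inside a Markdown file.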

aghague commented 3 years ago

@gabrielbodard sergiobassocina

> @aghague "My understanding is that the focus of a markdown version of the text is on maximising the accessibility/usability of text content and internal structure, not on preserving layout." Does layout convey meaning? Maybe for some users, yes it does.

I would agree that the link between layout and meaning can be a complex one, and that the impact of preserving or, alternatively, sacrificing the original layout is likely to be felt more strongly in some disciplines than in others. I would still argue, though, that in many (perhaps most) use cases, preserving the structure of the text, adding an editorial note to the output and having commented code in the markdown file is likely to suffice.

To give you an example, the way I would use a structured text version of this excerpt using contemporary letterforms in the course of my teaching would be the following:

> Let's think of the "Remarks" section: although it is framed in a different layout, it is authorial, not written by an editor. However, in other texts, the remarks are usually written by an editor. Let's imagine that we are supposed to encode it, and that in one year's time a researcher would scan it (as happens, for example, with the Thesaurus Linguae Graecae). Will we make it possible for him/her to distinguish a word occurrence in the text from a word occurrence in the paratext?

I would deal with this by making sure my version of the text comes with its own editorial comments that clarify the process, and that the code that underpins it is properly commented. Following a similar discussion in the technical issues forum, I did a bit of research and found what looks like a good solution to commenting Markdown code: https://www.jamestharpe.com/markdown-comments/ I am now in the process of experimenting with this in my version of the course task.
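
As an illustration of what commented Markdown can make possible, here is a minimal sketch that assumes the editorial notes are written as HTML comments (`<!-- ... -->`), which most Markdown implementations pass through without displaying. This is an assumption about the syntax; the linked article may recommend a different one. The function names and the sample text are hypothetical.

```python
import re

# Assumption: editorial notes are written as HTML comments in the .md file.
COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)


def editorial_notes(markdown_source: str) -> list[str]:
    """Return the text of every comment, in document order."""
    return [note.strip() for note in COMMENT.findall(markdown_source)]


def without_notes(markdown_source: str) -> str:
    """Return the source with all comments removed, e.g. for a clean reading copy."""
    return COMMENT.sub("", markdown_source)


# Invented sample fragment with one editorial note.
source = """### Remarks
<!-- Original layout: two columns; linearised here. -->
Text of the remarks.
"""

print(editorial_notes(source))
```

Collecting the notes this way would let the same file yield both a reading text and a list of documented encoding decisions.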

> We might want to distinguish the different levels of the text with a tag (which one? e.g. heading vs text vs paratext?). Maybe Markdown cannot.

My solution at the moment is to have the following content structure:

h1: The Dunciad
h2: Book II
h3: [p. ] 108
[verse]
h3: Remarks
[text]
h3: Imitations
[text]
Footnotes:
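
A heading structure like this one can be checked mechanically. The sketch below assumes the headings are written in ATX style (`#`, `##`, `###`) and ignores complications such as fenced code blocks; `outline` and `skipped_levels` are hypothetical helper names, and the sample simply mirrors the structure listed above.

```python
import re

# One to six '#' characters, at least one space, then the heading text.
HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def outline(markdown_source: str) -> list[tuple[int, str]]:
    """Return (level, title) pairs for every ATX heading, in document order."""
    headings = []
    for line in markdown_source.splitlines():
        match = HEADING.match(line)
        if match:
            headings.append((len(match.group(1)), match.group(2)))
    return headings


def skipped_levels(headings: list[tuple[int, str]]) -> list[tuple[int, str]]:
    """Return headings that are more than one level deeper than the previous one."""
    problems = []
    previous = 0
    for level, title in headings:
        if level > previous + 1:
            problems.append((level, title))
        previous = level
    return problems


sample = """# The Dunciad
## Book II
### [p. ] 108
### Remarks
### Imitations
"""

for level, title in outline(sample):
    print("  " * (level - 1) + f"h{level}: {title}")
```

Printing the outline makes it easy to see that the page number and the subsection titles currently sit at the same level, and to re-check the hierarchy after any renumbering.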

I am not entirely happy with the page number being at the same level in the text hierarchy as the subsection titles, so I am still mulling whether to change this and introduce an h4 level for the subsection titles.

sergiobassocina commented 3 years ago

@aghague "commenting Markdown code: https://www.jamestharpe.com/markdown-comments/": wonderful insight, thank you! Maybe h5: footnotes and h6: subsection titles. And it would be great to be allowed to create an h2.5 level (I mean, to go decimal a posteriori; what if one changes one's mind about numbering the segments only at the end?)

sergiobassocina commented 3 years ago

If you want, we can share our .md documents in a folder. I uploaded mine here, feel free to comment: https://drive.google.com/drive/folders/1pkUJ_HDl_naODuqyoKbscBJnG--teB43?usp=sharing

aghague commented 3 years ago

@sergiobassocina I would love to have a look (and to share mine), but I am at a conference all day today (9am to 7 pm). I'll try to make some time for it this evening, but may be too tired to engage in a productive discussion. We'll see.

bgarnand commented 3 years ago

LIGATURES

What might we want to be able to encode in this text that Markdown/HTML doesn’t allow?

HTML has the long s+t ligature (as in this doc) but not the c+t, and I don't know if there's a font that has that character. I like how Markdown just displays special characters without the need to enter Unicode, as in HTML.

Audience of your web page? Does that affect the decisions you made?

If my students are my audience, the ligatures and long s would be off-putting; it would be nice to have an option to either display those characters or not (e.g. on mouse-over).

aghague commented 3 years ago

@sergiobassocina

> if you want we can share our .md documents in a folder. I uploaded mine here, feel free to comment: https://drive.google.com/drive/folders/1pkUJ_HDl_naODuqyoKbscBJnG--teB43?usp=sharing

My version is here: https://github.com/aghague/coursework/blob/main/The-Dunciad-VS.md I haven't managed to look at yours yet, sorry - it's been a very long day.

cmohge1 commented 3 years ago

@bgarnand This is a great question, and a challenging one. As with so many digital encoding questions, you have to puzzle out the importance of (sometimes arbitrary) printing conventions and whether those conventions ought to be encoded (and why). If you look at the Unicode guidelines (https://www.unicode.org/faq/ligature_digraph.html), there is no good reason to encode such font-specific features because ultimately the machine needs clarity about how to parse the text strings. By replacing two distinct letters with a single Unicode ligature character you may confuse word searches. A ligature was a pragmatic printing convention that was adopted to avoid breaking type between two characters. In any case, if Unicode has a ligature then by all means you could encode that. But a good solution is to show a facsimile of the original typeface alongside a machine-readable transcription.
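
The search problem can be illustrated with a short sketch. It assumes a transcription that uses the two ligature characters Unicode does provide, U+FB05 (long s + t) and U+FB06 (s + t), and uses Unicode compatibility normalization (NFKC) from Python's standard `unicodedata` module to build a searchable copy. The sample string and the `searchable` helper are invented for illustration.

```python
import unicodedata


def searchable(text: str) -> str:
    """Return a copy in which compatibility characters such as ligatures are expanded."""
    return unicodedata.normalize("NFKC", text)


# Invented sample: "most just" typed with the two ligature characters.
printed = "mo\ufb05 ju\ufb06"

print("most" in printed)              # plain search on the ligature version
print("most" in searchable(printed))  # search on the normalized copy
print(searchable(printed))
```

NFKC also rewrites other compatibility characters (superscripts and full-width forms, for example), so the normalized copy is probably best kept alongside the transcription for searching rather than used as a replacement for it.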

aghague commented 3 years ago

Here are my current thoughts on the “Markdown vs. HTML” issue:

Markdown versions are very quick to produce, much quicker than HTML, and I can see Markdown becoming my default way of dealing with text – whether transcribed or created from scratch. Any major formatting I may wish to do would come later, but I do not see it as an inconvenience since it’s how I work anyway even when using a WYSIWYG editor.

Having said this: the Markdown I would use would not be “pure”, as it were, but lightly hybridised to allow indents and other blank spaces and, most importantly, code comments that would allow me to document my coding choices and/or list outstanding issues.

My interest in this particular text is mainly cultural and literary, so fully preserving the layout of the original text or the exact letterforms used is not a priority for me. As a Victorianist, I am used to working on different editions of a given source and, unless I’m investigating a multimodal text or a book history issue, the layout of the pages is largely irrelevant to the projects I am working on. What I value most is legibility and interactivity – the ability to resize the text, to annotate, to perform automated searches, and so on.

Nevertheless, I do think there is value in seeing how the text was originally displayed and how the layout and letterforms may have changed over time, and so I think that including links to digital facsimiles or to the facsimiles themselves should be attempted wherever possible.