Traversing the Tree and Overlapping Hierarchies

ebeshero / DHClass-Hub

a repository to help introduce and orient students to the GitHub collaboration environment, and to support DH classes.

GNU Affero General Public License v3.0


Traversing the Tree and Overlapping Hierarchies #739

Closed ebeshero closed 4 years ago

ebeshero commented 4 years ago

Here is our first Discussion assignment for the semester:

The reading:

Read Gabrielle Kirilloff, “<Traversing_the_Tree/>”
Check out the (very short) article, “Frankenstein novel analyzed” and scroll through Wendell Piez's conference talk and images for the Balisage Markup Conference 2014. If you like, you can take a close look at his LMNL code of Frankenstein on GitHub. Special note of interest: Gabi Kirilloff was a student in a digital humanities course at Pitt like the one you are taking, and she originally wrote <Traversing_the_Tree/> for a seminar paper assignment in another class.
The discussion prompts:
What perspective does Kirilloff provide on the kinds of XML markup we are learning, the history and context of hierarchical markup?
What problems does hierarchical markup pose for encoding documents?
- What ideas do Kirilloff or Piez present for how to deal with these in code, and how effective or problematic might these be?
- Is it possible to write XML to "get around" the problems raised in these pieces? What's lost (or gained) in making XML's hierarchical structure deal with overlap?
Consider the examples of overlapping hierarchies that Kirilloff and Piez present to us: Which of these did you find especially interesting? Are there good ways to "model" overlapping hierarchies with code?

The discussion is a homework exercise and worth credit. Your post should make specific reference to passages in Kirilloff's essay, and reflect on those passages. You should make at least two substantial posts to fully contribute to the discussion. Note: You (as an individual) do not have to respond to every one of the discussion prompts, but our class as a whole should cover them all. You might want to reply to at least two of the prompts in the list above. Raising questions is encouraged, and so is responding to each other, but responding should do more than simply say, "yes, I agree." A good response will add something new to the conversation, or help promote more discussion.

As you're drafting your comments, see if you can apply "Markdown" formatting if you'd like to use bold or italics or make a list, form a link, add an image, etc. Follow the link to "Styling with Markdown is supported" (which you can always find at the bottom left of an Issue write screen) for an orientation to GitHub's Markdown.

alnopa9 commented 4 years ago

@bobbyfunks @robftg @Bryant-LettucePrime @biancamaginley @benjaminc2020 @lmcneil7 @smdunn921 @amberpeddicord @ChinoyIndustries Reading & Discussion post to work on over the weekend! 😄

ebeshero commented 4 years ago

Okay--let me get this started! There seem to be a lot of readings to look at here, but the main one is the one by Gabi Kirilloff discussing XML tree structures. She was a student instructor (like our student instructor team now) at the Pittsburgh campus and she wrote this up based on her reflections on XML structure and some research she did about the early days of markup in the 1990s. I like assigning this early in the semester because Gabi was reflecting on the XML tree that you're learning how to code, and talking about some problems--or potential problems--we can find with it. I'll be modeling some of this in class tomorrow. The issue has to do with things that overlap and what happens to them when we make markup fit into a hierarchy with just one root element.

In your very first XML assignments, you're just getting used to creating a hierarchy at all, so I definitely understand if this is a bit complicated to take in! But it all begins with the concept of a tree:

<root>
    <branch n="1">a <twig>twig <leaf>with something coming out</leaf> and maybe also a <flower>blooming bud</flower></twig> inside the branch</branch>
    <branch n="2">another branch, <twig>with more structure inside</twig></branch>
    <branch n="3">yet another branch, <twig>with its own structure inside</twig></branch>
</root>
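
To see how deep each element sits in a tree like this one, here is a minimal sketch, assuming Python's standard-library `xml.etree.ElementTree` parser. The helper name `element_levels` is made up for illustration, and the sample is written with a proper `</root>` closing tag so that it parses.

```python
import xml.etree.ElementTree as ET

# The sample tree, closed with </root> so that it is well-formed.
SAMPLE = """<root>
    <branch n="1">a <twig>twig <leaf>with something coming out</leaf> and maybe also a <flower>blooming bud</flower></twig> inside the branch</branch>
    <branch n="2">another branch, <twig>with more structure inside</twig></branch>
    <branch n="3">yet another branch, <twig>with its own structure inside</twig></branch>
</root>"""


def element_levels(element, level=1):
    """Yield (level, tag) for an element and its descendants, in document order."""
    yield level, element.tag
    for child in element:
        yield from element_levels(child, level + 1)


levels = list(element_levels(ET.fromstring(SAMPLE)))

# Print an indented outline: one line per element, indented by its level.
for level, tag in levels:
    print("    " * (level - 1) + f"level {level}: <{tag}>")
```

The outline it prints is the same "sideways tree" you see in an XML editor's outline view: one root, and every other element tucked inside exactly one parent.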

In class on day 1 we talked about how the Tree structure sort of looks sideways in an XML document like this. There's a root element wrapping around the whole thing, and inside, elements that form a structure, and create a hierarchy. We could say level 1 is the <root> element itself containing the whole document. And level two in this example would be each of the <branch> elements. There's a level 3 at the level of the <twig> (nested inside <branch>), and level 4 inside that of <leaf> and <flower>. This kind of structure is a tree because of that one root, and it's also called an Ordered Hierarchy of Content Objects (or OHCO). OHCO is orderly and tidy, easy to see in the outline view of any well-formed XML. Of course we love OHCO, but...

...But in the real world, as Kirilloff points out, documents are more complicated than that, and we can find structures that defy this tidy organization. (It happens in poetry all the time, when a poet writes a sentence to run over the ends of lines.) What kinds of examples do you see in these readings of overlap that complicates a hierarchical structure for XML? To answer my own question above, yes we can write XML to get around the overlap problem...this might be sort of cheating somehow, but look for examples of that, and let's talk about them!
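
As a rough illustration of the overlap problem, here is a sketch in Python using the standard-library `xml.etree.ElementTree` parser. The two lines of verse and the element names (`poem`, `line`, `sentence`, `sentenceStart`, `sentenceEnd`) are invented for this example and do not come from the readings. The first string lets a sentence run over a line ending, so the tags cross; the second marks the same sentence boundaries with empty elements instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical verse: the <sentence> opens inside one <line> and closes in the next,
# so the tags cross each other.
OVERLAPPING = """<poem>
<line>The day is done. <sentence>And now the night</line>
<line>comes softly in.</sentence></line>
</poem>"""

# Same text, with the sentence boundaries marked by empty (self-closing) elements.
MILESTONES = """<poem>
<line>The day is done. <sentenceStart n="2"/>And now the night</line>
<line>comes softly in.<sentenceEnd n="2"/></line>
</poem>"""


def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print("overlapping tags:", is_well_formed(OVERLAPPING))
print("empty-element markers:", is_well_formed(MILESTONES))
```

The parser rejects the first version because `</line>` arrives while `<sentence>` is still open. The second version parses, but only because the sentence is no longer an element that contains its words.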

smdunn921 commented 4 years ago

XML is an easier way of marking up a text, mainly because it's easier to customize, but ordered hierarchies can cause us to run into issues of overlapping hierarchies. This can cause problems when, for example, tagging a poem-- if you want to tag a specific word and the lines of the poem are tagged, but the word runs over multiple lines, this will cause an issue. You aren't able to tag anything that goes outside of each line. Kirilloff makes a point about self closing tags (or empty elements, as they are called in the text) being a possible solution to overlapping hierarchies, but these can have problems of their own, since it's pretty much just a way to "trick the system."

amberpeddicord commented 4 years ago

Kirilloff provides the reader with the history of XML and how it grew from SGML (or, Standard Generalized Markup Language). SGML was designed to be used on documents in law, history, or government, and was never intended to be used as a tool for humanities scholars. Additionally, it assumed that the document being encoded would have one primary hierarchy, and this hierarchy was dependent upon the genre of text. However, as Kirilloff explains, literary and humanities scholarship often has overlapping hierarchies. When a paragraph runs over several pages, or a word runs over several lines, there are instances of overlapping code that the coder must work around.

My favorite example of overlap is Piez's code for Frankenstein, which is a frame narrative story. This means that the novel is an instance of a "story-in-a-story". Because of this, there was extensive overlap in his "bubbles". And, as he explains, there are instances in his code of large elements "losing hierarchy" due to the saturation of smaller elements in the novel. To get around this, Piez experiments with LMNL (Layered Markup and Annotation Language), which is similar to XML but relies on arbitrary text ranges.
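
To make the idea of ranges over a text more concrete, here is a small sketch in Python. It is not LMNL syntax and not Piez's code. The `Range` class, the `overlaps` helper, and the two lines of verse are all invented for illustration. The point is only that ranges stored as start and end offsets can cross each other freely, which nested XML elements cannot do.

```python
from dataclasses import dataclass

# Hypothetical two-line verse; the second sentence runs across the line break.
TEXT = "The day is done. And now the night\ncomes softly in."


@dataclass(frozen=True)
class Range:
    name: str
    start: int  # offset of the first character
    end: int    # offset just past the last character


def overlaps(a, b):
    """True when two ranges cross without one fully containing the other."""
    if a.end <= b.start or b.end <= a.start:
        return False  # the ranges do not touch at all
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return not (a_in_b or b_in_a)


# Work out the offsets from the text itself rather than counting by hand.
line_break = TEXT.index("\n")
second_sentence_start = TEXT.index("And")

line1 = Range("line", 0, line_break)
line2 = Range("line", line_break + 1, len(TEXT))
sentence1 = Range("sentence", 0, second_sentence_start - 1)
sentence2 = Range("sentence", second_sentence_start, len(TEXT))

print(repr(TEXT[sentence2.start:sentence2.end]))
print("line 1 crosses sentence 2:", overlaps(line1, sentence2))
print("line 1 crosses sentence 1:", overlaps(line1, sentence1))
```

Here the first line and the second sentence cross, and nothing breaks: each range simply records where it starts and stops in the text.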

ChinoyIndustries commented 4 years ago

One very general thought I'd like to tack on to "Traversing the Tree":

Kirilloff goes to quite a lot of trouble, early in the article, to describe and emphasize the dangers of how the process of encoding itself adds a layer of interpretation, on the part of the scholar, that necessarily affects how the document would be read and utilized by others. It seems to me that while this is something that should always be taken into account, it's actually a very positive aspect of encoding.

One of the best things about Digital Humanities is that the digital documents we produce are living documents--that is, they can be modified, taken apart and put back together by whoever wants to work with them just by downloading a copy (in a perfect copyleft world anyhow). We aren't constrained by putting things into printed matter, publishing them in a particular way that remains fixed for as long as the physical document exists. A living document can grow and change.

What's important, above all else, in working with encoded texts is to always consider the purposes for which we actually want that text--how we want to analyze it, what kinds of data we might want to extract from it. Good stewardship of texts, however, also involves considering what purposes other researchers and scholars might find for our texts. If we build an XML document (or any other sort of document) to search for one particular aspect of a text we're looking for, it may have very little value to other researchers even if we publish it online. Keeping options open and considering all the applications of a text is key.

Thankfully though, because digital documents are living documents, we don't have to worry about providing everything perfectly for everyone--anyone with enough understanding of how we code things can change that encoding to provide new options for seeking out data in a text. As long as we can keep in mind that the encoding of an existing document is colored by how it was intended for use by its creator, we then know we aren't limited by the information they've chosen to tag.

lmcneil7 commented 4 years ago

Gabrielle Kirilloff's "<Traversing_the_Tree/>" provides insight into the history of digital humanities and emphasizes how meaningful the relationship between scholar and mark-up is to the creation of SGML (Standard Generalized Markup Language), XML (eXtensible Markup Language) and OHCO. First, she began by asking an essential question: why even bother with coding? She gave insight into criticism or arguments made against digital humanities and provided context to what coding means. Without getting into XML or SGML, she was able to show that computing tools have been essential to us for a while, and there's no reason why coding is any different. She also mentioned how encoding could use analysis for detailing the relationship between the scholar and the mark-up and its effect on each other. Thus, the creation of SGML and XML.

Her focus is on providing us with history for a better understanding of coding itself. For instance, not many people use SGML anymore, and while it might seem there's no reason to figure out why, Kirilloff creates a map showing how SGML is the reason that humanists created XML. It's the weaknesses and limitations of SGML that led to XML, but knowing about both gives more perception into the minds of the humanists who designed them. Not to mention OHCO is the creation of SGML and XML's inability to convey hierarchies. It's being able to communicate the history without trying to replace everything that history has taught us.

One of the problems that XML fixes is overlapping hierarchies. SGML's design focuses on a logical, one-minded hierarchy. The one-minded hierarchy worked until they found numerous occasions where hierarchies overlap; the example she uses in the text is <line>The four<word/>eng-</line>ineers<word/>. It's important because it adds more structure to documents that we mark up, such as poems.

smdunn921 commented 4 years ago

> Good stewardship of texts, however, also involves considering what purposes other researchers and scholars might find for our texts. If we build an XML document (or any other sort of document) to search for one particular aspect of a text we're looking for, it may have very little value to other researchers even if we publish it online. Keeping options open and considering all the applications of a text is key.

Yes! I agree with this! If there's only a couple of things marked up that you're using to benefit your research, then that's not going to benefit as many people as if you were to go in and consider the other possibilities for which other people might use the same XML file. For example, in @lmcneil7 and my project for Teen Titans, we marked up a couple of extra things that we didn't actually talk about or show, because we might build on them in the future, or other people might be interested in it. Sometimes we had to figure out which words to put certain tags around to not have the issue of an overlapping hierarchy, but it was thorough and we were able to find plenty of things within the texts to make for interesting research.

I feel like this, while beneficial, is only so if you are mindful and know your limits. On a smaller project, it might make more sense to NOT have all kinds of elements and attributes scattered throughout because not only can it turn into an eyesore, but then you run the risk of having to deal with the overlapping hierarchies. Bottom line is that with XML, you can always make it interesting while keeping it simple enough to not run into those issues as badly and still allow room for others to build on it.

bobbyfunks commented 4 years ago

OHCO makes sense for a general way of structuring a document but obviously has limitations, especially when it comes to more abstract and artistic works. The tree analogy works but like a tree, there isn't just one root. So having multiple root elements that can serve as themes or trends, feeding into a base root, or trunk, then branching out in different directions would make for some unwieldy code, but could offer digital humanists more interpretive tools to better dissect a piece of text. No matter what someone does with a written work is going to involve interpretation, so instead of confining people to a specific ruleset or hierarchy, allowing them more avenues to better communicate their interpretation would lead to more interesting discoveries.

lmcneil7 commented 4 years ago

@ChinoyIndustries I like your point about living documents due to the creation and evolution of coding and technology. For instance, if you wanted to compare two versions of a document or modify a script, all you would have to do is download the content. However, I don't think the printed matter is the constrained alternative.

A point that Piez makes in the article “Frankenstein novel analyzed” is that there are still issues in coding that are difficult to overcome; the obstacle he describes is text size, for instance, comparing a novel with a poem. XML gives us an insightful way to compare them by using overlapping hierarchies, but using the printed matter provides us with the information that coding might leave out. Finding a way to combine both the coding and the printed matter would lead to an enormous living document. Some examples of this are the websites we created. We coded and marked up files using XML, indicating what we changed or found important while also highlighting the importance of the text in its original form. An even more specific example would be the metadata of the comics.

ebeshero commented 4 years ago

@lmcneil7 and @smdunn921 I've been meaning to talk to you about this, and the whole class may be interested too. There's a community of markup practitioners (XML coders) working on comic books, and they created a special set of XML tags for it called CBML (Comic Book Markup Language). It's adapted from the XML language of the TEI (Text Encoding Initiative) that I mentioned today in class. You should check it out here, and maybe try your hand at coding some comics material with it just as an experiment: http://dcl.ils.indiana.edu/cbml/

biancamaginley commented 4 years ago

One of the issues brought up in Traversing the Tree was what to do when two tags are placed correctly to describe the text, but are not nested correctly within each other. One of the ways to solve this problem was to simply use self-closing tags to mark the places in the text. This solved the issue with nesting, but it caused other problems. It wasn't necessarily describing the text, but was more of a bookmark. Attacking the issue this way, in my opinion, goes against the point of marking the text in the first place, and you lose the meaning behind the tags.
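
The "bookmark" effect can be seen in a short sketch, assuming Python's standard-library `xml.etree.ElementTree`. The verse and the element names `sentenceStart` and `sentenceEnd` are hypothetical, made up for this example. The empty elements parse fine, but they contain nothing, so the sentence has to be stitched back together from the text around them.

```python
import xml.etree.ElementTree as ET

# Hypothetical verse with empty elements marking where a sentence starts and ends.
MILESTONES = """<poem>
<line>The day is done. <sentenceStart n="2"/>And now the night</line>
<line>comes softly in.<sentenceEnd n="2"/></line>
</poem>"""

poem = ET.fromstring(MILESTONES)
start = poem.find(".//sentenceStart")

# The marker itself holds no text and has no child elements.
print("marker text:", start.text)
print("marker children:", len(start))

# The words of the sentence sit outside the marker: first as the text that
# follows it (its "tail"), then as the text at the start of the next line.
first_part = start.tail
second_part = poem.findall("line")[1].text
sentence = first_part + " " + second_part
print("reassembled sentence:", sentence)
```

Nothing in the tree says that those two pieces of text belong together. Only the matching `n` values on the markers, and code written to look for them, carry that meaning.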

benjaminc2020 commented 4 years ago

Kirilloff provides us with the perspective that XML is a type of markup language, and one of the most prominent tools used to make text "machine-readable". Kirilloff then goes into the history of how XML is a product of SGML, and how XML has a subset, TEI. In regards to hierarchical markup, Kirilloff gives us the positive view of how the tree-based model works very well and can often lead scholars to notice new and unusual features of a text.

On the other hand, Kirilloff also gives us a window into some of the issues presented with the OHCO-based system, such as tangled tags that end up overlapping hierarchies. To remediate this issue, Kirilloff suggests using "empty elements". While it is possible to write XML code to get around this issue, I personally see these approaches as more hindering than helping. There is value to being short and concise when writing code. That being said, it may be necessary to re-evaluate the initial need to add a tag that overlaps hierarchies when it is possible.

biancamaginley commented 4 years ago

Another thought that I had while reading Traversing the Tree was a comparison of XML vs HGML. HGML was meant to be used with large law and government projects, so it was assumed that the genre the text had would cause it to have the same hierarchy as others in the same genre. XML is more flexible with its usage and hierarchies. When I was trying to understand the differences and the uses, my thoughts drifted to categorization. Trying to categorize a person isn't just checking a yes or no box for a few questions. People are complicated and intricate, as are texts. Just because a text can be categorized under the same genre doesn't mean that the markup is going to be the same, just like with people. Even if you're from the same town or ethnic group, there is more to consider before you slap a label on it. I like the thought of XML being a form of art that cannot be copied from one document to another.

benjaminc2020 commented 4 years ago

> OHCO makes sense for a general way of structuring a document but obviously has limitations, especially when it comes to more abstract and artistic works. The tree analogy works but like a tree, there isn't just one root. So having multiple root elements that can serve as themes or trends, feeding into a base root, or trunk, then branching out in different directions would make for some unwieldy code, but could offer digital humanists more interpretive tools to better dissect a piece of text. No matter what someone does with a written work is going to involve interpretation, so instead of confining people to a specific ruleset or hierarchy, allowing them more avenues to better communicate their interpretation would lead to more interesting discoveries.

Text is very much multipurpose and extensive in the ways in which it can be interpreted. So I agree completely. Encoders shouldn't be confined to a specific ruleset when there are multiple ways to interpret things and digest the writer's or author's intention. When writing XML, I believe this is important to take into consideration and especially important to be careful of. Often I can find myself catering too much to the idea that XML is to make text machine-readable, and can neglect adding more of a human-minded approach. I find there to be value to both ends of this spectrum, but difficulty in balancing them out.

ebeshero commented 4 years ago

@biancamaginley Thanks for bringing HGML to our attention! I wasn't aware of this markup language before (and I thought you might be referring to HTML, which is markup for web browsers, or Hypertext Markup Language), but I was surprised to see two different kinds of HGML out there, from around the years 1998-2000:

Hypertext Guideline Markup Language: This one is used for medical documents, which seems like what you might be referring to, since it's meant for institutional documents.
Hyper Graphics Markup Language: which seems like an early version or alternative version of Scalable Vector Graphics or SVG, (which is markup used to draw shapes made of vectors). We'll be making some SVG in this class!

bryant-bolyen commented 4 years ago

In Kirilloff's piece, the example of Overlapping Hierarchies confusing the Ordered Hierarchy of Content Objects in The Unrhymable Word: Orange by Willard Espy illustrates what sort of care must be taken to identify proper content objects. 'Line' isn't a bad one, I don't think, but 'word' would evidently be a colossal mistake and I'd like to think no one seriously marking up that poem would consider using it. Since the poem isn't really concerned with the meaning of the terms as much as it is the sounds identified, perhaps trying to denote rhymes or sounds is a more useful application of markup.

As for OHCO as a concept unto itself, maybe I'm betraying a kind of boring, conservative tendency inside me, but I really kinda like it. I think viewing language through this prism offers useful insights into the structure of communication and conceptual meaning, as well as all the ways it may be broken: a bit like music theory but for conscious thought. House of Leaves is offered as another wrench thrown in the entire schema, but I disagree. Even if disagreements between Content Objects lead to Overlapping Hierarchical heresy, multiple different coders may mark up House of Leaves multiple different ways using multiple different Hierarchies - and a comparative analysis of these interpretations may itself lead to some sort of meta-interpretation outside the tyranny of the XML editor. I think this philosophy on markup is much more elegant, streamlined, and generally sexier than quite a few of the things I've done in XML assignments thus far: inferring which of Byron's children he may be referring to in particular passages, for example. I think it's just more fun and challenging to see what's already obviously communicated in the text in an original way than it is to add to the interpretation with paratext.

bryant-bolyen commented 4 years ago

> What's important, above all else, in working with encoded texts is to always consider the purposes for which we actually want that text--how we want to analyze it, what kinds of data we might want to extract from it. Good stewardship of texts, however, also involves considering what purposes other researchers and scholars might find for our texts. If we build an XML document (or any other sort of document) to search for one particular aspect of a text we're looking for, it may have very little value to other researchers even if we publish it online. Keeping options open and considering all the applications of a text is key.

YES. HOWEVER.

I think a case may be made that every text will likely carry an entire library of potential XML iterations, and studying the possible ways a piece of text may potentially be interpreted is a rapidly maturing discipline that is likely going to dominate the future of discourse, communication, the written word, and content objects. Good stewardship of texts, therefore, may involve categorizing and filing away an ever growing number of OHCOs for the same piece of writing, and determining the specific properties, uses, and failures of each digital method of interacting with this text.
