erc-dharma / project-documentation

DHARMA Project Documentation
Creative Commons Attribution 4.0 International

Referencing pieces of text #307

Open michaelnmmeyer opened 1 month ago

michaelnmmeyer commented 1 month ago

This is a discussion of the referencing issue I alluded to in our mail exchanges about the new release of the EGD.

The purpose is to define a machine-actionable reference system for pieces of text: verses, lines, etc. Given a reference in some defined format (e.g. "line 5" or "face A, line 5 to line 6"), the machine should be able to locate the corresponding text in the XML file, and, optionally, to extract it.

I assume we only need this reference feature for the edition division. In this context, the following elements might bear a @n:

<milestone type="pagelike">
<pb>
<lb>
<milestone> # in gridlike partitions
<div type="textpart">
<ab>
<p>
<lg>
<l>

Referring to gridlike <milestone>s does not seem useful, so I ignore this case. The EGD does not prescribe numbering <ab> and <p>, but I include them in the discussion nonetheless, in case people need that; currently, we have fewer than a dozen cases where they bear an @n.

We need a notation that can be processed by a machine, but I assume we want it to look fairly natural nonetheless (as in the examples above: "face B, line 5 to line 7", etc.) Since the format of @n is not restricted ("A" can represent a textpart division or a milestone, for instance), it is necessary to make each unit explicit, as in "face A, line 1" instead of "A1", "line A1", etc.

Let us review each element and describe what references to it would look like. The difficulty is to find unit names that look natural enough but that also unambiguously refer to a given XML element with a given set of attributes.

<div type="textpart">

If the <div> has a @subtype, use it as unit. Otherwise, use a unit named "part". Thus:

<div type="textpart" subtype="item" n="A">
  -> "item A"
<div type="textpart" n="A">
  -> "part A"

Now, textpart divisions might have a heading (declared with <head>), which is supposed to be displayed instead of the one that would have been generated otherwise. Thus,

<div type="textpart" subtype="item" n="A"><head xml:lang="eng">Frontal Face</head>

... would result in "Frontal Face" instead of "Item A". Since "Item A" is not displayed, it is not possible for the reader to tell, without looking at the XML, what corresponds to the reference "item A".

To address this, we can either display "item A" in some way, or use the heading as reference, and generate references like "Frontal Face, line 1", etc. The first solution seems preferable to me, because, for the second to work, headings need to be unique, and this is not prescribed by the EGD.

<p> and <ab>

Use "paragraph" as unit:

<p n="1">
  -> "paragraph 1"
<ab n="1">
  -> "paragraph 1"

This assumes that p/@n and ab/@n are unique among all <p> and <ab> elements in a given division.

<lg>

The @n of verses is displayed in Roman numerals, so it seems natural to do the same in references. We would have:

<lg n="1"/>
  -> "verse I"
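
A minimal sketch of the Roman-numeral conversion this implies, assuming that lg/@n always holds a positive Arabic integer (the thread does not state this). The names `to_roman` and `verse_reference` are made up for illustration.

```python
def to_roman(number: int) -> str:
    """Convert a positive integer below 4000 to an upper-case Roman numeral."""
    if not 0 < number < 4000:
        raise ValueError(f"cannot represent {number} as a Roman numeral")
    numerals = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
        (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
        (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ]
    parts = []
    for value, symbol in numerals:
        # Take as many copies of the largest remaining symbol as fit.
        count, number = divmod(number, value)
        parts.append(symbol * count)
    return "".join(parts)


def verse_reference(n: str) -> str:
    """Build the display form of a reference to <lg n="...">.

    Assumption: @n is an Arabic integer; int() raises ValueError otherwise.
    """
    return f"verse {to_roman(int(n))}"
```

With this, `verse_reference("1")` gives "verse I" and `verse_reference("25")` gives "verse XXV".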

<l>

We cannot use "line" as unit, because this would better fit <lb>. We cannot use "pāda" either, since many <l> represent hemistiches. We could use "verse line" as in:

<lg n="1"><l n="a"/></lg>
  -> "verse I, verse line a"

This is verbose, however.

<milestone type="pagelike"> and <pb>

For milestones, use the provided @unit as unit; for <pb>, use the unit "page". Thus:

<milestone type="pagelike" unit="block" n="A">
  -> "block A"
<pb n="A">
  -> "page A"

This assumes that <pb> is equivalent to <milestone type="pagelike" unit="page">.

<lb>

Use "line" as unit.

<lb n="5">
  -> "line 5"
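
The element-to-unit rules reviewed above can be sketched as a single lookup. This is only an illustration of the proposal, not existing project code: `reference_unit` and `FIXED_UNITS` are hypothetical names, and the function returns the raw @n, leaving display conventions such as Roman numerals to a later step.

```python
import xml.etree.ElementTree as ET

# Units that do not depend on attributes, as proposed in this post.
FIXED_UNITS = {
    "p": "paragraph",
    "ab": "paragraph",
    "lg": "verse",
    "l": "verse line",
    "pb": "page",
    "lb": "line",
}


def local_name(element):
    """Return the tag name without any "{namespace}" prefix."""
    return element.tag.rsplit("}", 1)[-1]


def reference_unit(element):
    """Return (unit, n) for a referable element, or None if it is not referable."""
    n = element.get("n")
    if n is None:
        return None
    name = local_name(element)
    if name == "div" and element.get("type") == "textpart":
        # Use @subtype when present, otherwise the generic unit "part".
        return (element.get("subtype", "part"), n)
    if name == "milestone" and element.get("type") == "pagelike":
        unit = element.get("unit")
        return (unit, n) if unit else None
    if name in FIXED_UNITS:
        return (FIXED_UNITS[name], n)
    # Gridlike milestones and all other elements are ignored.
    return None
```

For example, `reference_unit(ET.fromstring('<pb n="A"/>'))` yields `("page", "A")`.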

Conclusion

For the above to work, we need to make sure that the values of div/@subtype and milestone/@unit do not coincide in the same inscription; otherwise, the reference would be ambiguous. This is not enforced currently.

Likewise, div/@subtype and milestone/@unit should not have a value that is used as unit elsewhere; more specifically, they should not have the value "paragraph", "verse", "page", "line", "verse line".
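
The two constraints above could be checked along these lines. This is a hedged sketch: `unit_name_problems` and `RESERVED_UNITS` are hypothetical names, and the check looks at one edition element at a time, which is my reading of "in the same inscription".

```python
import xml.etree.ElementTree as ET

# Unit names that the proposal reserves for fixed elements.
RESERVED_UNITS = {"paragraph", "verse", "page", "line", "verse line"}


def local_name(element):
    """Return the tag name without any "{namespace}" prefix."""
    return element.tag.rsplit("}", 1)[-1]


def unit_name_problems(edition):
    """Return a sorted list of (kind, name) pairs describing problematic unit names."""
    subtypes, units = set(), set()
    for element in edition.iter():
        name = local_name(element)
        if name == "div" and element.get("type") == "textpart":
            if element.get("subtype"):
                subtypes.add(element.get("subtype"))
        elif name == "milestone" and element.get("type") == "pagelike":
            if element.get("unit"):
                units.add(element.get("unit"))
    problems = []
    # Same token used as div/@subtype and as milestone/@unit.
    problems += [("clash", name) for name in subtypes & units]
    # Token that collides with a unit name reserved for another element.
    problems += [("reserved", name) for name in (subtypes | units) & RESERVED_UNITS]
    return sorted(problems)
```

An empty list means that neither constraint is violated in the given edition.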

For representing hierarchies, I propose we use a comma (e.g. "part 5, line 2"), as in normal references. For representing ranges, however, I do not think we can get away with using a dash (e.g. "line 2-5"), because this might mess with the format of @n. I thus suggest we use the explicit format "$unit $n1 to $unit $n2", as in "line 2 to line 5". I think this can be parsed without ambiguity.
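
To make the parsing claim concrete, here is a rough sketch of a parser for this notation. It rests on assumptions the proposal does not guarantee: @n values contain no whitespace and no comma, and the separator " to " never occurs inside a unit name or an @n. `Step`, `Reference` and `parse_reference` are hypothetical names.

```python
from typing import NamedTuple, Optional, Tuple


class Step(NamedTuple):
    unit: str  # e.g. "face", "line", "verse line"
    n: str     # the value expected in @n


class Reference(NamedTuple):
    start: Tuple[Step, ...]
    end: Optional[Tuple[Step, ...]]  # None when the reference is not a range


def parse_step(text: str) -> Step:
    """Split "verse line a" into unit "verse line" and n "a"."""
    # The last whitespace-separated token is @n; everything before it is the unit.
    unit, separator, n = text.strip().rpartition(" ")
    if not separator or not unit.strip():
        raise ValueError(f"expected '<unit> <n>', got {text!r}")
    return Step(unit.strip(), n)


def parse_path(text: str) -> Tuple[Step, ...]:
    """Parse a comma-separated hierarchy such as "face A, line 5"."""
    return tuple(parse_step(part) for part in text.split(","))


def parse_reference(text: str) -> Reference:
    """Parse a single location or a range written with " to "."""
    start, separator, end = text.partition(" to ")
    return Reference(parse_path(start), parse_path(end) if separator else None)
```

Under these assumptions a bare "A1" is rejected with a ValueError, since no unit can be separated from the number.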

danbalogh commented 1 month ago

This seems to be a direct continuation of #270, which I think everybody should at least skim before commenting on the present issue.

I think we can all live without machine-actionable references to gridlike milestones, which would complicate the reference system very much (see e.g. here).

We never planned to number <p> and <ab> elements and I don't think we should start doing so now. If anyone disagrees, please speak up now and state your reasons :) Otherwise, having an @n on a <p> or <ab> in the edition division is an encoding error - if you can give me a list of where this occurs, I can have a look at the files. (Note that numbered paragraphs are of course mandatory in the translation [except for very short inscriptions] and permitted in the commentary.)

@michaelnmmeyer , what I don't understand here is the purpose for which you want to define the terms/units to use in references, such as "part" and "line". My best guess is that you are talking about generating display for encoded references, so that e.g. an encoded reference to a particular point in an edition would be displayed using these terms. Is that right? If not, or this is not the only thing you have in mind, then please clarify: why do we need a rigorous set of terms and what is wrong with e.g. referring to something displayed as "Frontal Face" using a reference involving "item A"? Are you for example talking about free-text queries that would be parsed by machine to jump to the requested point? I think that would be overkill. (Incidentally, if your example <div type="textpart" subtype="item" n="A"><head xml:lang="eng">Frontal Face</head> is from a file, then it is wrong; a "face" of an object is not an "item", which means "physically distinct objects". If it's just something you wrote as an illustration of the technicalities, then fine, but if it's from a file, then the encoder should be instructed to correct it.)

My comments on the rest of your post may be partly off because of the above lack of my understanding of the purpose.

For textpart divs without @subtype, I have no objection to the term "part". I'm also OK with displaying "Part n" in the auto-generated headings for such textparts (if there are any textpart divs in our corpus that have neither a subtype nor a head), and with displaying "Part n. Head" or "Subtype n. Head" for textparts with a head, without or with subtype respectively. Implementing this should alleviate your concern that "it is not possible for the reader to tell, without looking at the XML, what corresponds to the reference "item A"". But @arlogriffiths may have a different opinion on this, since he and his team deal with cases like "South Doorjamb" which, when encoded as a textpart, has @subtype="item" @n="S" but the desired display is just "South Doorjamb" and not "Item S. South Doorjamb" (I'm basing this on resolved EGD comments of May and June 2020, which actually concerned milestone labels). I would also be OK with making it explicit in the EGD that textpart heads must be unique. I think everybody would have taken this implicitly for granted anyway, as there would be no point to labelling your textparts if you used non-unique labels.

As I said above, I don't think we should introduce numbering for p and ab elements, so there will be no referencing of such elements.

For stanzas, displaying Roman numerals is fine. I'm also perfectly happy with "verse", but @arlogriffiths is likely to object to this, since by his definition the English term "verse" can also mean "one line of poetry" and he rigorously uses "stanza" for "a cohesive group of lines". So perhaps use "stanza" instead - though since I don't think I have ever seen "verse" used in the sense of "line" in Indological literature, while "verse" in the meaning "stanza" is ubiquitous in the same literature (and far more common than "stanza"), we might perhaps reconsider that.

In referring to verse lines, I agree that "verse line" is cumbersome. If we are only talking about the display of encoded references, then I think displaying nothing (just the @n) should be fine. Nobody will ever refer to a verse line without first identifying the verse (stanza), and the usual form of such references is simply "verse I a" or "verse XXV ab". (Whether to use a space between the stanza number and the <l> number, or to use nothing, or to use a comma and a space, is indifferent to me.)

For pagelike milestones, the solution should be the same as that for textparts, with @unit analogous to the @subtype of textparts, and <label> analogous to the <head> of textparts. The case is a little bit simpler since textparts are permitted without subtype, but milestones are not permitted without unit. However, the concerns about things like "south doorjamb" apply here too (actually, I think most, perhaps all, such cases have ultimately been encoded with milestones, not textparts), so referring just to the unit and n raises the same problems as referring to subtype and n in textparts does.

For <lb>, yes, line.

For uniqueness across div/@subtype and milestone/@unit: I again don't understand the purpose for which you need this. If my assumption that you are only talking about generating display is correct, then I don't think this is much of a problem. The reference would be correctly encoded to a certain div or to a certain milestone, and the machine action would be based on the encoding, so there would be no processing difficulty. Editors (encoders) would never in my opinion use the same token for div/@subtype and milestone/@unit unless they then used unique values of @n across these, since they intuitively understand that these serve for segmenting and labelling the text, and they would not use the same heading twice in a printed edition. But if I am naively wrong about that and once in ten thousand editions we might potentially have a reference that is ambiguous for the human reader, I think we can live with that, since the ambiguity would only be present in the display, and not in the machine action. Or am I missing something and you want to use these terms for something other than display? At any rate, I have no objection to making it explicit in the EGD that if an edition includes both textpart divs and pagelike milestones, and the editor deems it best to use the same token for the head of the former as for the label of the latter, then the @n of each of these must be unique.

I'm also OK with explicitly forbidding "paragraph", "verse" (or "stanza"), "page", "line" and "verse line" (if we keep it) in div/@subtype and milestone/@unit.

I agree with using commas to separate the elements of the hierarchy (except possibly for verse lines, see above), and with verbose display using "to" instead of hyphens.

michaelnmmeyer commented 1 month ago

The @n attribute is (legitimately?) used with p and ab in critical editions; I am preparing for that, even though there is no display yet.

This whole referencing thing stems from Amandine's project. She wants a way to refer to verses with hyperlinks, so that you can navigate between her table of verses and occurrences of these verses in our inscriptions. I am trying to make this more generic to allow hyperlinks not just to verses, but also to other pieces of text (divisions, lines, etc.).

This requires some kind of URI-like notation. There is a standard one that would work, namely XPath, but I doubt people will want to use that. For instance, "face B, line 5" could be encoded as div[@type="textpart" and @subtype="face" and @n="B"]//lb[@n="5"]. I am just trying to find a more palatable notation.

I believe it would be convenient if this notation looked like a "normal" reference, but this does not really matter from my side. It could as well look like /textpart[B]/line[5], etc. What matters is that the notation allows me to locate the matching element in the XML document.
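
As an illustration of "locating the matching element", here is a rough sketch that turns (unit, n) pairs into a path expression and evaluates it with Python's standard library. It rests on several simplifying assumptions: the XML carries no namespace (real TEI files do, so namespace handling would have to be added), @n contains no quote characters, and any unit other than "line", "page" or "part" is taken to be a div/@subtype. Pagelike milestones are left out because they are not containers. ElementTree supports only a subset of XPath without `and`, hence the chained predicates. All function names are made up.

```python
import xml.etree.ElementTree as ET

# Units that map to a fixed, empty element.
ELEMENT_FOR_UNIT = {"line": "lb", "page": "pb"}


def step_to_path(unit, n):
    """Translate one (unit, n) pair into a path step with chained predicates."""
    if unit in ELEMENT_FOR_UNIT:
        return f"{ELEMENT_FOR_UNIT[unit]}[@n='{n}']"
    if unit == "part":
        # Limitation: this also matches textparts that do have a @subtype,
        # since ElementTree offers no way to test for a missing attribute.
        return f"div[@type='textpart'][@n='{n}']"
    return f"div[@type='textpart'][@subtype='{unit}'][@n='{n}']"


def to_path(steps):
    """Join the steps with the descendant axis, starting at the context node."""
    return "." + "".join("//" + step_to_path(unit, n) for unit, n in steps)


def locate(edition, steps):
    """Return all elements under `edition` that match the reference."""
    return edition.findall(to_path(steps))
```

For "face B, line 5", `to_path([("face", "B"), ("line", "5")])` produces `.//div[@type='textpart'][@subtype='face'][@n='B']//lb[@n='5']`.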

danbalogh commented 1 month ago

OK, so if I understand you correctly, we're basically talking about creating a new "code language" for references, that will have to be formulated rigorously by the person who uses it, and will be parsed by the machine. If devising this system does not take up an inordinate amount of time, does not require an inordinate degree of revision in the established encoding practice and the existing files, and is likely to be used by others beyond Amandine, then I'm all right with that. But if any of these three conditions are not met, then I think we (and especially the PIs) should seriously consider whether we really need to go there instead of, as you say, using XPath, which requires extra effort and learning on the part of the person(s) who will be encoding such references, but does not affect anyone else.

As for numbering paragraphs, I'm of course OK with doing so in critical editions; I was not aware that this was done there. I'm likewise OK with referring to both <p> and <ab> elements as "paragraph n", though of course those who number such elements will then need rigorously to apply the same consecutive set of numbers to both (e.g. p1, p2, ab3, p4 etc... and not p1, p2, ab1, p3 etc., even though the former seems counterintuitive to me). [At this point I raise again the possibility that we just stop using <ab> altogether and change all existing <ab> elements to <p> in our code. The distinction we make between p and ab is slightly fuzzy and not really useful.]

I still do not think we should introduce paragraph numbering for inscription editions. It is not clear to me if anyone has suggested that we do so, perhaps for Amandine to be able to refer to paragraphs from her table. If this is the case, then I think that instead of changing established projectwide practice for the sake of an individual project, it is the latter that could adapt and use what is already there, for instance line numbers, perhaps in pairs, pointing to the line in which a segment relevant to her begins, and also pointing to the line in which that segment ends. This would actually be a more accurate reference system than referring to paragraphs, since the prose passages of interest to her will in many cases be small parts of long paragraphs.
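
To illustrate the line-pair idea, here is a rough sketch that collects the text running from one <lb> to the end of another. It assumes that lb/@n is unique within the element passed in, and it ignores apparatus, break="no" and other markup subtleties. `lines_text` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def local_name(element):
    """Return the tag name without any "{namespace}" prefix."""
    return element.tag.rsplit("}", 1)[-1]


def lines_text(edition, first, last):
    """Return the text from <lb n=first> up to the <lb> that follows line `last`."""
    chunks = []
    current = None      # @n of the line currently being read
    collecting = False  # True once line `first` has started
    done = False        # True once the line after `last` has started

    def visit(element):
        nonlocal current, collecting, done
        if done:
            return
        if local_name(element) == "lb":
            if collecting and current == last:
                done = True  # the line after `last` begins here
                return
            current = element.get("n")
            if current == first:
                collecting = True
        elif collecting and element.text:
            chunks.append(element.text)
        for child in element:
            visit(child)
        # Text following the element belongs to the current line as well.
        if collecting and not done and element.tail:
            chunks.append(element.tail)

    visit(edition)
    # Collapse runs of whitespace left over from the XML layout.
    return " ".join("".join(chunks).split())
```

Because the traversal follows document order rather than element nesting, a segment may start in the middle of one paragraph and end in the middle of another.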

I am also slightly worried that if she (or anyone) starts referring to specific parts of our editions, i.e. essentially applying standoff markup to them, then what happens if our editions change? What if next month I revise one of my editions by splitting a previous long paragraph into two shorter ones? What if I get access to better visual documentation than before, and realise that a poorly legible passage originally encoded as prose was in fact verse, so that stanza numbering has to change in the file from that point onward? The only solution to this problem that I can see is for Amandine to create a fork of the repositories for her references, or to include some method of versioning in those references. Neither is ideal, and we can perhaps just live with the risk that such changes may happen (after all, we do the same when referring to stanza X of an inscription in a print publication). But someone needs to consider such eventualities.

michaelnmmeyer commented 1 month ago

As for paragraph numbering in diplomatic editions, I do not think anyone suggested it. I wrote a self-reminder to talk about it, but I do not remember the reason.

You point out a major issue in the last paragraph. For Amandine's case, I planned to save, for each reference, a commit hash plus the verse referred to, so as to detect potential modifications in numbering. This is not great, to say the least, but the alternative is to store multiple revisions of our texts in the database, which would bring too much complication. I did that in the beginning, but abandoned the idea soon after.
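
Roughly, the stored record and the staleness check could look like the sketch below. Everything in it is hypothetical: the field names, the `is_stale` helper, and the choice to compare whitespace-normalised text are illustrative guesses, not the actual database design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StoredReference:
    inscription_id: str  # identifier of the inscription file
    reference: str       # e.g. "verse I"
    commit: str          # commit hash at which the reference was recorded
    text: str            # text of the verse as it stood at that commit


def normalize(text: str) -> str:
    """Collapse whitespace so that re-indenting the XML does not count as a change."""
    return " ".join(text.split())


def is_stale(stored: StoredReference, current_text: Optional[str]) -> bool:
    """Report whether the reference no longer points at the same text.

    `current_text` is what the reference resolves to in the latest revision,
    or None if it no longer resolves at all.
    """
    if current_text is None:
        return True
    return normalize(current_text) != normalize(stored.text)
```

A stale reference only signals that something changed since the recorded commit; deciding where the verse went still needs a human or a text search.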

In any case, maintaining working hyperlinks across revisions is a lot of (manual) work. I never do that in my own projects, precisely because the extra work does not seem worth the effort. If the original text is cited, finding its location does not take much time. Maybe it would not hurt much to just refer to the inscription id? To discuss with Amandine.

manufrancis commented 1 month ago

I concur with Daniel. If this was conceived specifically for Amandine and given that indeed the paragraph structure may change after revision, I also think we should not start numbering <p> in editions of inscriptions. As long as Amandine can collect instances of formulaic stanzas thanks to the parallel search engine (https://dharmalekha.info/parallels), with inscription ID and location in the inscription (<lb>), I think we are good. As for the introductory sentence that often precedes formulaic stanzas, I have let her know that she would most probably have to collect them manually.

arlogriffiths commented 1 month ago

sorry, overwhelmingly busy. I concur with Manu's response and ask for patience so I can try to respond in due course to matters on which an explicit reaction has been requested from me.

danbalogh commented 3 weeks ago

I think this discussion hasn't been quite sorted out. At any rate, I have one additional observation here: the above statements about the need to make page, pagelike milestone and line numbers unique within any division need to be qualified. When textpart divs are present, the uniqueness is a requirement within each textpart, but not within the div containing the textparts. I noticed when I opened https://dharmalekha.info/texts/INSVengiCalukya00091 that I now get a warning that pb and pagelike milestone elements do not have unique @n within this division (the edition div). The structure of this text is as follows:

textpart A: seal
  no subdivisions
textpart B: Plates
  pb 1r
  pb 1v, etc...
textpart C: Palimpsest
  pb 1r
  pb 1v, etc...

The identical pb numbers in textparts B and C need to be permitted, since they are in fact the very same pages (on which traces of an earlier inscription can be made out). But identical page or milestone numbering in textparts may be desirable in other cases too, so uniqueness must be enforced on the textpart level only.
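
A uniqueness check scoped to textparts might be sketched as follows. It is not the validator that produces the warning mentioned above; `duplicate_pages` is a hypothetical name, and the choice to count each unit separately (so that "page 1" and "face 1" do not clash) is my assumption. Elements are counted wherever they occur, inside or outside <p> and <ab>.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(element):
    """Return the tag name without any "{namespace}" prefix."""
    return element.tag.rsplit("}", 1)[-1]


def is_textpart(element):
    return local_name(element) == "div" and element.get("type") == "textpart"


def page_key(element):
    """Return "unit n" for <pb> and pagelike milestones, else None."""
    n = element.get("n")
    if n is None:
        return None
    if local_name(element) == "pb":
        return f"page {n}"
    if local_name(element) == "milestone" and element.get("type") == "pagelike":
        return f"{element.get('unit')} {n}"
    return None


def duplicate_pages(edition):
    """Map each scope label to the page-like numbers repeated within that scope."""
    problems = {}

    def check_scope(scope, label):
        counts = Counter()

        def walk(element):
            for child in element:
                if is_textpart(child):
                    # A nested textpart opens a scope of its own.
                    check_scope(child, f"textpart {child.get('n')}")
                    continue
                key = page_key(child)
                if key is not None:
                    counts[key] += 1
                walk(child)

        walk(scope)
        duplicates = sorted(key for key, count in counts.items() if count > 1)
        if duplicates:
            problems[label] = duplicates

    check_scope(edition, "edition")
    return problems
```

With the structure shown above, textparts B and C may both contain pb 1r and pb 1v without any duplicate being reported.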

michaelnmmeyer commented 3 weeks ago

This was intentional, addressed in 4c29818.

danbalogh commented 3 weeks ago

Did you mean unintentional? Anyway, I'm still getting the same error message on 00091.

michaelnmmeyer commented 3 weeks ago

I will have to write something more complicated. I expected <pb> only to occur within a <p> or a <ab>.

danbalogh commented 3 weeks ago

<pb/> can and must be outside <p> or <ab> for the first and last blank pages of copperplate sets, EGD §3.5.2. There's also another special case where this is permitted, but I do not know if it occurs in the corpus: lost medial plates as per §5.4.8.

arlogriffiths commented 3 weeks ago

yes, we certainly have such cases in our corpus. here's an example, though I am not sure the lost medial plates have been correctly encoded: https://dharmalekha.info/texts/INSIDENKTerep_II.