erc-dharma / project-documentation

DHARMA Project Documentation
Creative Commons Attribution 4.0 International

Physical structure and display #270

Open michaelnmmeyer opened 4 months ago

michaelnmmeyer commented 4 months ago

I propose a few assumptions and clarifications to improve the mechanical processing of inscriptions' physical structure. This concerns the elements <milestone>, <pb> and <lb>.

I assume there are three types of physical elements: pagelike, linelike and gridlike. In the physical display:

So we have something like:

(pagelike)

(linelike) one two three four (linelike) five six (gridlike) seven eight

(pagelike)

(linelike) nine ten (gridlike) eleven

Now:

This clarifies the display, but still does not allow me to figure out the elements' hierarchy. For instance, if I have:

<milestone n="1" type="pagelike" unit="block"/>
<pb n="X"/>
<milestone n="A" type="pagelike" unit="fragment"/>

I cannot tell, in the general case, how these elements nest, so we could have, among other solutions:

1. Block
X. Page
A. Fragment
1. Block
   1X. Page
   1A. Fragment
1. Block
   1X. Page
       1XA. Fragment

If we want to have a "real" hierarchy between physical elements, the simplest solution I can think of is to use a numbering scheme that makes it possible to tell unambiguously how the physical elements fit into each other. For instance, if we have:

<milestone n="D" type="pagelike" unit="block">
<milestone n="D$A" type="pagelike" unit="face">
<lb n="1">
<milestone n="a" unit="column">
<milestone n="b" unit="column">
<lb n="2">
<milestone n="a" unit="column">
<milestone n="b" unit="column">
<milestone n="D$B" type="pagelike" unit="face">
<lb n="1">

I can tell, just by looking at @n, that the structure is:

<milestone n="D" type="pagelike" unit="block">
   <milestone n="D$A" type="pagelike" unit="face">
      <lb n="1">
         <milestone n="a" unit="column">
         <milestone n="b" unit="column">
      <lb n="2">
         <milestone n="a" unit="column">
         <milestone n="b" unit="column">
   <milestone n="D$B" type="pagelike" unit="face">
      <lb n="1">

I am not sure yet how to make numbering unambiguous.
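A minimal Python sketch of how composite @n values of this kind could be read mechanically, assuming `$` is the level separator as in the example above. The separator and the function names are illustrative only, not an agreed convention, and the sketch covers pagelike milestones only, since the line and column numbers in the example do not carry a prefix.

```python
from typing import Optional

def nesting_depth(n: str, sep: str = "$") -> int:
    """Depth implied by a composite @n: 'D' is level 0, 'D$A' is level 1."""
    return n.count(sep)

def parent_of(n: str, sep: str = "$") -> Optional[str]:
    """Return the @n of the enclosing unit, or None for a top-level unit."""
    head, found, _ = n.rpartition(sep)
    return head if found else None

# Print the pagelike milestones of the example as an indented outline.
for n in ["D", "D$A", "D$B"]:
    print("   " * nesting_depth(n) + n)
```

With such a scheme the nesting is recoverable from @n alone, without any knowledge of which @unit ranks above which.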

danbalogh commented 4 months ago

The summary is spot on. But why ask for a hierarchy? These are pointlike empty elements that represent transitions in the text. Sure, there is an implicit hierarchy that actually manifests in a typical text: pagelike is higher than linelike. A gridlike hierarchy is always independent of the page-line hierarchy: if you have something like the three examples under EGD §3.6.6, it does not really make sense to ask whether the word "hole" is in "line 1 of column/block/fragment A" or in "column/block/fragment A of line 1" - it's a grid: column/block/fragment A does not contain the whole of line 1, and line 1 does not contain the whole of column/block/fragment A. The word is in the intersection of column/block/fragment A and line 1, but the two are just as independent of each other as the rows and columns of a spreadsheet.

Because the text proceeds linearly from line to line, and is interrupted on its way by columns/blocks/fragments (so that the "contents" of a line are always a contiguous chunk of text, whereas the "contents" of a column/block/fragment are never), we conceive of the former as hierarchically superior to the latter, and include milestones for the beginning of a particular column/block/fragment in every line, rather than encoding the beginning of a particular line in every column/block/fragment.

Anyway, when one of these elements is conceived of as a virtual container (e.g. a line of text), then the contents begin right after the tag, and end right before the next tag of the same kind or, if one comes sooner, the next tag of a higher hierarchical level. In this sense, column/block/fragment being subordinate to line, we cannot speak of "the contents of a column/block/fragment" (meaning e.g. "a hobbit. Not a | ends of worms | are, sandy hole | to eat: it was a" for block C of EGD Example 3.6.6.B), because that would not be a contiguous chunk of text; instead, we can only speak of "the contents of the part of a particular column/block/fragment that falls within a particular line" (e.g. "a hobbit. Not a" for block C "of" line 1 of the same example).

Headings and labels are to be displayed as you say in your summary. (And in logical and full view, pagelike and linelike elements are also displayed inline, i.e. without a new paragraph and with inline labels. But this is already implemented.) As for numbering, we certainly don't want to enforce a hierarchy in numbering such elements. The EGD has suggestions, but allows plenty of leeway, because the numbering scheme for any given text depends both on the physical nature of the inscription and on the conventions of the subfield - e.g. Indonesianists want a different numbering scheme for copper plate lines than Indianists. The only structural element that really is a container is, as you know, <div type="textpart"> (the "boxlike" partition in the EGD's terms), and all of the milestone-class elements are hierarchically subordinate to their textpart div. Note that as per the EGD's requirements, line (and page) numbers may (and are normally expected to) re-occur in different textpart divs of a single file, but they must always be unique within a textpart and within a textpart-less document; pagelike must always have unique numbers even if there are textparts involved, while gridlike milestones must always have unique numbers as far as the identification of a certain grid component is concerned (but each of these "unique" numbers will be likely to recur in several or all lines). As far as I am aware, we do not need and have never planned to need any machine actions that would treat these elements as virtual containers (e.g. "extract the text of line 23"), so there is no need to say that such a hierarchy is anything more than implicit. Should this be desired after all, I think we can safely establish what I said above: that line hierarchy is always subordinate to page hierarchy and ignores grid hierarchy. 
So the "contents of line n" (or "contents of line n of textpart div A") can be defined as the text in the XML file from the <lb n="n"/> tag to the next closest pagelike or linelike tag or the end of the textpart div or the end of the edition div. But in my opinion if we need any machine actions involving these elements, they would be like "jump to the beginning of line n of the file" (or "jump to the beginning of line n of textpart div A"), which should already be possible, or at worst, "find the first occurrence of word abc after the beginning of line n (of textpart div A)".
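The definition of "contents of line n" given here can be sketched in Python with the standard library's ElementTree. This is only an illustration under stated assumptions: the elements are un-namespaced (real TEI files would need the TEI namespace on every tag name), and `walk`, `is_boundary`, `line_contents` and the sample data are hypothetical names, not part of any existing DHARMA tool.

```python
import xml.etree.ElementTree as ET

def walk(elem):
    """Yield ('tag', element) and ('text', string) events in document order."""
    yield ("tag", elem)
    if elem.text:
        yield ("text", elem.text)
    for child in elem:
        yield from walk(child)
        if child.tail:
            yield ("text", child.tail)

def is_boundary(elem):
    """True for linelike and pagelike tags, which end the current line."""
    if elem.tag in ("lb", "pb"):
        return True
    return elem.tag == "milestone" and elem.get("type") == "pagelike"

def line_contents(container, n):
    """Text from <lb n="n"/> to the next linelike or pagelike tag, or to the container's end."""
    collecting = False
    parts = []
    for kind, value in walk(container):
        if kind == "tag":
            if collecting and is_boundary(value):
                break
            if value.tag == "lb" and value.get("n") == n:
                collecting = True
        elif collecting:
            parts.append(value)
    return "".join(parts)

# Gridlike milestones are ignored, so they do not interrupt a line.
sample = ET.fromstring(
    '<div type="edition"><pb n="1"/><lb n="1"/>one two '
    '<milestone type="gridlike" unit="column" n="a"/>three'
    '<lb n="2"/>four<pb n="2"/><lb n="3"/>five</div>'
)
print(line_contents(sample, "1"))
```

Passing a textpart div instead of the edition div as `container` gives the "line n of textpart div A" variant.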

So as far as I can see, there is no need to worry any more about the hierarchy. If you need something more absolute for a technical reason, please explain a bit more, but even then it may not be possible to establish something universally.

In a typical case, there will be at most one pagelike segmentation, exactly one linelike segmentation and at most one gridlike segmentation in a document, and even if all three of these levels are present, they will be clearly hierarchical. But there may, very rarely (like in 0.01% of inscriptions, at a guess), be special cases, such as your example with <milestone n="D" type="pagelike" unit="block"><milestone n="D$A" type="pagelike" unit="face">. Does something like this actually occur in the corpus? I cannot really conceive of a situation where this would be needed, so I would expect that it does not, i.e. that there is never more than one kind of pagelike segmentation (either <pb/> or <milestone type="pagelike"/> with a particular @unit, but not both, and not two kinds of @unit on <milestone type="pagelike"/>). But theoretically, the situation is not impossible. So if there are any instances of such a thing in the corpus, I should have a look at them to see if they make sense, or if they are just an encoding error. If there aren't any (or if there are, but I have seen them and decided they were not correctly encoded), I would not mind adding to the EGD that encoding more than one kind of pagelike division within a document (or textpart div) is forbidden, and that anyone who feels the urge to do so should first consult us for advice.

A slightly more likely case of mixed hierarchies (say, 0.1% of inscriptions?) would be a document with more than one kind of gridlike segmentation, for example an inscription whose original layout included a gridlike setup (e.g. written across the faces of a polygonal column), and the object is also broken, so we have two or more fragments whose boundaries are not the same as the boundaries of the layout grid. See my attempt at ASCII art below, where .-s stand for the writing, which is laid out in two virtual columns (so column 1 and 2 of line 1 must be read before proceeding to line 2, otherwise the columns would be a pagelike partition, not a gridlike one), and the x-s represent a line of fracture running diagonally across the stone.

......  ...x....
......  ..x.....
......  .x......
......  x.......
...... x .......
......x  .......
.....x.  .......
....x..  .......
...x...  .......

As all gridlike segmentations are independent of other hierarchies, there is again no way to fit the grids into the hierarchy of lines (and pages, if relevant), except by arbitrarily saying that both grids are below the lowest level of the (page-)line hierarchy. As for the hierarchisation of the two grids relative to each other, I see no other way than to arbitrarily choose which to put first in the code:

<lb n="1"/><milestone unit="column" n="a"/><milestone unit="fragment" n="1"/>
or
<lb n="1"/><milestone unit="fragment" n="1"/><milestone unit="column" n="a"/>

The last point in EGD §3.6.3 anticipates this scenario, but only says that different numeration schemes must be used for the two grids; it does not say anything about the order in which their milestones should be encoded. Is there any technical reason why one should be enforced in preference over the other? Or why we should altogether avoid both and try to come up with yet another extremely complex custom encoding solution for an extremely rare special case?

michaelnmmeyer commented 4 months ago

I am trying to address two things: the physical display, and the referencing of elements with @n. The second point has further implications, I will write about it later on.

For the physical display, I think we are good as long as I can assume that gridlike elements must always be represented in the same way, whatever their @unit, and that pagelike elements must always be represented with the same type of heading (same size, etc.).

I have spotted a single valid example of the use of several gridlike elements: DHARMA_INSEIAD00039. For the use of several pagelike elements, we have DHARMA_INSCIK00090, DHARMA_INSCIK00523 and DHARMA_INSCIK00601.

Right now, <pb/> elements are in the same font as the text body, and do not appear in the table of contents. See e.g. DHARMA_INSVengiCalukya00099. But other pagelike milestones are formatted like <div type="textpart"> headings, and appear in the TOC. See e.g. DHARMA_INSCIK00601.

I guess pagelike milestones are not supposed to appear in the TOC, to avoid confusion with <div type="textpart"> headings. I also guess I should represent them differently from textpart headings, for the same reason. What formatting should I use, then? Should I make pagelike milestones look like headings, or use the same font for all milestones?

danbalogh commented 4 months ago

For the physical display, I think we are good as long as I can assume that gridlike elements must always be represented in the same way, whatever their @unit, and that pagelike elements must always be represented with the same type of heading (same size, etc.).

Yes, I would be perfectly happy with that. For pagelike elements, my suggestion is to get rid of the ⎘ icon and instead show "page" for <pb/> and the value of @unit for pagelike milestones. Alternatively, just use the same icon (and no text) for all of them, and the tooltip could then remain "Page start" for the former and "[unit] start" for the latter. (Or perhaps the first of these suggestions for physical display, and the second for logical.)

I have spotted a single valid example of the use of several gridlike elements: DHARMA_INSEIAD00039. For the use of several pagelike elements, we have DHARMA_INSCIK00090, DHARMA_INSCIK00523 and DHARMA_INSCIK00601.

I'll have to take a look at these when I'm back to work after the 8th of April.

Right now, <pb/> elements are in the same font as the text body, and do not appear in the table of contents. See e.g. DHARMA_INSVengiCalukya00099. But other pagelike milestones are formatted like <div type="textpart"> headings, and appear in the TOC. See e.g. DHARMA_INSCIK00601. I guess pagelike milestones are not supposed to appear in the TOC, to avoid confusion with <div type="textpart"> headings. I also guess I should represent them differently from textpart headings, for the same reason. What formatting should I use, then? Should I make pagelike milestones look like headings, or use the same font for all milestones?

I certainly think that all pagelike elements should be handled in the same way: same display formatting and same behaviour for TOC. If we're talking only about the physical display, then I think it's all right to make these look and behave more like headings, i.e. set off from text font (size, bold, whatever) and to include them in the TOC. I see that textpart headings are <h3> elements, so I guess pagelike headings could be <h4> or likewise <h3> with a styling that makes them appear lower-level, and a lower-level TOC entry. In logical and full, both should be shown inline.

There's a reason why these are conceived of as "pagelike". It really helps to think of these in terms of page breaks in a book; textpart divs would then be analogous to chapters in a book. The page numbers are non-intrusive and try not to interrupt the text, since they are at points where the text as an abstract thing does not break. They are only physical, and would not necessarily be identical in a reprint of the book. Conversely, chapter headings mark the beginnings of major sections of the text and would have to be at the same points in a reprint. But when we're looking at a diplomatic edition where the physicality of the text is foregrounded, it's OK to treat these in a way similar to (but still subordinate to) chapter headings.

michaelnmmeyer commented 4 months ago

OK, thank you.

danbalogh commented 4 months ago

Indeed, DHARMA_INSEIAD00039 looks like a correct example of multiple (superimposed, non-hierarchical) gridlike structures.

DHARMA_INSCIK00523 is in all probability incorrectly encoded: it has

The remaining INSCIK examples are more difficult because of their complexity, because I do not know what some of the objects actually look like, and because I don't understand the language. Here are my thoughts on them. These are partly for myself so that I can discuss the encoding later on with Kunthea, but I'd like you to skim these so that you and I can decide together about a policy toward hierarchical pagelike milestones. As far as display is concerned, the solutions we have discussed above should work for these too.

In DHARMA_INSCIK00090, there should definitely not be two instances of <milestone type="pagelike" unit="item" n="N"/>, which is an encoding error, but I don't know about the rest. What we have now is:

In DHARMA_INSCIK00601, we have:


And here we come back to the theoretical discussion. What I'd like you to help me decide is whether we should allow multilevel hierarchies of pagelike milestones, or forbid them outright. Given that we have only a very few cases where such a hierarchy may be desirable (I think my guesstimate of 0.01% was not far off), we may on the whole be better off if we eliminate this kind of complication from our encoding. Here are the alternatives as far as I can see them:

  1. Continue allowing hierarchical levels. Ideally, the hierarchical levels should tessellate, but we can't expect each encoder to keep that in mind. Encoders will, as the above cases show, mess them up somewhat, but this may not be a big problem for us, since display is straightforward and we certainly don't want to make the hierarchy machine-actionable (e.g. "collapse all sub-zones on face A"). The only hard encoding requirement then would be that no combination of unit and n occurs more than once.
  2. Enforce flat hierarchies. In any XML file, there must never be more than one @unit of pagelike milestone. In complex cases like the two Khmer inscriptions above, default to "zone" as the unit, and use numbers and labels to indicate the hierarchisation to human readers only.
  3. Enforce textpart divs. In any XML file, there must never be more than one @unit of pagelike milestone. In complex cases create textparts to encode a higher hierarchical level.

At the moment, I vacillate between 1 and 2; I'm not in favour of 3, since in the early days we intentionally downplayed the use of textpart divs, recommending their use in the EGD only when there is really no way to read the parts linearly as a coherent whole. What do you think?


Another theoretical question concerns the use of gridlike milestones when a grid does not affect all the lines of an inscription. I've recently had to encode a copper plate with a corner broken off, sort of like this:

.......
.......
.....x.
....x..
...x...

(contrast my earlier ASCII art above, where the fracture affected each line). So for the sake of the example, let's forget that pages are also involved, and assume that this is a stone slab with the bottom right corner broken off. Obviously, we need gridlike milestones, <milestone unit="fragment" n="a"/> and <milestone unit="fragment" n="b"/>. We certainly need a milestone for fragment b in lines 3 to 5, and a milestone for fragment a at the beginning of lines 4 and 5. But where else would you put a milestone for fragment a?

I have provisionally gone with the third solution, but I'm really uncertain about this and would appreciate your opinion.

michaelnmmeyer commented 4 months ago

For processing texts mechanically, we need to make sure our encoding remains formally verifiable. Currently, we have something that adheres to the following BNF grammar:

physical ::= page+
           | line+

page ::= '<pb/>' line*
       | '<milestone type="pagelike"/>' line*

line ::= '<lb/>' cell+
       | '<lb/>' TEXT?

cell ::= '<milestone type="gridlike"/>' TEXT?
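One way to check a document against this grammar, sketched in Python: reduce the physical markup to one letter per token and match the resulting string with a regular expression that mirrors the productions. The letter coding (P, L, G, T) and the function name `conforms` are assumptions made for this illustration; the tokenising step itself is not shown.

```python
import re

# One letter per token:
#   P = <pb/> or a pagelike milestone    L = <lb/>
#   G = a gridlike milestone             T = a run of text
LINE = r"L(?:(?:GT?)+|T?)"    # line ::= '<lb/>' cell+ | '<lb/>' TEXT?
PAGE = rf"P(?:{LINE})*"       # page ::= pagelike line*
PHYSICAL = re.compile(rf"(?:{PAGE})+|(?:{LINE})+")  # physical ::= page+ | line+

def conforms(tokens: str) -> bool:
    """True if the whole token string can be derived from 'physical'."""
    return PHYSICAL.fullmatch(tokens) is not None

print(conforms("PLTLT"))   # two lines of text on one page
print(conforms("LTPLT"))   # lines before the first page: not derivable
```

Note that, as the grammar is written, a line holds either plain text or a sequence of cells, so text placed between `<lb/>` and the first gridlike milestone of the same line (for example `LTGT`) is rejected.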

Pagelike milestones

I am in favor of not allowing multilevel hierarchies, at least until it becomes clear we cannot do without them. We already do not allow nesting of <div type="textpart">, so it would be coherent to do the same in the present case.

On the technical side, having multiple levels prevents me from determining page boundaries: milestones essentially become visual effects that are not backed up by any "real" structure, and checking for the uniqueness of @n or producing references becomes a real mess. I also imagine people will be tempted to number elements like that:

    <milestone type="pagelike" unit="zone" n="A"/>
        <milestone type="pagelike" unit="column" n="a"/>
        <milestone type="pagelike" unit="column" n="b"/>
        <milestone type="pagelike" unit="column" n="c"/>
    <milestone type="pagelike" unit="zone" n="B"/>
        <milestone type="pagelike" unit="column" n="a"/>
        <milestone type="pagelike" unit="column" n="b"/>
        <milestone type="pagelike" unit="column" n="c"/>

Not having hierarchies would also simplify numbering rules: instead of "pagelike milestones must have a unique combination (@unit, @n)", we could just have "pagelike milestones must have a unique @n, and must always bear the same @unit". In practice, we can allow the use of different pagelike @units within the same inscription, as long as it is understood that this does not imply a hierarchical relationship. I do not know if this would be useful, though.
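The stricter rule ("unique @n, always the same @unit") is easy to verify mechanically. A minimal sketch in Python, assuming the pagelike milestones of one inscription have already been collected as (unit, n) pairs; the function name and the input shape are hypothetical.

```python
from collections import Counter

def check_flat_pagelike(milestones):
    """milestones: (unit, n) pairs in document order. Returns a list of problems."""
    problems = []
    units = sorted({unit for unit, _ in milestones})
    if len(units) > 1:
        problems.append(f"more than one @unit: {', '.join(units)}")
    counts = Counter(n for _, n in milestones)
    for n, count in counts.items():
        if count > 1:
            problems.append(f"@n={n!r} occurs {count} times")
    return problems

# The tempting numbering shown above: columns restart at "a" in every zone.
tempting = [("zone", "A"), ("column", "a"), ("column", "b"), ("column", "c"),
            ("zone", "B"), ("column", "a"), ("column", "b"), ("column", "c")]
for problem in check_flat_pagelike(tempting):
    print(problem)
```

If different pagelike units were allowed to coexist without implying a hierarchy, the first check would simply be dropped and only the @n check kept.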

With this model, we would have two basic types of computer-generated references: either (page|zone|...) X, line Y (for inscriptions that have pagelike milestones), or line Y (for the others).
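A sketch of how the two reference shapes could be generated, in Python. It assumes the physical markup has been flattened into a sequence of `('page', unit, n)` and `('line', n)` events; that event format and the function name are inventions of this example.

```python
def references(events):
    """Yield one human-readable reference per line, in document order."""
    current = None  # (unit, n) of the pagelike element we are in, if any
    for event in events:
        if event[0] == "page":
            current = (event[1], event[2])
        else:
            n = event[1]
            if current:
                yield f"{current[0]} {current[1]}, line {n}"
            else:
                yield f"line {n}"

events = [("page", "zone", "A"), ("line", "1"), ("line", "2"),
          ("page", "zone", "B"), ("line", "3")]
print(list(references(events)))
```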

Gridlike milestones

Among your three solutions, the first one and the third can be mechanically processed without issue, but the second one might cause problems, because it changes the scope of gridlike elements from a single line to a full page.

Consider the following, for instance:

...x...
....x..
.....x.
.......
.......

If someone encoded it with the third method, viz. by only encoding the three initial lines, the last encoded fragment will have @n='b'. Now, if we decide to switch to the second solution, the encoding will become incorrect, because it will seem that fragment @n='b' runs to the end of the page, while it is @n='a' that does. On the other hand, switching to the first method would be harmless: lines that are not followed by milestones would not be associated with a fragment at all.

Overall, the third method is the most economical and probably the safest bet. If people want to annotate unbroken lines, as in the first method, they can also do that without issue.
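The line-level scope of gridlike milestones can be illustrated with a short Python sketch. It assumes a flattened token stream of `('lb', n)`, `('grid', n)` and `('text', s)` tuples, which is a format made up for this example; the point is only that every `<lb/>` resets the current fragment, so unbroken lines carry no fragment at all.

```python
def cells(tokens):
    """Yield (line, fragment, text); fragment is None outside any gridlike scope."""
    line = fragment = None
    for kind, value in tokens:
        if kind == "lb":
            line, fragment = value, None  # a new line ends the gridlike scope
        elif kind == "grid":
            fragment = value
        else:
            yield (line, fragment, value)

# One broken line followed by one unbroken line without milestones.
tokens = [("lb", "1"), ("grid", "a"), ("text", "..."), ("grid", "b"), ("text", "..."),
          ("lb", "4"), ("text", ".......")]
print(list(cells(tokens)))
```

Under a page-level scope, the reset on `lb` would be removed, and the last fragment seen would then leak into every following line, which is the problem described above.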

danbalogh commented 4 months ago

Hmm, thanks for the detail and clarity. This will take me some time to process and I write as I think. I'm not familiar with BNF grammar, but I've managed to work out what the notation means. I'm happy that you have formulated this and agree that it covers the "normal" cases. But the principal question remains: why do you think we need a rigorous hierarchy if we are not planning to do any machine action that needs it? Also,

having multiple levels prevents me from determining page boundaries

Supposing that we keep allowing more than one @unit of pagelike milestone, a "page" is only interpretable with respect to one of those units. The boundaries of any page will be FROM one milestone with a particular unit TO the next milestone with the same unit OR to the end of the container (edition div or textpart div).

milestones essentially become visual effects that are not backed up by any "real" structure,

Yes, but as I keep saying, this is what they are meant to be. Why do we need there to be any real structure? Am I missing something?

and checking for the uniqueness of @n or producing references becomes a real mess.

I don't think so. The @n of lines must already be unique within every container (edition or textpart). Likewise, the @n of any milestone with a particular unit must be unique. We are not looking to make line numbers unique only within a certain "page": if someone wants to use the system of restarting line numbering on every page, then they must use composite line numbers (A1, A2... B1, B2... etc.). Unless I'm missing something important, checking uniqueness within the containing div should be simple. Same for references, as I said in my first post in this thread. Referencing points (the exact location of the milestone) should be sufficient for our purpose and I don't think we'll ever want references to spans (from the milestone to the next milestone of the same kind or the end of the container), but if we do, they should be feasible so long as it is understood that IF multiple pagelike hierarchies are present, a reference can only use one of those and not a combination.
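A per-container uniqueness check along these lines might look as follows in Python. This is a sketch under assumptions: un-namespaced elements, a hypothetical function name, and gridlike milestones deliberately left out because their numbers are expected to recur from line to line.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def duplicate_numbers(container):
    """Return the (kind, @n) keys that occur more than once inside one container."""
    keys = []
    for elem in container.iter():
        if elem.tag == "lb":
            keys.append(("line", elem.get("n")))
        elif elem.tag == "pb":
            keys.append(("page", elem.get("n")))
        elif elem.tag == "milestone" and elem.get("type") == "pagelike":
            keys.append((elem.get("unit"), elem.get("n")))
    return [key for key, count in Counter(keys).items() if count > 1]

textpart = ET.fromstring(
    '<div type="textpart" n="A"><lb n="1"/>x<lb n="2"/>y<lb n="1"/>z</div>'
)
print(duplicate_numbers(textpart))
```

The function is meant to be called once per textpart div, or once on the edition div of a document without textparts.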

I also imagine people will be tempted to number elements like that

They have to resist the temptation. The EGD says and has always said that the @n of pagelike milestones must be unique.

That said, I'm still inclined to forbid multiple pagelike hierarchies. It's not that I'm against it, it's just that I'd like to know what we might gain by doing so, and so far the slight simplification of the rules (and the resulting reduced chance of human error) is the only real advantage that I see; I do not see a significant gain in processing.

Comments on that are welcome. I'll come back to the question of gridlike milestones later.

michaelnmmeyer commented 4 months ago

It is not that we need a rigorous hierarchy, but that there is already one, albeit implicit. For instance, the rule:

The boundaries of any page will be FROM one milestone with a particular unit TO the next milestone with the same unit OR to the end of the container (edition div or textpart div).

does not apply to a structure like this:

<milestone type="pagelike" unit="item" n="L">
    <milestone type="pagelike" unit="zone" n="L1">
    <milestone type="pagelike" unit="zone" n="L2">
<milestone type="pagelike" unit="item" n="N">
<milestone type="pagelike" unit="item" n="S">

If you do not know that L2 is a subunit of L, you will consider that L2 runs up to the end of the inscription.
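The effect can be reproduced with a few lines of Python that apply the quoted rule literally: each milestone runs to the next milestone with the same unit, or to the end of the container. The input shape (a list of (unit, n) pairs) and the function name are assumptions of this sketch; spans are expressed as positions in that list.

```python
def spans_by_same_unit(milestones):
    """Return {n: (start, end)}; end == len(milestones) means 'end of container'."""
    spans = {}
    for i, (unit, n) in enumerate(milestones):
        end = len(milestones)
        for j in range(i + 1, len(milestones)):
            if milestones[j][0] == unit:
                end = j
                break
        spans[n] = (i, end)
    return spans

seq = [("item", "L"), ("zone", "L1"), ("zone", "L2"), ("item", "N"), ("item", "S")]
spans = spans_by_same_unit(seq)
print(spans["L"], spans["L2"])
```

Item L correctly ends where item N begins, but zone L2, having no later zone milestone to stop it, swallows items N and S as well.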

Likewise, milestone order is significant. For instance, the above is not the same as:

<milestone type="pagelike" unit="item" n="L">
    <milestone type="pagelike" unit="zone" n="L1">
<milestone type="pagelike" unit="item" n="N">
    <milestone type="pagelike" unit="zone" n="L2">
<milestone type="pagelike" unit="item" n="S">

Now, I can treat milestones as black boxes that have no relationship whatsoever, but I see from a mile away people asking me things like "can you put my 'item' milestone headings in a larger font?", "can you add more space between my 'item' milestones?", etc. Even for such seemingly simple tasks, I need to parse the text into a tree-shaped structure.
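A sketch of such a tree-building step in Python, using a stack. It assumes that a ranking of units is supplied from outside (here "item" above "zone"), which is precisely the information the encoding itself does not provide; `build_tree`, `shape` and the `rank` mapping are hypothetical names.

```python
def build_tree(milestones, rank):
    """milestones: (unit, n) pairs; rank: unit -> level, 0 being the outermost."""
    root = {"n": None, "children": []}
    stack = [(-1, root)]  # (level, node); the root is never popped
    for unit, n in milestones:
        level = rank[unit]
        while stack[-1][0] >= level:
            stack.pop()
        node = {"n": n, "children": []}
        stack[-1][1]["children"].append(node)
        stack.append((level, node))
    return root

def shape(node):
    """Compact nested-dict view of a tree, for comparison and printing."""
    return {child["n"]: shape(child) for child in node["children"]}

rank = {"item": 0, "zone": 1}
first = [("item", "L"), ("zone", "L1"), ("zone", "L2"), ("item", "N"), ("item", "S")]
second = [("item", "L"), ("zone", "L1"), ("item", "N"), ("zone", "L2"), ("item", "S")]
print(shape(build_tree(first, rank)))
print(shape(build_tree(second, rank)))
```

The two milestone sequences shown above yield different trees: in the second one, zone L2 ends up under item N.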

This complicates the above grammar a lot. We end up with something like:

physical ::= pages
           | line+

pages ::= page+
        | zone+
        | block+
        | face+

page ::= '<pb/>' line*
       | '<pb/>' page_zone+
       | '<pb/>' page_block+
       | '<pb/>' page_face+

page_zone ::= '<milestone type="pagelike" unit="zone">' line*
            | '<milestone type="pagelike" unit="zone">' page_zone_block+
            | '<milestone type="pagelike" unit="zone">' page_zone_face+

page_zone_block ::= '<milestone type="pagelike" unit="block">' line*
                  | '<milestone type="pagelike" unit="block">' page_zone_block_face+

page_zone_block_face ::= '<milestone type="pagelike" unit="face">' line*

page_zone_face ::= '<milestone type="pagelike" unit="face">' line*
                 | '<milestone type="pagelike" unit="face">' page_zone_face_block+

page_zone_face_block ::= '<milestone type="pagelike" unit="block">' line*

page_block ::= '<milestone type="pagelike" unit="block">' line*
             | '<milestone type="pagelike" unit="block">' page_block_zone+
             | '<milestone type="pagelike" unit="block">' page_block_face+

page_block_zone ::= '<milestone type="pagelike" unit="zone">' line*
                  | '<milestone type="pagelike" unit="zone">' page_block_zone_face+

page_block_zone_face ::= page_zone_block_face

page_block_face ::= '<milestone type="pagelike" unit="face">' line*
                  | '<milestone type="pagelike" unit="face">' page_block_face_zone+

page_block_face_zone ::= '<milestone type="pagelike" unit="zone">' line*

... and so on for all possible combinations of pagelike milestones.

danbalogh commented 4 months ago

If you do not know that L2 is a subunit of L, you will consider that L2 runs up to the end of the inscription.

THIS is what I've missed, thanks for pointing it out. So let us then provisionally agree to forbid more than one kind of pagelike milestone within a div. I assume that it is not a problem if we have something like this:

<div type="textpart" n="A"><label>Lintel</label>
    <milestone type="pagelike" unit="zone" n="L1"/><label>Left side</label>
    <milestone type="pagelike" unit="zone" n="L2"/><label>Right side</label>
</div>
<div type="textpart" n="B"><label>Doorjambs</label>
    <milestone type="pagelike" unit="item" n="N"/><label>Northern doorjamb</label>
    <milestone type="pagelike" unit="item" n="S"/><label>Southern doorjamb</label>
</div>

Please confirm this would be OK.
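The provisional rule (at most one kind of pagelike milestone within a div) could be checked with a sketch like the following, in Python with ElementTree. Assumptions: un-namespaced elements, `<pb/>` counted as the unit "page", an enclosing edition div added around the example so that it parses as one document, and hypothetical function names.

```python
import xml.etree.ElementTree as ET

def pagelike_units(div):
    """Set of pagelike units used inside one div; <pb/> counts as 'page'."""
    units = set()
    for elem in div.iter():
        if elem.tag == "pb":
            units.add("page")
        elif elem.tag == "milestone" and elem.get("type") == "pagelike":
            units.add(elem.get("unit"))
    return units

def divs_with_mixed_units(edition):
    """Return the @n of every textpart div that mixes pagelike units."""
    return [div.get("n") for div in edition.iter("div")
            if div.get("type") == "textpart" and len(pagelike_units(div)) > 1]

edition = ET.fromstring("""<div type="edition">
<div type="textpart" n="A"><label>Lintel</label>
    <milestone type="pagelike" unit="zone" n="L1"/><label>Left side</label>
    <milestone type="pagelike" unit="zone" n="L2"/><label>Right side</label>
</div>
<div type="textpart" n="B"><label>Doorjambs</label>
    <milestone type="pagelike" unit="item" n="N"/><label>Northern doorjamb</label>
    <milestone type="pagelike" unit="item" n="S"/><label>Southern doorjamb</label>
</div>
</div>""")
print(divs_with_mixed_units(edition))
```

Each textpart here uses a single unit, so the example passes even though the document as a whole contains both "zone" and "item".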

I'll then schedule a talk with Kunthea to understand how the complex Cambodian inscriptions are laid out and tell her how best to encode them. I'm going to recommend a combination of my alternatives 2 and 3 above, i.e. to prefer a flat hierarchy with just one kind of pagelike milestone, and to fall back to textpart divs if a multilevel hierarchy is essential. Does that sound all right to you? Once we are done with that, in due time I'll write up the new rules for the next release of the EGD, and may have more questions for you about this later on.

danbalogh commented 4 months ago

Re gridlike milestones: thanks for your thoughts on this. I shall then stick to the third method and prescribe it explicitly in the next EGD.

arlogriffiths commented 4 months ago

Thanks @danbalogh and @michaelnmmeyer for working this out so patiently.

As for the discussion with Kunthea, I think it might be best if I participated too, but a practical dilemma is that I will be on fieldwork in Vietnam from 13 through 28 April, so practically it would have to be tomorrow or after the 28th.

michaelnmmeyer commented 4 months ago

@danbalogh Yes, this works.

manufrancis commented 4 months ago

Yes, indeed, thanks to both of you, Daniel and Michaël!!!