erc-dharma / project-documentation

DHARMA Project Documentation
Creative Commons Attribution 4.0 International

Numbering of chapters, verses, milestones, etc. #243

Closed michaelnmmeyer closed 5 months ago

michaelnmmeyer commented 11 months ago

We want to be able to refer to precise locations in a document, as in A2, 1.1, 3r1, etc. We want these locations to be unique. For now, this is rather difficult.

Indeed, we allow both a repetitive scheme (within a <div n="A">, we have A.1, A.2, A.3, etc.) and a non-repetitive one (within a <div n="A">, we have 1, 2, 3). Since there are no particular constraints on number formats, it is not possible for a program to tell, in the general case, whether @n holds a single number or combines several. For instance, should "A1" be interpreted as [A1] (one level) or [A, 1] (two levels)? I need an explicit indication that allows me to tell unambiguously which scheme we have, or at least a single numbering convention. Maybe enforce the use of a single field separator (period, etc.)?
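To illustrate the separator idea, here is a minimal sketch in Python. It assumes a convention that does not exist yet: a period is the only level separator allowed in @n. The function name `split_n` is made up for the example.

```python
def split_n(value, sep="."):
    """Split an @n value into levels on one agreed separator (hypothetical rule)."""
    parts = value.split(sep)
    if any(part == "" for part in parts):
        # "A." or ".1" would leave an empty level, which cannot be a valid number.
        raise ValueError(f"empty level in @n value {value!r}")
    return parts


print(split_n("A.1"))  # two levels
print(split_n("A1"))   # one level: without a separator there is nothing to split
```

Under such a rule "A1" can only be read as one level, and a two-level reference has to be written "A.1".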

There is also a difficulty with the uniqueness of @ns. When dealing merely with elements that represent the logical structure and that nest within each other (<div>, <lg>, <l>, etc.), producing a reference number and checking its uniqueness is straightforward, because the scope of elements and the way they nest is explicit in the XML. But when we introduce elements that represent the physical structure (<lb>, <milestone>, <pb>), it is not clear at all.

Firstly, because the extent of these elements is unspecified. For instance, when does a <milestone unit="face"/> end? When the next milestone with an identical @unit begins? We also have <milestone unit="faces"/>.

Secondly, because it is unspecified how these elements can nest with each other and with logical elements. For instance, can a <milestone unit="zone"> contain a <milestone unit="column">? Or the reverse? Or both? Or maybe they have no relation whatsoever and should be treated independently, like e.g. <lb/> and <p>.

Generally speaking, I cannot do much with milestone-like elements, precisely because I cannot tell how they relate to each other and to the logical structure. The physical display is very fragile for this reason, and I am not even speaking about search.

danbalogh commented 10 months ago

I believe that the answers to most of these questions are there in EGD §3, in particular §3.1, §3.2, §3.5 and §3.6. It is long and complicated, but that is because the original documents we are dealing with are complicated.

> We want to be able to refer to precise locations in a document, as in A2, 1.1, 3r1, etc. We want these locations to be unique. For now, this is rather difficult.

Indeed. I don't think you need to make it a priority to refer to milestones. Referring to <lb> and <lg> elements should be sufficient for most purposes. Also, gridlike milestones (EGD §3.6) will by definition occur more than once in a document and therefore not have unique numbers - although we have just discussed an exception briefly with Arlo (#236). This exception could be made a general rule, but I do not believe it is worth the added complexity of encoding, display and reader reception.

> Indeed, we allow both a repetitive scheme (within a <div n="A">, we have A.1, A.2, A.3, etc.) and a non-repetitive one (within a <div n="A">, we have 1, 2, 3). Since there are no particular constraints on number formats, it is not possible for a program to tell, in the general case, whether @n holds a single number or combines several. For instance, should "A1" be interpreted as [A1] (one level) or [A, 1] (two levels)? I need an explicit indication that allows me to tell unambiguously which scheme we have, or at least a single numbering convention. Maybe enforce the use of a single field separator (period, etc.)?

The above seems to pertain only to line numbers (not to chapters, verses, milestones, etc.). If it was meant to pertain to something else, please clarify.

  1. Important: in complex line numbers (A1, etc.), the higher "level" corresponds to pagelike partitions (e.g. <pb/>), not to text containers (like <div>). A line numbered A1 is "line 1 on page A", and not "line 1 in container A". This essentially means that for all technical purposes that I can imagine, line numbers are always on one level. I cannot foresee any situation where, either in display or in processing, the components of a complex line number need to be separated. The line number (the value of lb/@n) is just an indivisible string for all purposes.
  2. Unless the encoder has made mistakes, line numbers have to be unique within each textpart div of an edition (EGD §3.2.2), as you observe below. Since the apparatus div of an edition has to reproduce the textpart divs of the edition division, references to a line number from an apparatus should already be interpretable: a @loc attribute in the apparatus points to an lb/@n in the edition which is a descendant of a textpart div of the same @n as the textpart div in the apparatus of which the <app> element is a descendant (none if there are no textpart divs in the apparatus). The same applies to the translation division, if it is desired to link the @n of non-stanza translation paragraphs to lines of the edition. Referring to specific line numbers of an edition from outside that edition is not on the agenda as far as I am aware; if it is or will soon be, then I don't think there is a way to implement it without also requiring reference to textpart number (where applicable). The only other way would be to enforce line number uniqueness throughout every file, which would on the one hand require revising hundreds of already encoded texts, and on the other hand require a new order of complexity in line numbers (div number included in the line numbers in addition to the already present optional complexity of page etc. number). This would not only make our code even heavier, but would also be undesirable in the case of many documents, for instance those that (like most of my texts) have a seal inscription and a text on copper plates: we want to be able to speak of line n of the main text simply as line n and not as line B.n or suchlike.
  3. Given (1) above, I see no need for the machine to be able to tell whether any line number is a simple or a complex one. The line number should be displayed in the same way whichever it is, and as far as referencing is concerned, (2) still applies: every number will be unique within a given textpart div (unless the encoder has made a mistake, which must be corrected). But if it is important for some reason, then EGD §3.2.2 says, "by default, simple line numbers shall be Arabic numerals starting from 1" - which means that any line number which contains alphabetic characters is a complex one, the alphabetic part representing the higher (pseudo-)level, and the numeric part representing the lower.
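The three points above can be sketched in Python with the standard library's `xml.etree.ElementTree`. This is only an illustration under assumptions: the sample fragment is invented and heavily simplified, nested textparts are ignored, and all function names are hypothetical rather than part of any existing DHARMA code.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI = "{http://www.tei-c.org/ns/1.0}"

# Invented, stripped-down fragment; real files carry far more structure.
SAMPLE = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <div type="edition">
    <div type="textpart" n="A"><p><lb n="1"/>foo <lb n="2"/>bar</p></div>
    <div type="textpart" n="B"><p><lb n="1"/>baz <lb n="1"/>qux</p></div>
  </div>
  <div type="apparatus">
    <div type="textpart" n="A"><listApp><app loc="2"/></listApp></div>
  </div>
</body>"""


def textparts(division):
    """Map @n to each textpart div directly under a division.
    A division without textparts counts as one unnamed part (key None)."""
    parts = {d.get("n"): d for d in division.findall(f"{TEI}div[@type='textpart']")}
    return parts or {None: division}


def duplicate_lines(edition):
    """Map each textpart @n to the lb/@n values used more than once inside it."""
    report = {}
    for part_n, part in textparts(edition).items():
        counts = Counter(lb.get("n") for lb in part.iter(f"{TEI}lb")
                         if lb.get("n") is not None)
        dupes = sorted(n for n, count in counts.items() if count > 1)
        if dupes:
            report[part_n] = dupes
    return report


def resolve_loc(edition, part_n, loc):
    """Return the <lb> that @loc points to, looking only in the edition
    textpart with the same @n; None if it is missing or ambiguous."""
    part = textparts(edition).get(part_n)
    if part is None:
        return None
    matches = [lb for lb in part.iter(f"{TEI}lb") if lb.get("n") == loc]
    return matches[0] if len(matches) == 1 else None


def is_complex(line_n):
    """Default rule only: a line number with any letter in it is complex."""
    return any(ch.isalpha() for ch in line_n)


root = ET.fromstring(SAMPLE)
edition = root.find(f"{TEI}div[@type='edition']")
print(duplicate_lines(edition))
print(resolve_loc(edition, "A", "2") is not None)
```

The line number is compared as an opaque string throughout; `is_complex` is only there in case the distinction turns out to matter.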

> There is also a difficulty with the uniqueness of @ns. When dealing merely with elements that represent the logical structure and that nest within each other (<div>, <lg>, <l>, etc.), producing a reference number and checking its uniqueness is straightforward, because the scope of elements and the way they nest is explicit in the XML. But when we introduce elements that represent the physical structure (<lb>, <milestone>, <pb>), it is not clear at all.

Could you be more specific? What is not clear? lb/@n must be unique within any <div type="textpart"> (as above). The @n of <pb> and <milestone type="pagelike"/> must be unique throughout the document (EGD §3.5.4). In the few documents that contain more than one kind of pagelike milestone, the EGD prescribes using different numeration schemes for them, so the number of each will remain unique. Situations involving pagelike partitions in more than one textpart div have, I think, not been addressed explicitly in the EGD. I don't expect them to occur with any noticeable frequency, but perhaps they do (thinking again of SE Asian stelae and doorjambs). In principle, the requirement of uniqueness for each number throughout the document applies to these as well, but if there are cases of such documents and they use non-unique numbers, then this requirement could be modified to require uniqueness only within each textpart div, just as in the case of line numbers. Finally, gridlike milestones are by default non-unique and that's that. See above. We don't need to refer to them or to do anything about them except display them as labels. However, should (in the long run) referring to such a milestone be required, the reference would have to point to every occurrence of such a milestone with a given @n, of which there will be 0 to 1 per line of the inscription.
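The document-wide rule for pagelike partitions could be checked along these lines. Again a sketch only: the fragment is invented, the function names are hypothetical, and gridlike milestones are deliberately left out because they are expected to repeat.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI = "{http://www.tei-c.org/ns/1.0}"

# Invented fragment: <pb n="1r"/> occurs twice, which the check should report.
SAMPLE = """<div xmlns="http://www.tei-c.org/ns/1.0" type="edition"><p>
<pb n="1r"/><lb n="1r1"/>a
<milestone type="pagelike" unit="face" n="A"/><lb n="A1"/>b
<milestone type="gridlike" unit="column" n="a"/>c
<lb n="A2"/><milestone type="gridlike" unit="column" n="a"/>d
<pb n="1r"/><lb n="1r2"/>e
</p></div>"""


def pagelike_numbers(root):
    """Yield the @n of every <pb> and <milestone type="pagelike"> in document order."""
    for el in root.iter():
        is_pb = el.tag == f"{TEI}pb"
        is_pagelike = el.tag == f"{TEI}milestone" and el.get("type") == "pagelike"
        if (is_pb or is_pagelike) and el.get("n") is not None:
            yield el.get("n")


def repeated_pagelike(root):
    """List the pagelike numbers that occur more than once in the whole document."""
    counts = Counter(pagelike_numbers(root))
    return sorted(n for n, count in counts.items() if count > 1)


print(repeated_pagelike(ET.fromstring(SAMPLE)))
```

If uniqueness were relaxed to hold only within each textpart div, the same count would simply be run once per textpart instead of once per document.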

> Firstly, because the extent of these elements is unspecified. For instance, when does a <milestone unit="face"/> end? When the next milestone with an identical @unit begins? We also have <milestone unit="faces"/>.

I don't think there is a need to conceive of these elements as having an extent. They are to be displayed as points, and if they are used in reference, then they are to be interpreted as points, not as anything containing text. If I am mistaken here, and it is necessary for some reason to treat such elements as virtual text containers, then yes, their extent runs up to the next element of the same kind with an identical @unit (or to the end of the edition div or textpart div when there are no subsequent elements of the same kind). [Also, gridlike milestones may be conceived of as ending at an <lb/> if one is reached before reaching the next milestone of the same unit, but this is a very small difference, since the <lb/> will be followed immediately by a milestone of the same unit. So the only difference is whether or not we conceive of the line break as "included" within the scope of the milestone.] And yes, we also have <milestone unit="faces"/>. See EGD §3.5.4.
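Should virtual extents be needed after all, the rule just described (up to the next milestone with an identical @unit, else to the end of the container) might be sketched as follows. The fragment and the function name are invented, positions are indices in the document-order list of elements (a crude stand-in for real text offsets), and the variant where a gridlike milestone ends at the next <lb/> is not implemented.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Invented fragment with two units that are tracked independently.
SAMPLE = """<div xmlns="http://www.tei-c.org/ns/1.0" type="edition"><p>
<milestone unit="face" n="A"/><lb n="1"/>x
<milestone unit="column" n="a"/>y
<milestone unit="face" n="B"/><lb n="2"/>z
</p></div>"""


def milestone_extents(container):
    """Return (unit, n, start, end) tuples, one per milestone.
    An extent ends where the next milestone with the same @unit starts,
    or at the end of the container if there is none."""
    elements = list(container.iter())
    open_by_unit = {}  # unit -> (start position, @n) of the milestone still open
    extents = []
    for pos, el in enumerate(elements):
        if el.tag != f"{TEI}milestone":
            continue
        unit = el.get("unit")
        if unit in open_by_unit:
            start, n = open_by_unit[unit]
            extents.append((unit, n, start, pos))
        open_by_unit[unit] = (pos, el.get("n"))
    # Whatever is still open runs to the end of the container.
    for unit, (start, n) in open_by_unit.items():
        extents.append((unit, n, start, len(elements)))
    return sorted(extents, key=lambda extent: extent[2])


for extent in milestone_extents(ET.fromstring(SAMPLE)):
    print(extent)
```

Each @unit is closed only by a milestone of the same unit, so in the sample the column extent runs across the boundary between faces A and B without either one having to contain the other.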

> Secondly, because it is unspecified how these elements can nest with each other and with logical elements. For instance, can a <milestone unit="zone"> contain a <milestone unit="column">? Or the reverse? Or both? Or maybe they have no relation whatsoever and should be treated independently, like e.g. <lb/> and <p>.

I fail to see the problem. Why do we even need to conceive of these pointlike empty elements as if they had content? If we do not, then the question of their nesting with each other does not arise. If for some reason this is necessary, then indeed, they are to be treated independently. Each milestone/@unit is a separate hierarchy and need not be treated as if interacting with any other hierarchies. In principle, the hierarchies could be conceived of as nested, with boxlike partitions (textpart div) at the top, then pagelike partitions (<pb/> and <milestone type="pagelike">), then line numbers, then gridlike milestones. Theoretically, a <milestone unit="zone"> could "contain" a <milestone unit="column"> or the reverse. These are both gridlike milestones, so it is unlikely that either would happen. Since gridlike milestones are the least containerlike, two gridlike milestones with different units might even overlap, for example if an inscription was written in visual columns encoded as milestones, and it was broken into several fragments (which don't respect the columns), also encoded as milestones. Theoretically, different kinds of pagelike milestones might also be "contained" within one another, so <milestone type="pagelike" unit="zone"> could also "contain" a <milestone type="pagelike" unit="column"> or the reverse - but I cannot conceive of a situation where pagelike milestones would overlap with other pagelike milestones.

> Generally speaking, I cannot do much with milestone-like elements, precisely because I cannot tell how they relate to each other and to the logical structure. The physical display is very fragile for this reason, and I am not even speaking about search.

Please be specific. What is it that needs to be done with milestone-like elements and meets difficulties? How is the physical display fragile?