erc-dharma / project-documentation

DHARMA Project Documentation
Creative Commons Attribution 4.0 International

Use of `<citedRange>` in `<bibl>` #253

Open michaelnmmeyer opened 8 months ago

michaelnmmeyer commented 8 months ago

For <citedRange>, we have two conflicting rules:

We can probably assume that something like 82, 87-90, 93, 100-105 always represents a page range, but then there are a lot of <citedRange> without a unit that look like this:

199 n. 4
CVIII
no. Ghan 1
258-9 (III)
1938-39: 71, no. B.452
2, caption of fig. 3
3, 15, 24-25 (H)
3, 4 n. 1 and 7
109 (no. 13B), 122 (no. 18A), 124 (no. 21), 128 (no. 26)
2004-05: 206 (7)
58-61, ills. 126-36
[...]

Assuming that anything that starts with a digit is a page range would still be OK in most cases, but it would be preferable to follow a convention that removes the ambiguity.
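
As an illustration of the ambiguity, here is a minimal Python sketch that separates "plain" contents (only Arabic numbers, hyphens and commas) from everything else. The pattern and the function name `is_simple_page_list` are assumptions made for this example, not part of any DHARMA tooling.

```python
import re

# Assumption: a "simple" reference is a comma-separated list of Arabic
# numbers or hyphenated number ranges, e.g. "82, 87-90, 93, 100-105".
SIMPLE_PAGES = re.compile(r"\d+(?:-\d+)?(?:,\s*\d+(?:-\d+)?)*")

def is_simple_page_list(text: str) -> bool:
    """Return True if the whole string is a plain list of numbers/ranges."""
    return SIMPLE_PAGES.fullmatch(text.strip()) is not None

# Samples taken from the lists in this thread.
for sample in ["82, 87-90, 93, 100-105", "199 n. 4", "CVIII", "258-9 (III)"]:
    print(f"{sample!r}: {is_simple_page_list(sample)}")
```

Only the first sample passes; merely testing whether the string starts with a digit would also accept "199 n. 4" and "258-9 (III)".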

Likewise, it is impossible to tell unambiguously whether the contents of <citedRange> represent a single item (page, volume, etc.)---in which case the singular form of the unit should be used (p., vol., etc.)---or several items---in which case the plural form is needed (pp., vols., etc.). It is reasonably clear for cases like 82, 87-90, 93, 100-105 (only digits, hyphens and commas), but beyond that I cannot do much. We have references like:

B/1965-1966
E.42-E.45
IR 91
1077, note 7
110 (no. 13D) = 127 (no. 24)

As a reminder, here is how each value of @unit is displayed:

@unit     singular  plural
---       ---       ---
volume    vol.      vols.
appendix  appendix  appendixes
book      book      books
section   §         §§
page      p.        pp.
item      №         №
figure    fig.      figs.
plate     plate     plates
table     table     tables
note      n.        nn.
part      part      parts
entry     s.v.      s.vv.
line      l.        ll.

arlogriffiths commented 8 months ago

I will leave it to @danbalogh to formulate a more comprehensive response, but I just want to point out that I expect that all or most of the items on your list of aberrant references

B/1965-1966
E.42-E.45
IR 91
1077, note 7
110 (no. 13D) = 127 (no. 24)

are found in files imported from the EIAD database or in files whose encoding has never been finished (e.g. the files that Marine Schoettel started to encode before dropping out as PhD student). I certainly don't think we need to take those examples into account as evidence for inadequacy of our encoding guidelines. If you can furnish me a list of files with aberrant <citedRange> encoding, then I can try to improve the encoding in all offending cases.

michaelnmmeyer commented 8 months ago

@arlogriffiths I have not noticed a particularly problematic corpus on this issue. The examples I gave come from various sources.

danbalogh commented 8 months ago

I had raised the matter of the ambiguity of <citedRange> without @unit in an EGD comment in June 2021, when the idea of using this for encoding complex citations was first introduced. The comment thread lost momentum, and the issue is still unresolved. My proposal was and is to require @unit="other" for complex references, and to allow <citedRange> without @unit only if the contents are nothing but a page number or numbers. The matter is summarised in the EGD Leftovers in the section "Unstructured references with citedRange".

On displaying singular and plural forms, I was not aware this has been (or is being) implemented for any unit other than page. (See also the earlier discussion about "s.vv" summarised in another part of the Leftovers, which may thus be superseded.) I think the importance of determining whether a plural or singular form need be used is very small relative to its costs. To explain. First, it needs great investment. A blind algorithm would not be able to do the work in most cases; for instance, among your examples, "B/1965-1966" is a (single) appendix number of a format used in almost all volumes of the Annual Report on Indian Epigraphy, so something like it will be cited in most editions of Indian inscriptions. The algorithm would thus have to use a complex list of exception formats and could not rely just on the presence of a hyphen to determine that we are dealing with a range. Second, plural units of most kinds are likely to be exceedingly rare: one hardly ever needs to refer to multiple appendices, volumes, books, etc. of a publication. And third, the matter is purely aesthetic. As far as I know, there are no plans ever to make these machine-actionable, so the only purpose of determining whether a singular or a plural is appropriate is to have a pedantically accurate display. NOTE: If we do really need the plural of appendix, I urge that it should be appendices, not appendixes.

Now, given these three considerations, I think plural unit displays should be discarded for most cases and retained only for @unit="page", or possibly page and a limited number of other units where we can reasonably expect to refer to a list or range more than once in a blue moon, AND expect a blind algorithm based on the presence of a hyphen or comma to work (e.g. note). Next, for those rare cases where a reference to a list or range of other units is needed, we have two alternatives. A, we can live with the fact that the display will be singular (which was and remains my recommendation for "s.v.", see the second Leftovers link above); or B, we can decide that in all cases where a plural is essential, the "complex reference" style should be used (without @unit or with @unit="other" depending on the first decision), where the encoder will manually enter whatever s/he sees fit.

In your list of problematic <citedRange> contents without @unit, some of the items are obscure to me, but

Finally, I notice that your list of possible units includes "line". This value is not permitted in the EGD. Do any other guidelines propose using it somewhere? Is it used anywhere in an XML file? In the same EGD discussion thread that raised using @unit="other" for complex references, I had also noted that we might want to add "line" and "stanza" as possible units, but ultimately we decided against this. Axelle may have misunderstood and added "line" nonetheless, or - since "stanza" is not on the list - perhaps "line" has been added for some other reason. At any rate, if we keep this as a permitted value of unit, then I need someone to tell me when its use is recommended, so I can add it to the EGD. But if it isn't already in use, then I'd prefer discarding it.

michaelnmmeyer commented 8 months ago

For @unit="line", there are about 15 files (only inscriptions) that use it. It appears in one example in the EGC, but nobody currently uses it in critical editions. Discarding it would thus have minimal impact.

Generally speaking, I concur with anything that would simplify the encoding of bibliographic references. It is impossible to mechanically extract useful information from <citedRange>, because reference styles and numbering systems vary too much, so the only practical purpose of <citedRange> is to format citations automatically.

Even so, the complexity/benefit ratio is very high. From what I have seen so far, I do not think we gain much consistency from using many @unit values and reference formats. Free text references like vol. VI, pp. 48-49 are easy to read and modify, so letting people use these might work just as well.

danbalogh commented 8 months ago

So it seems we are pretty much of the same mind here. We'll need @arlogriffiths to give his opinions on my suggestions above (namely: mandatorily adopting <citedRange unit="other"> for non-rigorous reference details, and handling only simple references rigorously with dedicated units, while changing multi-unit references to the non-rigorous form), as well as on whether he thinks @unit="line" is necessary or can be discarded (and deleted from the EGC example). If at the moment it occurs only in inscription editions, and we discard it, then those editions will have to be changed manually, moving the line reference outside the bibl element.

arlogriffiths commented 8 months ago

No objections from my side, except that I find the value "other" unsatisfactory for the intended use. What about alternatives like "free-form", "misc", "mixed"?

I'd favor maintaining the unit "line", as it is rather customary to refer to (pagination and) lineation of certain works.

danbalogh commented 8 months ago

Thanks, Arlo. Of the suggested values for @unit, I prefer "mixed", so let's go with that unless there is an objection. (I note, however, that it doesn't really describe the cases where there is no automatic plural, and so e.g. "Appendices" is encoded in this way. This is not a problem for me, but it may be for some.) I don't mind retaining "line". I'll now try to summarise what needs to be done.

  1. I'm adding "line" to the next EGD as a possible value of @unit and writing instructions for the use of "mixed" instead of no @unit (to be pushed soon)
  2. We'll need to change existing instances of mixed references from no @unit to @unit="mixed" in the corpus. This will need some thoughtful automation and some manual checking. I believe that all instances of <citedRange> without @unit whose contents include anything other than Arabic numerals, spaces, hyphens and commas should be changed to have @unit="mixed". If there are hundreds of these in the corpus, then checking them manually may not be feasible, but I believe there shouldn't be more than a couple of dozen. If the latter is the case, I think it would be best to flag each of the auto-changed instances for the encoder's attention. One way to do that would be to add an XML comment (e.g. <!--CHECK Reference-->) in the course of refactoring, but if @michaelnmmeyer has a better idea, then let's go by that. If Michaël can then create a list of files (preferably by corpus) that include such references, then these could be sent to the respective encoders, alerting them to go and check those citations. I can help them use Oxygen to search their corpus for the flags and talk to them about the desired way of encoding citations.
  3. We'll need to come to a decision about which permitted values of @unit are to have a smart display with plurals. Those for which this is not feasible to implement are then to be displayed mechanically in the singular, and if an encoder desperately needs a plural reference, it'll have to be encoded as a mixed type. Once this is finalised, I'll add this info to the EGD in the part where citation display and bracketing are discussed. I suggest the following specifics:

@unit     singular  action
---       ---       ---
volume    vol.      no plural
appendix  appendix  no plural
book      book      no plural
section   §         no plural OR plural §§ if contents include ", " (comma followed by space) or - (hyphen)
page      p.        plural pp. if contents include ", " (comma followed by space) or - (hyphen)
item      №         no plural
figure    fig.      no plural
plate     plate     no plural
table     table     no plural
note      n.        no plural
part      part      no plural
entry     s.v.      no plural OR plural s.vv. if contents include ", " (comma followed by space)
line      l.        no plural OR plural ll. if contents include ", " (comma followed by space) or - (hyphen)
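
The refactoring described in step 2 could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Python's standard `xml.etree.ElementTree`, assumes the files are in the TEI namespace, and the function name `flag_mixed_ranges` is made up for the example. Note that ElementTree discards existing comments and processing instructions when parsing with its default settings, so a real run over the corpus would need a round-trip-safe approach.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
CITED_RANGE = f"{{{TEI_NS}}}citedRange"
# Contents made up only of Arabic numerals, spaces, hyphens and commas.
SIMPLE = re.compile(r"[0-9 ,\-]*")

def flag_mixed_ranges(root: ET.Element) -> int:
    """Set unit="mixed" on unit-less <citedRange> elements with complex
    contents and put a CHECK comment just before each one.
    Returns the number of elements changed."""
    changed = 0
    # Snapshot the tree first, since comments are inserted while looping.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag != CITED_RANGE or "unit" in child.attrib:
                continue
            content = " ".join("".join(child.itertext()).split())
            if SIMPLE.fullmatch(content):
                continue  # plain page number(s): leave without @unit
            child.set("unit", "mixed")
            index = list(parent).index(child)
            parent.insert(index, ET.Comment("CHECK Reference"))
            changed += 1
    return changed

# Hypothetical sample, not taken from any DHARMA file.
sample = (
    f'<bibl xmlns="{TEI_NS}">'
    "<citedRange>199 n. 4</citedRange>"
    "<citedRange>82, 87-90</citedRange>"
    '<citedRange unit="page">12</citedRange>'
    "</bibl>"
)
ET.register_namespace("", TEI_NS)  # serialize without an ns0: prefix
root = ET.fromstring(sample)
print(flag_mixed_ranges(root))
print(ET.tostring(root, encoding="unicode"))
```

Encoders could then search their files for the `CHECK Reference` comment, as suggested above.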

I'll need your opinions (should we poll anyone else?) on how hard we should try to implement plurals for anything other than page. I believe that some section numbers may include hyphens, and the same may be true of complex line numbers of some form (e.g. "line A-1"), so if the presence of a hyphen in the contents prompts the plural form, then such non-range hyphens would need to be avoided. We can either formulate the guidelines to tell encoders to omit any hyphens present in the actual numeration they are referring to and use a hyphen only when referring to a range (then hope they read and remember this), or live with the occasional false positive (a plural display where a singular had been intended), or not use auto-generated plurals and instead direct encoders to use a mixed unit when they really want a plural reference. I think the simplest both to implement and to keep in mind would be to auto-generate plurals only for @unit="page" and generate only the singular for everything else.
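
For comparison, the simplest option (plural only for page) might look like the sketch below. The labels come from the table earlier in this thread; the names `display_label` and `render`, and the choice to print "mixed" contents verbatim, are assumptions made for the example rather than the actual display code.

```python
SINGULAR = {
    "volume": "vol.", "appendix": "appendix", "book": "book",
    "section": "§", "page": "p.", "item": "№", "figure": "fig.",
    "plate": "plate", "table": "table", "note": "n.", "part": "part",
    "entry": "s.v.", "line": "l.",
}
# Simplest option: only "page" has an auto-generated plural.
PLURAL = {"page": "pp."}

def display_label(unit: str, content: str) -> str:
    """Pick the label for a unit, pluralizing only where a plural is
    defined and the contents contain ", " or a hyphen."""
    plural = PLURAL.get(unit)
    if plural and (", " in content or "-" in content):
        return plural
    return SINGULAR[unit]

def render(unit: str, content: str) -> str:
    """Format a citedRange for display; "mixed" contents are shown as typed."""
    if unit == "mixed":
        return content
    return f"{display_label(unit, content)} {content}"

print(render("page", "82, 87-90"))
print(render("appendix", "B/1965-1966"))
print(render("line", "A-1"))
print(render("mixed", "vol. VI, pp. 48-49"))
```

Adding an entry such as `"line": "ll."` to `PLURAL` would enable the hyphen heuristic for that unit too, at the cost of the false positive on "A-1" mentioned above.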