erc-dharma / project-documentation

DHARMA Project Documentation
Creative Commons Attribution 4.0 International

Referencing documents, parts of documents, bibliographic entries, etc. #244

Closed michaelnmmeyer closed 6 months ago

michaelnmmeyer commented 11 months ago

This is related to #243

We are currently using a variety of reference systems. For instance:

<p n="3" corresp="#siksaguru_01.03"> (points to <p xml:id="siksaguru_01.03"> in another file)
<change who="part:axja" ...>
<persName ref="http://viaf.org/viaf/39382787">
<licence target="https://creativecommons.org/licenses/by/4.0/">
<ref>DHARMA_IdListMembers_v01.xml</ref>
<ref target="DHARMA_INSCIK00288.xml">
<idno type="filename">DHARMA_INSCIK00803</idno>
<rdg source="bib:Coedes1942_02">

The use of corresp="#siksaguru_01.03" for referencing an @xml:id in another document is problematic. An @xml:id is supposed to be unique within a single document, not at the scope of a full document collection. A typical address for part of a document is DHARMA_INSHello#location, where DHARMA_INSHello is the file name and location is an @xml:id. When referencing something in the same document, the file name is not necessary: #location suffices.

We can choose not to follow XML conventions, but in any case, we must have a referencing scheme that clearly distinguishes the file name and the location of the target element within this file. Otherwise, I basically have to process all the files in the collection to figure out where the @xml:id lies, instead of processing a single file.
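To illustrate why the separator matters, here is a minimal Python sketch of how a processor could split such a reference, assuming the filename#location convention described above. The names Target and split_reference are hypothetical and not part of any existing DHARMA tooling.

```python
from typing import NamedTuple


class Target(NamedTuple):
    filename: str  # document that holds the target element
    location: str  # @xml:id inside that document ("" means the whole file)


def split_reference(ref: str, current_file: str) -> Target:
    """Split 'filename#location' or '#location' into its two parts."""
    filename, _, location = ref.partition("#")
    # An empty file part means "the document that contains the reference".
    return Target(filename or current_file, location)


print(split_reference("DHARMA_INSHello#location", "DHARMA_INSBye"))
print(split_reference("#location", "DHARMA_INSBye"))
```

With this shape, only the one file named in the reference has to be opened to find the target element.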

Secondly, we need to be careful to avoid name clashes in references. For instance, we cannot have both DHARMA_INSHello_1.2.3 for referencing a @n=1.2.3 and DHARMA_INSHello_location for referencing an @xml:id=location, because it is not possible to tell from the reference itself whether it refers to a @n or to an @xml:id.
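The clash can be made concrete with a small Python sketch. The sample document and the candidates helper are invented for illustration; the point is only that a reference of the form DHARMA_INSHello_a matches two different elements when one has @n="a" and another has @xml:id="a".

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical document: one element is numbered "a", a different
# element carries the identifier "a".
DOC = """<div>
  <p n="a">first paragraph</p>
  <p xml:id="a">second paragraph</p>
</div>"""


def candidates(root, suffix):
    """Return every element that an underscore-style reference could mean."""
    return [
        el for el in root.iter()
        if el.get("n") == suffix or el.get(XML_ID) == suffix
    ]


root = ET.fromstring(DOC)
hits = candidates(root, "a")
print(len(hits))  # more than one hit means the reference is ambiguous
```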

danbalogh commented 11 months ago

I do not really understand what you are saying here. I have no knowledge of how @corresp is used in the project to point to specific locations in specific files, nor of how @xml:id is used for such a purpose, or of what considerations went into these practices. I do know that it was Axelle who insisted on making all of our @xml:id values unique across the entire corpus by including the filename within those IDs; I never understood why this was important. Now that you point out the problem of always having to process all files when looking for a specific ID, I understand why this is a problem, and understand all the less why Axelle thought it was a good idea. If the PIs agree that we should change to the regular convention and make IDs unique only within each document, then refer to them using #location for internal and filename#location for external references, I think it should be possible to auto-update our existing IDs thanks to the rigorous file naming conventions in use. I may be wrong here, but I think it should be possible to do this:

  1. In each file, in any instances of an @xml:id, occurrences of the current file's name plus a trailing underscore would be simply deleted. Thus, within the file DHARMA_INSHello, @xml:id="DHARMA_INSHello_1" would become @xml:id="1". At the same time, remaining instances of @xml:id containing an underscore might be flagged for human attention, but this may not be necessary.
  2. Next, in all files, in instances of attributes referencing an xml:id (but not in @xml:id attributes themselves), any occurrence of any DHARMA filename (regex mask on the basis of the file naming conventions) plus a trailing underscore would be replaced with the same filename followed by a hash mark instead of the trailing underscore, and the filename would be dropped when it is the current file's own name. Thus, within the file DHARMA_INSHello, @corresp="DHARMA_INSHello_1" would become @corresp="#1" and @corresp="DHARMA_INSBye_1" would become @corresp="DHARMA_INSBye#1".
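The two steps above can be sketched in Python as follows. This is only an outline under stated assumptions: the set KNOWN_FILES stands in for either a regex mask or a list of real filenames, and the function names are hypothetical. One caveat is worth noting: an @xml:id must be a valid XML name and so cannot begin with a digit, which means a result such as "1" would need a letter prefix or human attention; is_plausible_id is a simplified ASCII-only check for flagging such cases.

```python
import re

# Assumption: the set of existing file names is known in advance.
KNOWN_FILES = {"DHARMA_INSHello", "DHARMA_INSBye"}

# Longest names first, so a name that is a prefix of another cannot win.
_FILES = "|".join(sorted(map(re.escape, KNOWN_FILES), key=len, reverse=True))
_PREFIXED = re.compile(rf"^({_FILES})_(.+)$")


def strip_id(xml_id: str, current_file: str) -> str:
    """Step 1: drop '<current file>_' from the start of an @xml:id value."""
    prefix = current_file + "_"
    return xml_id[len(prefix):] if xml_id.startswith(prefix) else xml_id


def rewrite_pointer(value: str, current_file: str) -> str:
    """Step 2: turn 'File_loc' into '#loc' (same file) or 'File#loc'."""
    match = _PREFIXED.match(value.lstrip("#"))
    if match is None:
        return value  # not a corpus-wide ID; leave untouched
    filename, location = match.groups()
    if filename == current_file:
        return f"#{location}"
    return f"{filename}#{location}"


def is_plausible_id(value: str) -> bool:
    """Simplified ASCII-only check that a value could serve as an @xml:id."""
    return re.fullmatch(r"[A-Za-z_][A-Za-z0-9._-]*", value) is not None


print(strip_id("DHARMA_INSHello_1", "DHARMA_INSHello"))
print(rewrite_pointer("DHARMA_INSHello_1", "DHARMA_INSHello"))
print(rewrite_pointer("DHARMA_INSBye_1", "DHARMA_INSHello"))
```

Values that do not start with a known filename, such as bib:Coedes1942_02, pass through unchanged, so the rewrite can be applied to every pointing attribute without a separate filter.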

If it is not possible to write failsafe regex masks for all DHARMA filenames, then this might also be done using a list of actual existing DHARMA filenames. All this of course would have to be synchronised with a change in the practice of all encoders who use such references, which would not be easy. And at any rate @arlogriffiths at least will certainly have to be involved more deeply in any problems related to encoding practices found only in the EGC and not in the EGD.

I'm not sure if you have any particular problem with the other kinds of reference you list. As far as I can see, the following

<persName ref="http://viaf.org/viaf/39382787">
<licence target="https://creativecommons.org/licenses/by/4.0/">
<ref>DHARMA_IdListMembers_v01.xml</ref>
<idno type="filename">DHARMA_INSCIK00803</idno>

should not be problematic, or if they are, then they should be straightforward to change to whatever arrangement you think is preferable.

Next, the following

<rdg source="bib:Coedes1942_02">
<change who="part:axja" ...>

have, I think, been implemented carefully and work as expected. If they cause any conflicts or problems, then (apart from problems with referring to the Zotero database) these should be possible to correct globally.
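For what it is worth, prefixed references of this kind are easy to tell apart from URLs and file names mechanically. The Python sketch below assumes that the only prefixes in use are the two that appear in this thread (bib and part); the real list, and what each prefix expands to, would have to come from the project itself.

```python
from typing import Optional, Tuple

# Prefixes seen in this thread; a real list would come from the project.
KNOWN_PREFIXES = {"bib", "part"}


def split_prefixed(value: str) -> Optional[Tuple[str, str]]:
    """Return (prefix, local id) for 'bib:Coedes1942_02', else None."""
    prefix, sep, local = value.partition(":")
    if sep and prefix in KNOWN_PREFIXES and local:
        return prefix, local
    return None  # a URL, a file name, a '#fragment', ...


print(split_prefixed("bib:Coedes1942_02"))
print(split_prefixed("http://viaf.org/viaf/39382787"))
```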

Finally, to <ref target="DHARMA_INSCIK00288.xml"> I should add the type <ref n="tfa-pallava-epigraphy" target="Pallava00001.xml">, where @n has been co-opted to name the repository when the target file lives in a different repository from the one in which the reference is made. If any of these referencing systems are problematic, then we'll need to know more about the problems and the proposed solutions.