Edelweiss / hgv

Heidelberger Gesamtverzeichnis der griechischen Papyrusurkunden Ägyptens
MIT License

Recursive `reprint-in` and `reprint-from` #190

Open samosafuz opened 1 year ago

samosafuz commented 1 year ago

As noted in the discussion of #180, the dummy header of /ddbdp/rom.mil.rec;1;11 previously had reprint-in where it should have had reprint-from. As a result, both rom.mil.rec;1;11 and stud.pal;14;8C had reprint-in dummy headers pointing to one another, producing a recursive loop that no doubt confused the numbers server and prevented retrieval of the file.

We should determine whether similar recursive loops occur elsewhere, i.e., where two files have reprint-in or reprint-from dummy headers in which the values of //ref[@type = "reprint-in" or @type = "reprint-from"]/@n point to one another. Something along these lines may work:

This issue is unique to DDB. The process can be repeated for @type = "reprint-from".
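A minimal Python sketch of this check (not the XQuery used in this thread), assuming each file carries one `<idno type="ddb-hybrid">` and that `@n` on a reprint `<ref>` holds a single ddb-hybrid value. The function names and the `sample` helper, which builds made-up minimal documents, are hypothetical:

```python
import xml.etree.ElementTree as ET

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

def reprint_targets(xml_text, ref_type):
    """Return (ddb-hybrid of the file, non-empty @n values of refs of ref_type)."""
    root = ET.fromstring(xml_text)
    idno = root.find(".//tei:idno[@type='ddb-hybrid']", TEI)
    hybrid = (idno.text or "").strip() if idno is not None else ""
    targets = [
        ref.get("n", "").strip()
        for ref in root.iterfind(f".//tei:ref[@type='{ref_type}']", TEI)
    ]
    return hybrid, [t for t in targets if t]

def mutual_pairs(xml_texts, ref_type="reprint-in"):
    """Find pairs of files whose refs of the same type point at one another."""
    edges = {}
    for text in xml_texts:
        hybrid, targets = reprint_targets(text, ref_type)
        if hybrid:
            edges.setdefault(hybrid, set()).update(targets)
    pairs = set()
    for source, targets in edges.items():
        for target in targets:
            if source != target and source in edges.get(target, set()):
                pairs.add(tuple(sorted((source, target))))
    return sorted(pairs)

def sample(hybrid, ref_type, n):
    """Build a made-up minimal document for illustration only."""
    return (
        '<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader>'
        f'<idno type="ddb-hybrid">{hybrid}</idno></teiHeader>'
        f'<text><body><ref type="{ref_type}" n="{n}">x</ref></body></text></TEI>'
    )

docs = [
    sample("rom.mil.rec;1;11", "reprint-in", "stud.pal;14;8C"),
    sample("stud.pal;14;8C", "reprint-in", "rom.mil.rec;1;11"),
]
print(mutual_pairs(docs))  # prints [('rom.mil.rec;1;11', 'stud.pal;14;8C')]
```

Calling `mutual_pairs(docs, "reprint-from")` repeats the same check for the other type.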

A related (but separate) issue is instances where //ref[@type = "reprint-in" or @type = "reprint-from"]/@n has a value but the //ref element itself has no content, or where the //ref element has content but its @n is empty. These files should also be identified, so that we can populate both the attribute and the element content appropriately.
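A rough Python sketch of that second check, assuming "empty" means either a missing or blank `@n`, or a `<ref>` with no text content. The function name and the labels in the returned tuples are hypothetical:

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
REPRINT_TYPES = {"reprint-in", "reprint-from"}

def incomplete_reprint_refs(xml_text):
    """List reprint refs where exactly one of @n and the text content is filled in."""
    root = ET.fromstring(xml_text)
    problems = []
    for ref in root.iterfind(".//tei:ref", TEI_NS):
        if ref.get("type") not in REPRINT_TYPES:
            continue
        n_value = (ref.get("n") or "").strip()
        content = "".join(ref.itertext()).strip()
        if n_value and not content:
            problems.append((ref.get("type"), "empty content", n_value))
        elif content and not n_value:
            problems.append((ref.get("type"), "empty @n", content))
    return problems
```

Each tuple holds the `@type`, the kind of gap, and whichever value is present, which should be enough to locate the `<ref>` that needs populating.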

samosafuz commented 1 year ago

As usual, I made this needlessly complicated: instead of generating filenames via base-uri(), it's easier to simply use the value of <idno type="ddb-hybrid">; this way, it's also no longer necessary to convert the output of string(). Sorry.

samosafuz commented 1 year ago

I wrote an XQuery to retrieve the following values for $file in collection("/db/papyri/idp.data/DDB_EpiDoc_XML")//tei:ref[@type = "reprint-in" or @type = "reprint-from"]:

The results are now in a Google sheet: https://docs.google.com/spreadsheets/d/1HXyRmGZ5qnBULswYcZIul1OAf84mjUuNMW2NiChuXUA/edit#gid=0

I haven't tried to line up the items that recursively point to one another yet, but this sheet is helpful for identifying places where reprint-in and reprint-from both appear in the @type column, and where @n is empty.

Edelweiss commented 6 days ago

As samosafuz pointed out, there are various muddles to deal with before being able to do a proper search for cycles on the graph of reprints (e.g. p.oxy;44;3208|chla;47;1420|c.ep.lat;;10), so as not to try following a piece of reprint information with a wrong ddb-hybrid or end up in a blind alley because the @n attribute is missing.
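As a sketch of what such a search could look like once the data is clean, the following Python function runs a depth-first search over a mapping from each ddb-hybrid to the targets named in its refs of one `@type`. The mapping format and the function name are assumptions. It reports the cycle closed by each back edge it meets (not necessarily every simple cycle) and lists separately the targets that match no known file, i.e. the blind alleys:

```python
def find_cycles(edges):
    """Return (cycles, dangling) for a mapping of ddb-hybrid -> list of targets."""
    # Targets that do not correspond to any known file.
    dangling = sorted(
        (source, target)
        for source, targets in edges.items()
        for target in targets
        if target not in edges
    )
    cycles = set()
    state = {}  # node -> "active" while on the current path, "done" afterwards

    def visit(node, path):
        state[node] = "active"
        path.append(node)
        for target in edges.get(node, ()):
            if target not in edges:
                continue  # blind alley, already recorded as dangling
            if state.get(target) == "active":
                cycle = path[path.index(target):]
                start = cycle.index(min(cycle))  # rotate to a canonical start
                cycles.add(tuple(cycle[start:] + cycle[:start]))
            elif target not in state:
                visit(target, path)
        path.pop()
        state[node] = "done"

    for node in sorted(edges):
        if node not in state:
            visit(node, [])
    return sorted(cycles), dangling
```

The recursion assumes reprint chains are short; a very long chain would call for an explicit stack instead.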

jcowey spotted several clusters of erroneous or incomplete mark-up that lead to muddled reprint information, such as:

He suggested opening tickets to tackle these faults beforehand.