DCLP / dclpxsltbox

Sandbox for development, testing, and review of XSLT for DCLP
http://dclp.github.io/dclpxsltbox/

Malformed dclp-hybrid values #317

Closed hcayless closed 6 years ago

hcayless commented 6 years ago

There are 233 DCLP documents that have broken dclp-hybrid <idno> values. A full list can be found at https://gist.github.com/hcayless/0c99cb6af2b27239f397ca854e52e677. They all seem to be P.Herc. docs.

This error prevents correct indexing of the documents for search.

jcowey commented 6 years ago

@HolgerEssler this needs to be sorted out as soon as we can manage. Do you want to have all P.Herc. publications collected under one standardised dclp-hybrid so that they resemble e.g. p.oxy;12;1234? If the answer is yes, then we have to change <idno type="dclp-hybrid">P.Herc. 1120</idno> into <idno type="dclp-hybrid">p.herc;;1120</idno> (that assumes no volume).

If the answer is no, then we replace these dclp-hybrid with the relevant "na;;23456" value, that is "na" (viz. no author) followed by the TM number.

HolgerEssler commented 6 years ago

Yes, please change <idno type="dclp-hybrid">P.Herc. 1120</idno> into <idno type="dclp-hybrid">p.herc;;1120</idno>. I suppose <idno type="dclp-hybrid">P.Herc. 1043 + 1045</idno> should then become <idno type="dclp-hybrid">p.herc;;1043;1045</idno> and <idno type="dclp-hybrid">P.Herc. 419, 697, 1634</idno> should become <idno type="dclp-hybrid">p.herc;;419;697;1634</idno>. Would that be ok?

hcayless commented 6 years ago

I would recommend something like:

<idno type="dclp-hybrid">P.Herc. 1043 + 1045</idno> -> <idno type="dclp-hybrid">p.herc;;1043+1045</idno> and <idno type="dclp-hybrid">P.Herc. 419, 697, 1634</idno> -> <idno type="dclp-hybrid">p.herc;;419,697,1634</idno>
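The mapping proposed here can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this thread (lowercase series, empty volume slot, separators `+` and `,` kept, spaces dropped); the function name `normalize_pherc` is hypothetical and is not part of any DCLP tooling.

```python
import re


def normalize_pherc(value: str) -> str:
    """Turn 'P.Herc. 1043 + 1045' into 'p.herc;;1043+1045'.

    Assumption: every broken value starts with the literal 'P.Herc.'
    and has no volume, so the middle slot stays empty.
    """
    match = re.fullmatch(r"\s*P\.Herc\.\s*(.+?)\s*", value)
    if match is None:
        raise ValueError(f"not a P.Herc. value: {value!r}")
    # Remove whitespace inside the number part; '+' and ',' are kept as-is.
    numbers = re.sub(r"\s+", "", match.group(1))
    return f"p.herc;;{numbers}"
```

Values that do not begin with `P.Herc.` raise an error rather than being guessed at, so anything unexpected in the list of 233 documents would surface for manual review.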

paregorios commented 6 years ago

@jcowey can this be done in Heidelberg?

On Thu, Aug 17, 2017 at 10:58 AM Hugh A. Cayless notifications@github.com wrote:

I would recommend something like:

P.Herc. 1043 + 1045 -> p.herc;;1043+1045

and

P.Herc. 419, 697, 1634 -> p.herc;;419,697,1634



hcayless commented 6 years ago

Possibly this warrants a new ticket, but a number of DCLP records have other problems in the dclp-hybrid idno, namely characters that cause problems in processing. See the list below:

https://github.com/DCLP/idp.data/tree/master/DCLP/220/219977.xml: o.frangé;;438
https://github.com/DCLP/idp.data/tree/master/DCLP/220/219978.xml: o.frangé;;439
https://github.com/DCLP/idp.data/tree/master/DCLP/221/220283.xml: o.frangé;;745
https://github.com/DCLP/idp.data/tree/master/DCLP/51/50747.xml: o.wångstedt;;80
https://github.com/DCLP/idp.data/tree/master/DCLP/59/58962.xml: p.genève[horssérie];;1
https://github.com/DCLP/idp.data/tree/master/DCLP/60/59648.xml: p.genève[horssérie];;3
https://github.com/DCLP/idp.data/tree/master/DCLP/63/62158.xml: p.genève[horssérie];;6
https://github.com/DCLP/idp.data/tree/master/DCLP/63/62913.xml: p.genève[horssérie];;2
https://github.com/DCLP/idp.data/tree/master/DCLP/64/63053.xml: p.murabba'ât;2;108
https://github.com/DCLP/idp.data/tree/master/DCLP/64/63210.xml: p.murabba'ât;2;109
https://github.com/DCLP/idp.data/tree/master/DCLP/64/63211.xml: p.murabba'ât;2;110
https://github.com/DCLP/idp.data/tree/master/DCLP/64/63212.xml: p.murabba'ât;2;111
https://github.com/DCLP/idp.data/tree/master/DCLP/64/63324.xml: p.murabba'ât;2;112
https://github.com/DCLP/idp.data/tree/master/DCLP/66/65799.xml: p.murabba'ât;2;122
https://github.com/DCLP/idp.data/tree/master/DCLP/70/69159.xml: p.demarée;;5

paregorios commented 6 years ago

@jcowey ?

jcowey commented 6 years ago

pretty sure I know how to fix this and will do so

o.frangé;;438 => o.frange;;438; as in ddbdp
o.wångstedt;;80 => o.wangstedt;;80; will have to be added to collection.rdf
p.genève[horssérie];;1 => p.geneve[horsserie];;1; will have to be added to collection.rdf
p.murabba'ât;2;108 => p.mur;2;108; as in ddbdp
p.demarée;;5 => p.demaree;;5; will have to be added to collection.rdf

is that analysis correct @hcayless ?
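Most of the replacements listed in this comment amount to stripping diacritics from the series name. A minimal Python sketch of that step, assuming Unicode decomposition is an acceptable way to do it (the helper name `fold_diacritics` is hypothetical):

```python
import unicodedata


def fold_diacritics(value: str) -> str:
    """Replace accented letters with their base letters (é -> e, å -> a)."""
    # NFD splits each accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", value)
    # Drop the combining marks and keep everything else unchanged.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

This covers o.frangé, o.wångstedt, p.genève[horssérie] and p.demarée. It does not produce p.mur from p.murabba'ât: that change is an abbreviation taken over from ddbdp and would need an explicit lookup. Square brackets are also left untouched.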

hcayless commented 6 years ago

I'm not sure how the square brackets will play. We'll have to see.

jcowey commented 6 years ago

So have now created a new issue #324, to keep these two separate.

jcowey commented 6 years ago

@hcayless would you please check that <idno type="dclp-hybrid">p.herc;;.+</idno> is now fine. There should now be no more <idno type="dclp-hybrid">P.Herc. left in https://github.com/DCLP/idp.data/tree/master/DCLP

I have made a number of commits to make the required corrections.

Edelweiss commented 6 years ago

files that still need repair

./63/62411.xml: P.Herc. 228, 403, 407, 1425, 1581
./63/62425.xml: P.Herc. 495
./63/62426.xml: P.Herc. 558
./63/62476.xml: P.Herc. 1471

change to…

./63/62411.xml: p.herc;;228,403,407,1425,1581
./63/62425.xml: p.herc;;495
./63/62426.xml: p.herc;;558
./63/62476.xml: p.herc;;1471

Edelweiss commented 6 years ago

A case-sensitive search for P.Herc. in the XPath tei:idno[@type='dclp-hybrid'] did not turn up any further idnos of this kind.
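A check of this kind can be reproduced with Python's standard library. The sketch below is an assumption about how such a search might be scripted, not a record of the tool actually used; the function name `leftover_pherc_idnos` is hypothetical, and only the TEI namespace URI and the XPath from this comment are taken as given.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace, bound to the 'tei' prefix used in the XPath.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def leftover_pherc_idnos(xml_text: str) -> list:
    """Return dclp-hybrid idno values that still start with 'P.Herc.'."""
    root = ET.fromstring(xml_text)
    hits = []
    for idno in root.iterfind(".//tei:idno[@type='dclp-hybrid']", TEI_NS):
        text = (idno.text or "").strip()
        # str.startswith is case-sensitive, so 'p.herc;;...' is not reported.
        if text.startswith("P.Herc."):
            hits.append(text)
    return hits
```

Running this over each file's contents and collecting the non-empty results would give a list of files that still need repair.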

Edelweiss commented 6 years ago

https://github.com/DCLP/idp.data/tree/issue317

(in development and master)

Edelweiss commented 6 years ago

Files can be viewed on github and will be picked up with the next sync.