eXistSolutions / LGPN


[book] indicate l. if only lines differ #284

Open tuurma opened 4 years ago

tuurma commented 4 years ago

Same book, same text, line numbers indicated

IGLS XXI.5 29, 1 -> IGLS XXI.5 29, 1

IGLS XXI.5 29, 2 -> ib. l. 2

image

MT: ... detecting line numbers in the details field can be problematic. I would suggest creating a separate line field to hold them explicitly. Then I could try to automate the line number extraction on a per-abbreviation basis, similar to what we did when systematizing volume numbers. Would you have any suggestions for common patterns? The final number after the comma often seems to be the line number, but I've also seen entries like p. 39 no. 3, so I wonder if the number after no. is also the line? Then there are entries like A Pers. 29, 302, 972 where I don't suppose 972 is a line number?

RC: In the past, there were very strict rules governing the use of commas, essentially for distinguishing line numbers. That is no longer the case, so I am rather at a loss to suggest how to resolve this problem. I think it is probably true that the vast majority of commas relate to line numbers, but there are also strings of numbers referring to chapters separated by commas. It would have been better in retrospect if semi-colons had been used instead. Is there any way of generating a list which would not totally overwhelm us with irrelevant entries? In the two examples you cite, what follows the comma in each case is in fact a line number. But you are likely to find others which are not (e.g. J., +BJ or J., +AJ).

tuurma commented 4 years ago

Ordering the abbreviations by number of references there are:

I'd suggest concentrating on the most common abbreviations to figure out what the predominant patterns are.

Initial results for IGLS show that the majority of entries matching the , (\d)+$ pattern (ending with , number) could be automatically converted (a bit below 3k cases out of ~9k total IGLS references).
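As a rough illustration of the final comma-number pattern, here is a minimal Python sketch (not the project's actual XQuery tooling). The function name `split_final_line` and the exact regular expression are assumptions made for this example only.

```python
import re

# Assumed pattern: the reference ends with a comma, optional spaces, and digits.
FINAL_LINE = re.compile(r"^(?P<details>.*\S),\s*(?P<line>\d+)$")


def split_final_line(ref):
    """Return (details, line) if ref ends with ', <number>', else (ref, None)."""
    m = FINAL_LINE.match(ref)
    if m is None:
        return ref, None
    return m.group("details"), m.group("line")


# Matches the simple case from the thread.
print(split_final_line("IGLS XXI.5 29, 2"))
# No comma before the final number, so nothing is extracted.
print(split_final_line("p. 39 no. 3"))
# Caution: this also "matches", although 972 is probably not a line number,
# which is why the conversion is reviewed per abbreviation.
print(split_final_line("A Pers. 29, 302, 972"))
```

The last call shows why a purely syntactic match is not enough on its own: the pattern cannot tell a line number from a chapter or section number.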

tuurma commented 4 years ago

As a preparatory step I extended our xml template to store the line number explicitly

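declare namespace tei="http://www.tei-c.org/ns/1.0";

(: Add an empty line-number note to every non-volume bibl that does not have one yet :)
for $bibl in collection('/db/apps/lgpn-data/data/persons')//tei:bibl[not(@type='volume')][not(tei:note[@type='line'])]
let $add := <note xmlns="http://www.tei-c.org/ns/1.0" type="line"/>
return
    (: eXist-db update extension: place the new note right after the bibl's ref :)
    update insert $add following $bibl/tei:ref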

and adjusted the input form accordingly; please note that the Linking field has been moved up and is now placed in the same row as Line

image

tuurma commented 4 years ago

@michaelzellmann I have prepared a conversion list, in the first instance tackling just the most popular abbreviations and the simple cases that end with the , number pattern. Could you have a glance at the conversion suggestions below and let me know if they look reasonable?

IGLS

SEG

IG

michaelzellmann commented 4 years ago

Many thanks, Magdalena, the three lists look ok to me. Should I be able to see anything by clicking on the links at right? Right now I see only this error:

image

tuurma commented 4 years ago

Thanks, I've fixed the link so it leads to the person input form.

I will run the conversion now for IGLS, SEG and IG and attach the logs here.

singlecomma-log.zip

tuurma commented 4 years ago

After running the conversion, other cases remain that contain a comma but do not match the pattern of a final comma and number:

SEG

  1. SEG XLVIII 1868, [1] Μαρώνις (comma and [number])
  2. SEG XLI 1530, 8, 75 Ζώη (multiple commas)
  3. SEG LV 1053 A, 9; B, 15 Οὐεττινιανός
  4. SEG XLIII 1026B, D Μαρῖνος

Could you please confirm whether the following handling is appropriate:

  1. treat number in [] as a line number -> l. [1]
  2. treat final comma-separated numbers as line number -> l. 8, 75
  3. split into two bibl. entries? LV 1053 A l. 9 and LV 1053 B l. 15
  4. leave as is, I suspect B and D are not line numbers?
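
The four proposed handlings could be sketched roughly as follows. This is a Python illustration under stated assumptions, not the conversion script that was actually run: the name `split_reference`, the regular expression, and the rule for case 3 (replace the last token of the first segment with the letter from each later segment) are all hypothetical.

```python
import re

# Assumed pattern: one or more plain or [bracketed] numbers,
# comma-separated, at the very end of the reference.
LINE_SERIES = re.compile(
    r"^(?P<details>.*?\S),\s*"
    r"(?P<lines>(?:\d+|\[\d+\])(?:,\s*(?:\d+|\[\d+\]))*)$"
)


def split_reference(ref):
    """Return a list of (details, lines) pairs for one reference string.

    Cases 1 and 2: trailing (bracketed) numbers become the line value.
    Case 3: ';'-separated segments become separate entries.
    Case 4: anything that does not match is left as is, with lines=None.
    """
    results = []
    prefix = ""
    for i, part in enumerate(p.strip() for p in ref.split(";")):
        m = LINE_SERIES.match(part)
        if m is None:
            return [(ref, None)]  # case 4: leave the whole reference untouched
        details = m.group("details")
        if i == 0:
            # Remember everything before the last token ("A" in "... 1053 A").
            prefix = details.rpartition(" ")[0]
        elif prefix:
            details = f"{prefix} {details}"
        results.append((details, m.group("lines")))
    return results


for example in (
    "SEG XLVIII 1868, [1]",
    "SEG XLI 1530, 8, 75",
    "SEG LV 1053 A, 9; B, 15",
    "SEG XLIII 1026B, D",
):
    print(example, "->", split_reference(example))
```

Returning a list keeps the single-entry and split-entry cases uniform, so a caller can always iterate over the result.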

tuurma commented 4 years ago

IGLS

  1. IGLS II 466, [2] -> same as SEG case 1
  2. IGLS XVI (1) 289, 1, 3 -> same as SEG case 2
  3. IGLS XVII (1) 477 a, 1; b, 2 -> same as SEG case 3
  4. IGLS III (2) 1183, 3, 21, 31 -> multiple line numbers, variant of case 2
  5. IGLS XVII (1) 536 a, 1; b, 1; c, 2 -> multiple entries, variant of case 3

tuurma commented 4 years ago

IG: very few remaining cases, like IG XI (4) 772, 3, 15 (same as SEG case 2); the rest could be handled manually

michaelzellmann commented 4 years ago

Please see below for my answers after each item.

  1. treat number in [] as a line number -> l. [1]

Correct

  2. treat final comma-separated numbers as line number -> l. 8, 75

Correct

  3. split into two bibl. entries? LV 1053 A l. 9 and LV 1053 B l. 15

Correct

  4. leave as is, I suspect B and D are not line numbers?

Correct, B and D are part of the “details” and not the line number

tuurma commented 4 years ago

As we're slowly converting database entries, I'm now working on the LaTeX generating scripts

Here's a test case: for Γέμελλα in Heliopolis we should have

(2) IGLS vi 2751, 3 (3) ib. l.4

Original bibl. entry for (3) is IGLS vi 2751, 4

image

michaelzellmann commented 4 years ago

Correct, thanks. I am still working through your list of the Yes / Maybe / No entries.

tuurma commented 4 years ago

Yes, I saw you were working in the Google doc, many thanks!

Meanwhile I have made some progress with presenting ib. with lines, but I need to test that there are no regressions in other cases

image

michaelzellmann commented 4 years ago

Might be worth checking with Richard, but I believe there should be a space after l., i.e. here “ib. l. 4”

tuurma commented 4 years ago

Thanks, fixed

image
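
The ib. rendering discussed in this exchange, with the space after l., might look roughly like this. It is a simplified Python sketch, not the LaTeX-generating scripts themselves; `render_references` and the (details, line) input shape are assumptions, and only the "same book, same text, different line" case is covered.

```python
def render_references(refs):
    """Render (details, line) pairs in print order.

    When an entry repeats the previous entry's details (same book, same text)
    and carries a line, only "ib. l. <line>" is printed.
    """
    rendered = []
    previous = None
    for details, line in refs:
        if previous is not None and details == previous and line:
            rendered.append(f"ib. l. {line}")
        elif line:
            rendered.append(f"{details}, {line}")
        else:
            rendered.append(details)
        previous = details
    return rendered


# The Γέμελλα test case from the thread.
print(render_references([("IGLS vi 2751", "3"), ("IGLS vi 2751", "4")]))
# → ['IGLS vi 2751, 3', 'ib. l. 4']
```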

tuurma commented 4 years ago

Thanks to Michael's list I could convert further entries matching the final comma-number pattern for the following abbreviations (log file attached)

 "IPalTertia", "ISyrie", "AAES", "ITyr", "IGerasa", "MUSJ", "ZDPV", "IWadi_Haggag", "YCS", "Nessana", "IJO", "Hajjar", "IPalTertia_west", "Dussaud_Macler_Mission", "IMSoueida", "SEMA", "INegev", "Lörincz", "PEQ", "DainIGLouvre", "MFO",  "Mouterde_Limes", "BCH", "ILS", "IIasos", "CIJ", "IDR", "Ovadiah_MPI", "Resafa", "FroehnerInscrLouvre", "SBF", "PMasada", "Topoi", "PferdehirtMilitärdiplome", "IGR", "KayserRecueil", "Mittmann_Beiträge", "ISmyrna", "RMD", "Clermont_Ganneau_RAO", "DOP", "IAntMaroc", "BAAL", "IAquil", "RA", "JIWE", "Pall", "Brünnow_Domaszewski_PA", "IEJ", "MendelCat", "CrowfootObjectsfromSamaria", "Old_Syriac_Inscriptions"

Here are the counts of entries, per abbreviation, that currently have the line field filled: singlecomma-Michaelslist-log.html.zip

tuurma commented 4 years ago

After converting the single comma-number pattern matches for selected abbreviations yesterday, today I've prepared the conversion for patterns where there are multiple comma-separated numbers at the end and/or some numbers are in brackets (cases 1 and 2 as discussed here)

I've run the would-be conversion (generating new values but without applying them) for a handful of the most common abbreviations: biblLines.pdf

Looking at these results, I'd suggest that we

  1. go ahead and apply this pattern for "IGLS", "SEG", "CIIP", "IG", "TEAD", "ISyrie", "IMnBeyrouth", "AAES"
  2. but refrain from doing so for "PDura", "PNess", "J"

There are no matches for other most common abbreviations: "ChLA", "RE", "Meimaris_Chronological_Systems", "FRA", "SchiefferACOIndexProsopogr", "DCB", "IPalTertia", "PLRE", "Justi", "IMoab", "PIR2"
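
A dry run restricted to the approved abbreviations could be sketched like this in Python. The names `APPLY_TO` and `dry_run`, the regular expression, and the assumption that the abbreviation and the details string are stored separately are all hypothetical; the "J" entry in the usage example is made up purely to show the skip.

```python
import re

# Abbreviations proposed for automatic conversion in this step.
APPLY_TO = {"IGLS", "SEG", "CIIP", "IG", "TEAD", "ISyrie", "IMnBeyrouth", "AAES"}

# Assumed pattern: trailing comma-separated numbers, some possibly in brackets.
LINE_SERIES = re.compile(
    r"^(?P<details>.*?\S),\s*"
    r"(?P<lines>(?:\d+|\[\d+\])(?:,\s*(?:\d+|\[\d+\]))*)$"
)


def dry_run(entries):
    """Propose new (details, line) values without changing anything.

    entries: iterable of (abbreviation, details) pairs.
    Returns a list of (abbreviation, old_details, new_details, line) tuples.
    """
    proposals = []
    for abbreviation, details in entries:
        if abbreviation not in APPLY_TO:
            continue  # e.g. "PDura", "PNess", "J" are left for manual review
        m = LINE_SERIES.match(details)
        if m:
            proposals.append(
                (abbreviation, details, m.group("details"), m.group("lines"))
            )
    return proposals


sample = [
    ("IG", "XI (4) 772, 3, 15"),
    ("J", "1, 2"),  # hypothetical entry, only here to show that "J" is skipped
]
for proposal in dry_run(sample):
    print(proposal)
```

Keeping the proposal step separate from any write makes it possible to review the output before touching the data.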

michaelzellmann commented 4 years ago

Thanks, this looks ok for 1. Definitely not “J” in 2. as that is a literary text, it has no line numbers. PDura and PNess will be mostly long strings with many line numbers separated by commas, which can be done manually if not automated.

tuurma commented 4 years ago

Thanks for the super-fast response, I will run it in the evening then (after 6pm in Oxford and after triggering a backup, as usual)

tuurma commented 4 years ago

I've just run the conversion for "IGLS", "SEG", "CIIP", "IG", "TEAD", "ISyrie", "IMnBeyrouth", "AAES"; logs are attached.

Current numbers for entries with the line field filled:

finalcommaseries.pdf

finalcommaseries-log.zip