[book] indicate l. if only lines differ

tuurma commented 4 years ago

Same book, same text, line numbers indicated

IGLS XXI.5 29, 1 -> IGLS XXI.5 29, 1
IGLS XXI.5 29, 2 -> ib. l. 2

MT: ... detecting line numbers in the details field can be problematic. I would suggest creating a separate line field to hold them explicitly. Then I could try to automate the line number extraction on a per-abbreviation basis, similar to what we did when systematizing volume numbers. Would you have any suggestions for common patterns? The final number after the comma often seems to be the line number, but I've also seen entries like p. 39 no. 3, so I wonder if the number after no. is also the line? Then there are entries like A Pers. 29, 302, 972 where I don't suppose 972 is a line number?

RC: In the past, there were very strict rules governing the use of commas, essentially for distinguishing line numbers. That is no longer the case, so I am rather at a loss to suggest how to resolve this problem. I think it is probably true that the vast majority of commas relate to line numbers, but there are also strings of numbers referring to chapters separated by commas. It would have been better in retrospect if semi-colons had been used instead. Is there any way of generating a list which would not totally overwhelm us with irrelevant entries? In the two examples you cite, what follows the comma in each case is in fact a line number. But you are likely to find others which are not (e.g. J., +BJ or J., +AJ).

tuurma commented 4 years ago

Ordering the abbreviations by number of references, there are:

9 abbreviations with > 1000 references (IGLS, SEG, PDura, CIIP, ChLA, RE, IG, Meimaris_Chronological_Systems)
84 > 100
128 > 50
235 > 20

I'd suggest concentrating on the most common abbreviations to figure out what the predominant patterns are.

Initial results for IGLS show that the majority of entries matching the , (\d)+$ pattern (ending with , number) could be converted automatically: a bit below 3k cases out of ~9k total IGLS references.

IGLS: ending with , number ~4k total, single comma ~3k
SEG: ending with , number ~6.3k total, single comma ~2.9k
PDura: ending with , number ~2k total, with comma only about 300 but much more variation, may require some manual checks first
CIIP: ending with , number ~2k total, not many with comma; check the dot in entries like CIIP I (2) 842.15 Αβιδελλα
ChLA: very few with commas
IG: of those with commas, the majority are simple to convert (650 with the single-comma pattern); some with dots, some with no.
Meimaris: the majority have no., e.g. Meimaris, +Chronological +Systems p. 189 no. 103 Ιδδος
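
To make the "single comma" case concrete, here is a minimal sketch in Python (used only for illustration; the tooling in this thread is XQuery). It assumes the details string has already been separated from the abbreviation and from the trailing name, and the function name split_final_line is hypothetical. Anything with more than one comma, a dot, or no. is deliberately left untouched.

```python
import re

# "Single comma" case: no comma anywhere except the one that precedes
# the final number. Assumption: the string ends with that number.
SINGLE_COMMA = re.compile(r"^(?P<details>[^,]*?)\s*,\s*(?P<line>\d+)$")


def split_final_line(details):
    """Return (details, line); line is None when the pattern does not match."""
    m = SINGLE_COMMA.match(details.strip())
    if m is None:
        return details, None
    return m.group("details"), m.group("line")


if __name__ == "__main__":
    for example in [
        "XXI.5 29, 1",      # simple case: one comma, final number
        "XLI 1530, 8, 75",  # several commas: left alone
        "I (2) 842.15",     # dot instead of comma: left alone
        "p. 189 no. 103",   # "no." entry: left alone
    ]:
        print(example, "->", split_final_line(example))
```

Being this conservative means entries such as A Pers. 29, 302, 972 are never split automatically, which matches the concern raised earlier that not every final number is a line number.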

tuurma commented 4 years ago

As a preparatory step I extended our xml template to store the line number explicitly

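declare namespace tei="http://www.tei-c.org/ns/1.0";

(: Add an empty <note type="line"/> after the <ref> of every non-volume bibl
   that does not have one yet (uses eXist-db's XQuery update extension). :)
for $bibl in collection('/db/apps/lgpn-data/data/persons')//tei:bibl[not(@type='volume')][not(tei:note[@type='line'])]
let $add := <note xmlns="http://www.tei-c.org/ns/1.0" type="line"/>
return
    update insert $add following $bibl/tei:ref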

and adjusted the input form accordingly; please note that the Linking field has been moved up and is now placed in the same row as Line
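
For readers less familiar with eXist-db's update syntax, the same template change can be sketched with Python's standard library. This is only an illustration of the logic (skip volume entries, skip entries that already have a line note, insert the empty note right after ref); the function name add_line_notes and the sample record in the usage lines are hypothetical and not taken from the LGPN data.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI)  # serialize TEI as the default namespace


def add_line_notes(root):
    """Insert an empty <note type="line"/> after <ref> where it is missing.

    Returns the number of bibl elements that were changed.
    """
    added = 0
    # list(...) so the tree is not modified while it is being walked
    for bibl in list(root.iter(f"{{{TEI}}}bibl")):
        if bibl.get("type") == "volume":
            continue
        notes = bibl.findall(f"{{{TEI}}}note")
        if any(n.get("type") == "line" for n in notes):
            continue
        ref = bibl.find(f"{{{TEI}}}ref")
        if ref is None:
            continue
        position = list(bibl).index(ref)
        bibl.insert(position + 1, ET.Element(f"{{{TEI}}}note", {"type": "line"}))
        added += 1
    return added


if __name__ == "__main__":
    # Hypothetical record, invented for this example
    sample = ET.fromstring(
        f'<person xmlns="{TEI}"><bibl><ref>IGLS</ref>'
        f'<note type="details">vi 2751, 3</note></bibl></person>'
    )
    print(add_line_notes(sample))
    print(ET.tostring(sample, encoding="unicode"))
```

Running the function a second time changes nothing, which mirrors the not(tei:note[@type='line']) guard in the query.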

tuurma commented 4 years ago

@michaelzellmann I have prepared a conversion list, in the first instance tackling just the most popular abbreviations and the simple cases that end with the , number pattern. Could you have a glance at the conversion suggestions below and let me know if they look reasonable?

IGLS

SEG

IG

michaelzellmann commented 4 years ago

Many thanks, Magdalena, the three lists look ok to me. Should I be able to see anything by clicking on the links at right? Right now I see only this error:

[inline image of the error, not preserved]

tuurma commented 4 years ago

Thanks, I've fixed the link so it leads to the person input form.

I will run the conversion now for IGLS, SEG and IG and attach the logs here.

singlecomma-log.zip

tuurma commented 4 years ago

After running the conversion, other cases remain that contain a comma but do not match the pattern of final comma and number

SEG

1. SEG XLVIII 1868, [1] Μαρώνις (comma and [number])
2. SEG XLI 1530, 8, 75 Ζώη (multiple commas)
3. SEG LV 1053 A, 9; B, 15 Οὐεττινιανός
4. SEG XLIII 1026B, D Μαρῖνος

Could you please confirm whether the following handling is appropriate

1. treat number in [] as a line number -> l. [1]
2. treat final comma-separated numbers as line number -> l. 8, 75
3. split into two bibl. entries? LV 1053 A l. 9 and LV 1053 B l. 15
4. leave as is, I suspect B and D are not line numbers?
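
The four proposed handlings can be sketched as a small classifier. This is a Python illustration under stated assumptions, not the conversion script that was actually run: the regular expressions, the function name split_lines, and the requirement that a lettered part is a single letter preceded by whitespace are all guesses based only on the examples quoted above. Input is the details string without abbreviation and name.

```python
import re

NUM = r"\[?\d+\]?"                    # 8 or [1]
SERIES = rf"{NUM}(?:\s*,\s*{NUM})*"   # 8, 75

# Case 3: lettered parts with their own lines, e.g. "LV 1053 A, 9; B, 15"
PARTS = re.compile(
    rf"^(?P<text>.*?\d)\s+"
    rf"(?P<parts>[A-Za-z]\s*,\s*{SERIES}(?:\s*;\s*[A-Za-z]\s*,\s*{SERIES})*)$"
)
# Cases 1 and 2: one comma, then bracketed and/or comma-separated numbers
TAIL = re.compile(rf"^(?P<text>[^,]*?)\s*,\s*(?P<lines>{SERIES})$")


def split_lines(details):
    """Return a list of (details, line) pairs; line is None when left as is."""
    details = details.strip()
    m = PARTS.match(details)
    if m:
        entries = []
        for part in re.split(r"\s*;\s*", m.group("parts")):
            letter, lines = re.split(r"\s*,\s*", part, maxsplit=1)
            entries.append((f"{m.group('text')} {letter}", lines))
        return entries  # one bibl. entry per lettered part
    m = TAIL.match(details)
    if m:
        return [(m.group("text"), m.group("lines"))]
    return [(details, None)]  # case 4: nothing recognisable as a line


if __name__ == "__main__":
    for example in ["XLVIII 1868, [1]", "XLI 1530, 8, 75",
                    "LV 1053 A, 9; B, 15", "XLIII 1026B, D"]:
        print(example, "->", split_lines(example))
```

The lettered-part pattern is tried first because a string like LV 1053 A, 9; B, 15 also contains commas and would otherwise need a second pass to be split into two entries.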

tuurma commented 4 years ago

IGLS

IGLS II 466, [2] -> same as SEG case 1
IGLS XVI (1) 289, 1, 3 -> same as SEG case 2
IGLS XVII (1) 477 a, 1; b, 2 -> same as SEG case 3
IGLS III (2) 1183, 3, 21, 31 -> multiple line numbers, variant of case 2
IGLS XVII (1) 536 a, 1; b, 1; c, 2 -> multiple entries, variant of case 3

tuurma commented 4 years ago

IG: very few remaining cases, like IG XI (4) 772, 3, 15 (same as SEG case 2); the rest could be handled manually

michaelzellmann commented 4 years ago

Please see below for answers between lines

> treat number in [] as a line number -> l. [1]

Correct

> treat final comma-separated numbers as line number -> l. 8, 75

Correct

> split into two bibl. entries? LV 1053 A l. 9 and LV 1053 B l. 15

Correct

> leave as is, I suspect B and D are not line numbers?

Correct, B and D are part of the “details” and not the line number

tuurma commented 4 years ago

As we're slowly converting database entries, I'm now working on the LaTeX generating scripts

Here's a test case: for Γέμελλα in Heliopolis we should have

(2) IGLS vi 2751, 3 (3) ib. l.4

Original bibl. entry for (3) is IGLS vi 2751, 4
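
The rule being tested here (same book, same text, only the line differs -> ib. l. N) can be sketched as follows. This is a Python illustration, not the LaTeX-generating script itself; the Ref type, the render function and its start parameter are hypothetical names, and the fallback for any other combination (print the reference in full) is an assumption. The space after l. follows the form agreed later in this thread.

```python
from typing import NamedTuple, Optional


class Ref(NamedTuple):
    abbrev: str           # e.g. "IGLS"
    details: str          # e.g. "vi 2751"
    line: Optional[str]   # e.g. "3"; None when no line is recorded


def render(refs, start=1):
    """Render numbered references, abbreviating repeats to 'ib. l. N'."""
    out = []
    prev = None
    for number, ref in enumerate(refs, start=start):
        same_text = (
            prev is not None
            and (ref.abbrev, ref.details) == (prev.abbrev, prev.details)
        )
        if same_text and ref.line and ref.line != prev.line:
            body = f"ib. l. {ref.line}"
        else:
            # Assumption: everything else is printed in full
            body = f"{ref.abbrev} {ref.details}"
            if ref.line:
                body += f", {ref.line}"
        out.append(f"({number}) {body}")
        prev = ref
    return " ".join(out)


if __name__ == "__main__":
    refs = [Ref("IGLS", "vi 2751", "3"), Ref("IGLS", "vi 2751", "4")]
    print(render(refs, start=2))
```

Comparing abbreviation and details separately from the line is only possible because the line number now lives in its own field; with the line still embedded in the details string the two entries would look like different texts.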

michaelzellmann commented 4 years ago

Correct, thanks. I am still working through your list of the Yes / Maybe / No entries.

tuurma commented 4 years ago

Yes, I saw you were working in the Google doc, many thanks!

Meanwhile I've made some progress with presenting ib. with lines, but I need to test that there are no regressions in other cases

michaelzellmann commented 4 years ago

Might be worth checking with Richard but I believe there should be a space after l., i.e. here “ib. l. 4”

tuurma commented 4 years ago

Thanks, fixed

tuurma commented 4 years ago

Thanks to Michael's list I could convert further entries matching the final comma-number pattern for the following abbreviations (log file attached)

"IPalTertia", "ISyrie", "AAES", "ITyr", "IGerasa", "MUSJ", "ZDPV", "IWadi_Haggag", "YCS", "Nessana", "IJO", "Hajjar", "IPalTertia_west", "Dussaud_Macler_Mission", "IMSoueida", "SEMA", "INegev", "Lörincz", "PEQ", "DainIGLouvre", "MFO", "Mouterde_Limes", "BCH", "ILS", "IIasos", "CIJ", "IDR", "Ovadiah_MPI", "Resafa", "FroehnerInscrLouvre", "SBF", "PMasada", "Topoi", "PferdehirtMilitärdiplome", "IGR", "KayserRecueil", "Mittmann_Beiträge", "ISmyrna", "RMD", "Clermont_Ganneau_RAO", "DOP", "IAntMaroc", "BAAL", "IAquil", "RA", "JIWE", "Pall", "Brünnow_Domaszewski_PA", "IEJ", "MendelCat", "CrowfootObjectsfromSamaria", "Old_Syriac_Inscriptions"

Here are the counts of entries for each abbreviation that currently have the line field filled: singlecomma-Michaelslist-log.html.zip

IGLS 3599
SEG 2952
IG 661
CIIP 265
IGerasa 260
PDura 232
IWadi_Haggag 188
ITyr 170
Nessana 149
AAES 141
IMSoueida 106
ISyrie 104
IPalTertia_west 103
SEMA 61
PEQ 49
IIasos 46
DainIGLouvre 44
IDR 38
INegev 37
Dussaud_Macler_Mission 34
MUSJ 34
YCS 32
Mouterde_Limes 31
MFO 30
BCH 27
KayserRecueil 24
PferdehirtMilitärdiplome 23
ISmyrna 22
RMD 21
FroehnerInscrLouvre 21
CIJ 18
IAquil 15
IPalTertia 14
MendelCat 13
IAntMaroc 13
Mittmann_Beiträge 12
JIWE 12
PMasada 11
Clermont_Ganneau_RAO 11
ZDPV 10
SBF 10
CrowfootObjectsfromSamaria 9
Brünnow_Domaszewski_PA 8
DOP 8
IGR 7
RA 7
Ovadiah_MPI 6
Resafa 5
IEJ 5
IJO 5
ILS 4
ChLA 4
BAAL 2
Lörincz 1
Topoi 1
Old_Syriac_Inscriptions 1
Hajjar 1
Pall 1
Meimaris_Chronological_Systems 1

tuurma commented 4 years ago

After converting the single comma-number pattern matches for selected abbreviations yesterday, today I've prepared the conversion for patterns where there are multiple comma-separated numbers at the end and/or some numbers are in brackets (cases 1 and 2 as discussed here)

I've run the would-be conversion (generating new values but without applying them) for a handful of the most common abbreviations: biblLines.pdf

Looking at these results, I'd suggest we

1. go ahead applying this pattern for "IGLS", "SEG", "CIIP", "IG", "TEAD", "ISyrie", "IMnBeyrouth", "AAES"
2. but refrain from doing so on "PDura", "PNess", "J"

There are no matches for the other most common abbreviations: "ChLA", "RE", "Meimaris_Chronological_Systems", "FRA", "SchiefferACOIndexProsopogr", "DCB", "IPalTertia", "PLRE", "Justi", "IMoab", "PIR2"
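
A "would-be conversion" restricted to selected abbreviations could look roughly like the sketch below. It is written in Python purely for illustration; the name propose, the tuple layout and the regular expression are assumptions, and the J entry in the usage lines is invented to show an excluded abbreviation. Nothing is modified: the generator only reports what the new details and line values would be.

```python
import re

# Abbreviations suggested above for automatic conversion
APPLY = {"IGLS", "SEG", "CIIP", "IG", "TEAD", "ISyrie", "IMnBeyrouth", "AAES"}

NUM = r"\[?\d+\]?"
FINAL_SERIES = re.compile(
    rf"^(?P<details>[^,]*?)\s*,\s*(?P<lines>{NUM}(?:\s*,\s*{NUM})*)$"
)


def propose(entries, apply_to=APPLY):
    """Dry run: yield (abbrev, old_details, new_details, lines) for matches."""
    for abbrev, details in entries:
        if abbrev not in apply_to:
            continue  # e.g. PDura, PNess, J are left for manual review
        m = FINAL_SERIES.match(details.strip())
        if m:
            yield abbrev, details, m.group("details"), m.group("lines")


if __name__ == "__main__":
    entries = [
        ("SEG", "XLI 1530, 8, 75"),
        ("IG", "XI (4) 772, 3, 15"),
        ("J", "BJ 1, 2, 3"),  # invented example of an excluded abbreviation
    ]
    for row in propose(entries):
        print(row)
```

Keeping the allow-list explicit means an abbreviation without line numbers is never touched even if its details happen to end in a series of comma-separated numbers.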

michaelzellmann commented 4 years ago

Thanks, this looks ok for 1. Definitely not “J” in 2., as that is a literary text and has no line numbers. PDura and PNess will be mostly long strings with many line numbers separated by commas, which can be done manually if not automated.

tuurma commented 4 years ago

Thanks for super-fast response, I will run it in the evening then (after 6pm in Oxford and after triggering backup, as usual)

tuurma commented 4 years ago

I've just run the conversion for "IGLS", "SEG", "CIIP", "IG", "TEAD", "ISyrie", "IMnBeyrouth", "AAES"; logs are attached.

Current numbers for entries with the line field filled

IGLS 3660
SEG 3011
IG 696
TEAD 573
IMnBeyrouth 339
CIIP 266
IGerasa 260
IWadi_Haggag 188
ITyr 171
Nessana 149
AAES 142
ISyrie 106
IMSoueida 106
IPalTertia_west 103
SEMA 61
PEQ 49
IIasos 46
DainIGLouvre 44
IDR 38
INegev 37
Dussaud_Macler_Mission 34
MUSJ 34
YCS 32
Mouterde_Limes 31
MFO 31
BCH 27
KayserRecueil 24
PferdehirtMilitärdiplome 23
ISmyrna 22
RMD 21
FroehnerInscrLouvre 21
CIJ 18
IAquil 15
IPalTertia 14
MendelCat 13
IAntMaroc 13
Mittmann_Beiträge 12
JIWE 12
PMasada 11
Clermont_Ganneau_RAO 11
ZDPV 10
SBF 10
CrowfootObjectsfromSamaria 9
Brünnow_Domaszewski_PA 8
DOP 8
IGR 7
RA 7
Ovadiah_MPI 6
Resafa 5
IEJ 5
IJO 5
ILS 4
ChLA 4
BAAL 2
Lörincz 1
Topoi 1
Old_Syriac_Inscriptions 1
Hajjar 1
Pall 1
Meimaris_Chronological_Systems 1

finalcommaseries.pdf

finalcommaseries-log.zip

eXistSolutions / LGPN

[book] indicate l. if only lines differ #284