Open funderburkjim opened 7 years ago
This old/new example will be useful for the explanations that follow:
old:
.{#akaraRa#}100{#akaran2a#}^2¦ auch: mit keiner religiösen Handlung verbunden ,
A1past. S4r. 4 , 1 , 3. [Schµ40] €3
new:
.{#akaraRa#}38{#akaraṇa#}¦ {!2!} auch: mit keiner religiösen Handlung verbunden ,
Āpast. Śr. 4 , 1 , 3. {part=,seq=40,type=,n=3}
This was fairly straightforward for this dictionary, which already used modern conventions for representing Sanskrit words with Latin letters decorated with diacritics.
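As a rough illustration of this conversion, here is a minimal sketch that covers only the letter+digit pairs visible in the examples of this issue (a1, n2, s2, s4, n5). The table name `AS_TO_IAST` and the function `as_to_iast` are made up for this sketch, and the real conversion handles many more codes than are listed here.

```python
import re

# Only the pairs that can be read off the old/new examples in this issue.
# This is NOT the complete AS scheme; unknown pairs are left untouched.
AS_TO_IAST = {
    "a1": "\u0101",  # a with macron
    "A1": "\u0100",  # A with macron
    "n2": "\u1e47",  # n with dot below
    "s2": "\u1e63",  # s with dot below
    "s4": "\u015b",  # s with acute
    "S4": "\u015a",  # S with acute
    "n5": "\u00f1",  # n with tilde
}

# An AS code is an ASCII letter immediately followed by one digit.
_AS_PAIR = re.compile(r"[A-Za-z][0-9]")


def as_to_iast(text):
    """Replace known letter+digit pairs; keep anything not in the table."""
    return _AS_PAIR.sub(lambda m: AS_TO_IAST.get(m.group(0), m.group(0)), text)
```

Digits that follow a space or punctuation (as in "4 , 1 , 3") are not touched, since the pattern requires a letter directly before the digit.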
.{#akaraRa#}100{#akaran2a#}^2¦
-> .{#akaraRa#}38{#akaraṇa#}¦ {!2!}
The second {#..#} part is the original form of the headword (key2). The first {#..#} part is a representation of this original form in SLP1 transliteration, after removing accents and other extraneous material.
The number between these two forms was, in the old form, always '100'. In the new form, this number is the Cologne record ID. The transformation from old to new was done in such a way that this ID was unchanged. However, some old records needed to be split into two or more records to more accurately reflect the entries of the print edition; in these cases, the extra records were assigned a Cologne record ID with decimal fractions.
The advantage of this coding is that it keeps the Cologne record ID stable while still allowing additions to and deletions from the headword list. Something along these lines should probably be introduced for other dictionaries.
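One practical consequence of decimal-fraction IDs is that they should be compared as numbers, not as strings. A minimal sketch, assuming the IDs are available as strings such as "1", "1.1" and "38" (the function name `sort_record_ids` is hypothetical):

```python
from decimal import Decimal


def sort_record_ids(ids):
    """Sort Cologne record IDs numerically, keeping the original strings."""
    # Decimal avoids float rounding and orders "1.01" before "1.1".
    return sorted(ids, key=Decimal)
```

A plain string sort would put "38" before "4", which is why the key converts each ID to a Decimal.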
The third detail regards the homonym number. In the old form, this appears as '^2' and is inside the headword section. In the new form, this appears in new markup as {!2!} and is outside the headword section. The reason for moving this number from inside to outside the headword section is that it does NOT represent a homonym relative to the Schmidt dictionary but rather is a reference to a homonym number in the PWK dictionary. Since the ^2-within-headword coding convention is used in other dictionaries to represent a homonym relative to the current dictionary, I felt that it was less confusing to use a different coding for Schmidt, where this number has a different significance.
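The three details above (SLP1 key, record ID, optional PWK homonym number) can be pulled out of a new-form headword line along these lines. This is a sketch based only on the example lines in this issue; the names `HEADWORD_RE` and `parse_headword_line` are hypothetical.

```python
import re

# New-form headword line: .{#key1#}L{#key2#}¦ {!hom!} rest-of-line
# The {!hom!} part is optional.
HEADWORD_RE = re.compile(
    r"^\.\{#(?P<key1>[^#]*)#\}"        # SLP1 form
    r"(?P<L>[0-9]+(?:\.[0-9]+)?)"      # Cologne record ID, maybe fractional
    r"\{#(?P<key2>[^#]*)#\}¦"          # original headword form
    r"(?: \{!(?P<hom>[0-9]+)!\})?"     # PWK homonym number, if any
    r" ?(?P<rest>.*)$"                 # remainder of the first line
)


def parse_headword_line(line):
    """Return a dict of the headword fields, or None if the line does not match."""
    m = HEADWORD_RE.match(line)
    return m.groupdict() if m else None
```

For a line without {!..!} markup, the `hom` entry of the result is None.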
Sanskrit words (or word fragments) appear in identifiable spots in sch.txt:
In the old form, page breaks [PageX] were placed somewhere in the line of each entry. This random placement makes program processing more difficult. The new form of sch.txt has each page break on a separate line. Although entries in the new digitization may be spread over more than one line, the overall structure is easier to work with.
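Moving page breaks onto their own lines could be done roughly as follows. This sketch assumes that a page break is any bracketed token starting with "[Page" and containing no spaces; the exact shape of the marker in sch.txt is not shown above, so the pattern and the sample marker used with it are assumptions.

```python
import re

# Assumed marker shape: "[Page" + any non-space, non-"]" characters + "]".
PAGE_RE = re.compile(r"\s*(\[Page[^\]\s]*\])\s*")


def isolate_page_breaks(line):
    """Split one line so that each page-break marker becomes its own item."""
    # The capturing group makes re.split keep the markers in the result.
    return [part for part in PAGE_RE.split(line) if part]
```

Each item of the returned list would then be written out as a separate line.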
[Schµ40] €3 section
Each old record ends with one of these exotic phrases. Such a phrase actually codes 4 pieces of information; for our example record the new coding is {part=,seq=40,type=,n=3}.
The meaning of these is
part - This marks records from the supplement (Nachtrag in German). In the example above, the part parameter is the empty string. A record from the supplement looks like this in the old form:
.{#aMSuBartf#}100{#am2s4ubhartr2#}¦ m. Sonne, Kir. XV, 49. [SchNµ28032]º €1
which in the new coding is {part=N,seq=28032,type=º,n=1}, where the value of part is N for Nachtrag.
seq - This is the number inside the brackets (40 in [Schµ40]). The seq field has no further interest, but it seems worth keeping nonetheless.
type - In the aMSuBartf example, [SchNµ28032] is followed by the º symbol. Such a marking is quite common (over 1/3 of records have one). I am not sure of its significance, but it must be kept as part of the information of the dictionary.
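The recoding of the trailing phrase can be sketched as a single regular-expression substitution. The pattern below is inferred from the three examples in this issue; it treats N as the only possible part value and simply copies the number after the € sign into n, and the names `TAIL_RE` and `recode_tail` are hypothetical.

```python
import re

# Old tail: [Sch<part>µ<seq>]<type> €<n>   at the end of the record.
# Both the micro sign (U+00B5) and Greek mu (U+03BC) are accepted, since
# it is an assumption which of the two the file actually uses.
TAIL_RE = re.compile(
    r"\[Sch(?P<part>N?)[\u00b5\u03bc](?P<seq>[0-9]+)\]"
    r"(?P<type>\S*) \u20ac(?P<n>[0-9]+)\s*$"
)


def recode_tail(line):
    """Rewrite an old-style tail phrase as {part=..,seq=..,type=..,n=..}."""
    def repl(m):
        return "{{part={part},seq={seq},type={type},n={n}}}".format(**m.groupdict())
    return TAIL_RE.sub(repl, line)
```

Lines that do not end in such a phrase are returned unchanged.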
The original coding of sch.txt uses a double dash ('--') to represent a long dash; this has been recoded as a Unicode EM-DASH character. This occurs in about 5% of the entries. Generally, it represents a division of the entry, and we have made that assumption in the derivation of sch.xml. However, there are some false positives, in the sense that the em-dash has some other significance. For example:
.{#aMSa#}2{#aṃśa#}¦ 1. {%kenāṃśena%} so v.a. in welchem Stücke? Daśak. 51 , 7. — 8.
Nenner eines Bruches. {part=,seq=3,type=,n=3}
In such instances, it would be better to use some other coding (maybe the original '--' or a single '-') so that the EM-DASH instances that remain always represent divisions.
Replace … character with space
Example:
old:
.{#a#}100{#a#}^4¦ m. º= {%sarvajn5o…'rhan%} , S I , 53 , 3.
-- Vis2n2u , H 31 , 9; Va1s. 113 , 1. [Schµ2]* €2
new:
.{#a#}1.1{#a#}¦ {!4!} m. º= {%sarvajño 'rhan%} , S I , 53 , 3.
— Viṣṇu , H 31 , 9; Vās. 113 , 1. {part=,seq=2,type=*,n=2}
In Malten's first digitization (of the Monier-Williams dictionary), the ellipsis character is used as 'glue' to connect words that form a logical unit. I called such sections 'chunks' for want of a better term. This ellipsis artifice was used in several of the earlier digitizations. However, this usage currently serves no purpose and is confusing. Thus, the ellipsis has been replaced by the humble space character.
This ends the description of changes to the sch.txt digitization.
Some of the changes to the sch.txt digitization carry over into changes to the xml form of the digitization, sch.xml.
Example: 2nd entry of headword 'a'
old:
<H1><h><key1>a</key1><key2>a</key2><hom>4</hom></h>
<body>m. º=
<i>sarvajn5o…'rhan</i> , S I , 53 , 3. -- Vis2n2u , H 31 , 9; Va1s. 113 , 1. [Schµ2]* €2</body>
<tail><L>1</L><pc>001-1</pc></tail></H1>
new:
<H1><h><key1>a</key1><key2>a</key2></h>
<body><hom n="pwk">4</hom> m. º=
<i>sarvajño 'rhan</i> , S I , 53 , 3. <div>— Viṣṇu , H 31 , 9; Vās. 113 , 1. </div>
<type>*</type></body>
<tail><info seq="2" n="2"/><L>1.1</L><pc>001-1</pc></tail></H1>
<info> and <type>
These two new elements code the information in the {part=N,seq=2,type=*,n=2} section.
<info part="N" seq="2" n="2" /> codes the part, seq, and n values. Since these values are all meta-data (not part of the printed text of the entry), this <info> element is put in the <tail> of the xml record. When the 'part' value is the empty string, it is omitted from the attributes (as in the example above).
<type>*</type> codes the 'type' value. It is placed at the end of the <body> element. In case this type value is empty, the <type> element is omitted.
Note that the type value appears as the text within the <type> element (rather than, say, as the value of some attribute); this choice was made since the value (* in our example) actually does occur within the printed text of the entry. It might have been better to put this type element somewhere at the beginning of the <body> element, rather than at the end. However, this was not viewed as a significant decision, since it was felt that the type essentially refers to the entire entry, rather than to a particular word of the entry.
<hom> tag
This is always coded with attribute n="pwk", as a reminder that this number refers to a particular homonym in the PWK dictionary.
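The rules for <info>, <type> and <hom> described above can be sketched as three small helpers. The function names are hypothetical and this is not the code that actually generates sch.xml; it only restates the conventions (omit part when empty, omit <type> when empty, always mark <hom> with n="pwk").

```python
from xml.sax.saxutils import escape, quoteattr


def info_element(part, seq, n):
    """Build the <info/> element for the <tail>; 'part' is omitted when empty."""
    attrs = []
    if part:
        attrs.append(("part", part))
    attrs.append(("seq", seq))
    attrs.append(("n", n))
    return "<info " + " ".join(f"{k}={quoteattr(v)}" for k, v in attrs) + "/>"


def type_element(type_value):
    """Build <type>..</type> for the end of <body>; empty string when no type."""
    return f"<type>{escape(type_value)}</type>" if type_value else ""


def hom_element(number):
    """Build the <hom> element, always tagged as a PWK homonym number."""
    return f'<hom n="pwk">{escape(number)}</hom>'
```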
<div>
<div> elements begin with an EM-DASH and continue to the next EM-DASH (or the end of the record). However, if the <type> element is present, the last <div> ends before the <type> element, as in the example above.
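The <div> rule above amounts to splitting the body text at each EM-DASH. A minimal sketch, assuming the body text has already had its tail phrase removed and needs no further XML escaping (the function name `wrap_divisions` is hypothetical):

```python
EM_DASH = "\u2014"


def wrap_divisions(body_text, type_value=""):
    """Wrap every EM-DASH-initiated segment in <div>..</div>.

    Text before the first EM-DASH stays outside any <div>. If a type value
    is given, <type> is appended after the last </div>, so the last <div>
    ends before it.
    """
    head, *segments = body_text.split(EM_DASH)
    out = head
    for segment in segments:
        out += "<div>" + EM_DASH + segment + "</div>"
    if type_value:
        out += "<type>" + type_value + "</type>"
    return out
```

This treats every EM-DASH as a division, so the false positives mentioned earlier would be wrapped as well.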
As the example shows, the other changes (notably IAST instead of AS) made to sch.txt flow through to sch.xml.
This ends the description of changes to the xml form.
Documentation of changes to Schmidt digitization
Most of the features of the revised digitization (sch.txt) are described here: sch-meta2.txt.
The revised document type definition for the xml form (sch.xml) is here: sch.dtd.