test issue for formatting comment on sch

Changes to sch.txt

This old/new example will be useful for the explanations that follow:

old:
.{#akaraRa#}100{#akaran2a#}^2¦ auch: mit keiner religiösen Handlung verbunden , 
A1past. S4r. 4 , 1 , 3.  [Schµ40] €3

new:
.{#akaraRa#}38{#akaraṇa#}¦ {!2!}  auch: mit keiner religiösen Handlung verbunden , 
Āpast. Śr. 4 , 1 , 3.  {part=,seq=40,type=,n=3}

change to IAST from AS

This was fairly straightforward for this dictionary, which already used modern conventions for representing Sanskrit words with Latin letters decorated with diacritics.
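As a rough illustration of the recoding, here is a minimal Python sketch. The table is deliberately partial: it holds only the letter+digit pairs that can be read off the old/new examples in this write-up, and the function name `as_to_iast` is hypothetical, not the name of the script actually used.

```python
import re

# Partial AS -> IAST table. Only pairs attested in the examples of this
# write-up are listed; the real conversion needs the full AS scheme
# (assumption: further codes would be added in the same way).
AS_TO_IAST = {
    "a1": "ā", "A1": "Ā",
    "S4": "Ś",
    "n2": "ṇ", "s2": "ṣ",
    "n5": "ñ",
}

def as_to_iast(text):
    """Replace each known letter+digit AS code; leave anything else as-is."""
    return re.sub(
        r"[A-Za-z][0-9]",
        lambda m: AS_TO_IAST.get(m.group(0), m.group(0)),
        text,
    )

print(as_to_iast("A1past. S4r. 4 , 1 , 3."))
```

Digits that do not directly follow a letter (such as the citation numbers) are not touched, since the pattern only matches a letter immediately followed by a digit.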

change of headword format

.{#akaraRa#}100{#akaran2a#}^2¦ -> .{#akaraRa#}38{#akaraṇa#}¦ {!2!}

The second {#..#} part is the original form of the headword (key2). The first {#..#} part (key1) is this original form rendered in SLP1 transliteration, with accents and other extraneous material removed.

The number between these two forms was always '100' in the old form. In the new form, this number is the Cologne record ID. The transformation from old to new was done in such a way that existing IDs were left unchanged. However, some old records needed to be split into two or more records to more accurately reflect the entries of the print edition; in these cases, the extra records were assigned a Cologne record ID with a decimal fraction (for example, 1.1 for the second entry of headword 'a').

The advantage of this coding is that it keeps the Cologne record ID stable while still allowing additions to and deletions from the headword list. Something along these lines should probably be introduced for other dictionaries.

The third detail regards the homonym number. In the old form, this appears as '^2' and is inside the headword section. In the new form, this appears in new markup as {!2!} and is outside the headword section. The reason for the move is that the number does NOT represent a homonym relative to the Schmidt dictionary; rather, it refers to a homonym number in the PWK dictionary. Since the ^2-within-headword coding convention is used in other dictionaries to represent a homonym relative to the current dictionary, I felt that it was less confusing to use a different coding for Schmidt, where this number has a different significance.
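To make the new headword layout concrete, here is a minimal Python sketch that pulls the pieces out of a new-style line. The pattern is inferred only from the examples in this write-up, and the names `NEW_HEAD` and `parse_new_head` are hypothetical; real data may contain variations this does not cover.

```python
import re

# Pattern inferred from the examples here (an assumption, not a spec):
#   .{#key1#}ID{#key2#}¦ {!hom!}  body...
NEW_HEAD = re.compile(
    r"^\.\{#(?P<key1>[^#]*)#\}"      # key1, in SLP1
    r"(?P<L>\d+(?:\.\d+)?)"          # Cologne record ID, possibly fractional
    r"\{#(?P<key2>[^#]*)#\}¦"        # key2, in IAST, then the broken bar
    r"\s*(?:\{!(?P<hom>\d+)!\})?"    # optional PWK homonym number
    r"\s*(?P<body>.*)$",
    re.DOTALL,
)

def parse_new_head(line):
    """Return key1, L, key2, hom (or None) and body as a dict of strings."""
    m = NEW_HEAD.match(line)
    if m is None:
        raise ValueError("not a new-style headword line")
    return m.groupdict()

rec = parse_new_head(".{#akaraRa#}38{#akaraṇa#}¦ {!2!}  auch: mit keiner ...")
print(rec["key1"], rec["L"], rec["key2"], rec["hom"])
```

Keeping the record ID as a string avoids any floating-point surprises with fractional IDs such as 1.1.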

Recognizing Sanskrit words.

Sanskrit words (or word fragments) appear in identifiable spots in sch.txt:

key1 field, coded in SLP1 transliteration
key2 field, coded in IAST
italicized text (coded as {%..%} in sch.txt) - it seems that italicized text contains only Sanskrit words (or word fragments). This is peculiar to Schmidt, but it greatly simplifies the identification of Sanskrit words within the body of entries.
Literary source abbreviations. There is currently no markup in Schmidt for literary source references. But most start with an abbreviation of the title of a work, in IAST spelling.

replacement of Page-breaks

In the old form, page breaks [PageX] were placed somewhere in the line of each entry. This random placement makes program processing more difficult. The new form of sch.txt has each page break on a separate line. Although entries in the new digitization may be spread over more than one line, the overall structure is easier to work with.
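A minimal Python sketch of this kind of separation follows. It assumes a page break is any bracketed token starting with `[Page`; the concrete token `[Page001-2]` in the usage line is made up for illustration, and where the actual conversion placed the page-break line relative to the entry is not shown here.

```python
import re

# Assumption: a page break is a bracketed token beginning with "[Page".
PAGE_BREAK = re.compile(r"(\[Page[^\]]*\])")

def isolate_page_breaks(line):
    """Split a line so that every page-break token stands on its own line."""
    parts = PAGE_BREAK.split(line)
    # Drop empty fragments and trim the whitespace left around the token.
    return [p.strip() for p in parts if p.strip()]

for out in isolate_page_breaks("first half [Page001-2] second half"):
    print(out)
```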

recoding of `[Schµ40] €3` section

Each old record ends with one of these exotic phrases, which codes four pieces of information. For our example record the new coding is {part=,seq=40,type=,n=3}. The meaning of each field is:

part: The Schmidt dictionary has two parts, the main part and a supplement (Nachtrag in German).
Our example is from the main part, so the value of this part parameter is the empty string.
The first entry of the supplement is
```
.{#aMSuBartf#}100{#am2s4ubhartr2#}¦ m. Sonne, Kir. XV, 49. [SchNµ28032]º €1
```
which in the new coding is {part=N,seq=28032,type=º,n=1}, where the value of part is N for Nachtrag.
seq: This is the sequence number assigned by Malten in the original digitization; it generally represents an enumeration of the entries in the whole book, main part and supplement. This number is very close to the current Cologne record ID. The seq field is probably of no further interest, but it seems worth keeping nonetheless.
type: In the aMSuBartf example, [SchNµ28032] is followed by the º symbol, which is the type value. Such a marking is quite common (over 1/3 of records have one). I am not sure of its significance, but it must be kept as part of the information of the dictionary.
n: This is the number of lines for the entry in the printed edition. Although the line breaks of the print edition are not preserved, this number could be useful. For instance, it could help locate the position of an entry on the printed page.
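The four fields can be extracted mechanically. The following Python sketch shows one way to do the recoding; the pattern is inferred from the three old-form examples in this write-up only, and `OLD_TAIL` and `recode_tail` are hypothetical names.

```python
import re

# Inferred from the examples here (assumption): [Sch + optional N + µ + seq],
# then an optional type symbol, then the euro sign and the line count.
OLD_TAIL = re.compile(
    r"\[Sch(?P<part>N?)[µμ](?P<seq>\d+)\]"  # accept micro sign or Greek mu
    r"(?P<type>\S*)\s*€(?P<n>\d+)\s*$"
)

def recode_tail(line):
    """Rewrite the old end-of-record phrase as {part=,seq=,type=,n=}."""
    def repl(m):
        return "{part=%s,seq=%s,type=%s,n=%s}" % (
            m.group("part"), m.group("seq"), m.group("type"), m.group("n"))
    return OLD_TAIL.sub(repl, line)

print(recode_tail("m. Sonne, Kir. XV, 49. [SchNµ28032]º €1"))
```

Lines without a recognizable phrase are returned unchanged, which makes it easy to list the records that need manual attention.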

EM-DASH

The original coding of sch.txt uses a double-dash ('--') to represent a long dash; this has been recoded as a Unicode EM-DASH character. It occurs in about 5% of the entries. Generally, it marks a division of the entry, and we have made that assumption in the derivation of sch.xml. However, there are some false positives, where the em-dash has some other significance. For example:

     .{#aMSa#}2{#aṃśa#}¦  1. {%kenāṃśena%} so v.a. in welchem Stücke? Daśak. 51 , 7. — 8. 
       Nenner eines Bruches. {part=,seq=3,type=,n=3}

In such instances, it would be better to use some other coding (maybe the original '--' or a single '-') so that the EM-DASH instances that remain always represent divisions.

replace ellipsis `…` character with space

Example:

old:
.{#a#}100{#a#}^4¦ m. º= {%sarvajn5o…'rhan%} , S I , 53 , 3. 
 -- Vis2n2u , H 31 , 9; Va1s. 113 , 1. [Schµ2]* €2
new:
.{#a#}1.1{#a#}¦ {!4!}  m. º= {%sarvajño 'rhan%} , S I , 53 , 3. 
— Viṣṇu , H 31 , 9; Vās. 113 , 1. {part=,seq=2,type=*,n=2}

In Malten's first digitization, of the Monier-Williams dictionary, the ellipsis character is used as 'glue' to connect words that form a logical unit. I called such sections 'chunks' for want of a better term. This ellipsis artifice was used in several of the earlier digitizations. However, this usage currently serves no purpose and is confusing. Thus, the ellipsis has been replaced by the humble space character.

This ends the description of changes to the sch.txt digitization.

changes to the xml form, sch.xml

Some of the changes to the sch.txt digitization carry over into changes to the xml form of the digitization, sch.xml.

Example: 2nd entry of headword 'a'

old:
<H1><h><key1>a</key1><key2>a</key2><hom>4</hom></h>
  <body>m. º= 
<i>sarvajn5o…'rhan</i> , S I , 53 , 3. -- Vis2n2u , H 31 , 9; Va1s. 113 , 1. [Schµ2]* €2</body>
<tail><L>1</L><pc>001-1</pc></tail></H1>

new: 
<H1><h><key1>a</key1><key2>a</key2></h>
<body><hom n="pwk">4</hom>  m. º= 
<i>sarvajño 'rhan</i> , S I , 53 , 3. <div>— Viṣṇu , H 31 , 9; Vās. 113 , 1. </div>
<type>*</type></body>
<tail><info seq="2" n="2"/><L>1.1</L><pc>001-1</pc></tail></H1>

`<info>` and `<type>`

These two new elements code the information in the {part=N,seq=2,type=*,n=2} section.

<info part="N" seq="2" n="2" /> codes the part, seq, and n values. Since these values are all meta-data (not part of the printed text of the entry), this <info> element is put in the <tail> of the xml record. When the 'part' value is the empty string, the part attribute is omitted (as in the example above).
<type>*</type> codes the 'type' value. It is placed at the end of the <body> element. When the type value is empty, the <type> element is omitted. Note that the type value appears as the text within the <type> element (rather than, say, as the value of some attribute); this choice was made since the value (* in our example) actually does occur within the printed text of the entry. It might have been better to put this type element somewhere at the beginning of the <body> element, rather than at the end. However, this was not viewed as a significant decision, since it was felt that the type essentially refers to the entire entry, rather than to a particular word of the entry.
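The two rules above (omit an empty part, omit an empty type) can be sketched in Python as follows. This is only an illustration of the mapping described here, not the code that generates sch.xml; `META` and `meta_to_xml` are hypothetical names.

```python
import re
from xml.sax.saxutils import escape, quoteattr

META = re.compile(
    r"\{part=(?P<part>[^,]*),seq=(?P<seq>[^,]*),"
    r"type=(?P<type>[^,]*),n=(?P<n>[^}]*)\}"
)

def meta_to_xml(meta):
    """Return (info_element, type_element) for a {part=,seq=,type=,n=} string."""
    m = META.fullmatch(meta)
    if m is None:
        raise ValueError("unexpected meta-data section: %r" % meta)
    attrs = []
    if m.group("part"):                      # empty part: attribute omitted
        attrs.append("part=%s" % quoteattr(m.group("part")))
    attrs.append("seq=%s" % quoteattr(m.group("seq")))
    attrs.append("n=%s" % quoteattr(m.group("n")))
    info = "<info %s/>" % " ".join(attrs)
    # empty type: no <type> element at all
    type_elt = "<type>%s</type>" % escape(m.group("type")) if m.group("type") else ""
    return info, type_elt

print(meta_to_xml("{part=,seq=2,type=*,n=2}"))
```

The caller would then append the type element to the end of the <body> and place the info element in the <tail>.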

Homonym `<hom>` tag

This is always coded with attribute n="pwk", as a reminder that this number refers to a particular homonym in the PWK dictionary.

`<div>`

<div> elements begin with an EM-DASH and continue to the next EM-DASH (or the end of the record). However, if the <type> element is present, the last <div> ends before the <type> element, as in the example above.
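A minimal Python sketch of this rule follows, assuming the body text is passed in without the <type> element (which the caller would append afterwards). The function name `add_divs` is hypothetical, and the sketch treats every EM-DASH as a division, so it inherits the false positives noted in the EM-DASH section.

```python
EM_DASH = "\u2014"

def add_divs(body):
    """Wrap each EM-DASH-initial segment of the body in a <div> element."""
    head, *rest = body.split(EM_DASH)
    # Text before the first EM-DASH stays outside any <div>; each later
    # segment gets its EM-DASH back and is wrapped.
    return head + "".join("<div>%s%s</div>" % (EM_DASH, seg) for seg in rest)

print(add_divs("S I , 53 , 3. \u2014 Vi\u1e63\u1e47u , H 31 , 9; V\u0101s. 113 , 1. "))
```

A body with no EM-DASH is returned unchanged.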

other differences

As the example shows, the other changes made to sch.txt (notably IAST instead of AS) flow through to sch.xml.

This ends the description of changes to the xml form.

funderburkjim / testing