funderburkjim / testing

For testing various features of github. Nothing important here.

test issue for formatting comment on sch #29

Open funderburkjim opened 7 years ago

funderburkjim commented 7 years ago

Documentation of changes to Schmidt digitization

Most of the features of the revised digitization (sch.txt) are described here: sch-meta2.txt.

The revised document type definition for the xml form (sch.xml) is here: sch.dtd.

funderburkjim commented 7 years ago

Changes to sch.txt

This old/new example will be useful for the explanations that follow:

old:
.{#akaraRa#}100{#akaran2a#}^2¦ auch: mit keiner religiösen Handlung verbunden , 
A1past. S4r. 4 , 1 , 3.  [Schµ40] €3

new:
.{#akaraRa#}38{#akaraṇa#}¦ {!2!}  auch: mit keiner religiösen Handlung verbunden , 
Āpast. Śr. 4 , 1 , 3.  {part=,seq=40,type=,n=3}

change to IAST from AS

This was fairly straightforward for this dictionary, which already used modern conventions for representing Sanskrit words with Latin letters decorated with diacritics.
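As a rough illustration of the AS to IAST step, here is a minimal Python sketch. The mapping table is only a partial one, inferred from the old/new examples in this thread (a1, A1, n2, s2, n5, S4); the function name `as_to_iast` is hypothetical, and the actual conversion program may well work differently.

```python
import re

# Partial AS -> IAST table, inferred only from the examples in this thread.
# A real conversion would need the complete Anglicized Sanskrit table.
AS_TO_IAST = {
    "a1": "ā",
    "A1": "Ā",
    "n2": "ṇ",
    "s2": "ṣ",
    "n5": "ñ",
    "S4": "Ś",
}

def as_to_iast(text: str) -> str:
    """Replace letter+digit pairs found in the table; leave all else unchanged."""
    return re.sub(
        r"[A-Za-z][0-9]",
        lambda m: AS_TO_IAST.get(m.group(0), m.group(0)),
        text,
    )

print(as_to_iast("A1past. S4r. 4 , 1 , 3."))
print(as_to_iast("Vis2n2u , H 31 , 9; Va1s. 113 , 1."))
```

Because only a letter immediately followed by a digit is considered, free-standing numbers such as the citation numbers are not touched.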

change of headword format

.{#akaraRa#}100{#akaran2a#}^2¦ -> .{#akaraRa#}38{#akaraṇa#}¦ {!2!}

The second {#..#} part is the original form of the headword (key2). The first {#..#} part is a representation of this original form in SLP1 transliteration, with accents and other extraneous material removed.

The number between these two forms was, in the old form, always '100'. In the new form, this number is the Cologne record ID. The transformation from old to new was done in such a way that this ID was unchanged. However, some old records needed to be split into two or more records to more accurately reflect the entries of the print edition; in these cases, the extra records were assigned a Cologne record ID with decimal fractions.

The advantage of this coding is that it keeps the Cologne record ID stable while still allowing additions to and deletions from the headword list. Something along these lines should probably be introduced for other dictionaries.

The third detail regards the homonym number. In the old form, this appears as '^2' and is inside the headword section. In the new form, this appears in new markup as {!2!} and is outside the headword section. The reason for moving this number from inside to outside the headword section is that it does NOT represent a homonym relative to the Schmidt dictionary but rather is a reference to a homonym number in the PWK dictionary. Since the ^2-within-headword coding convention is used in other dictionaries to represent a homonym relative to the current dictionary, I felt that it was less confusing to use a different coding for Schmidt, where this number has a different significance.
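The new headword format can be read mechanically. The following Python sketch is one way to do it, assuming the layout is exactly as in the examples in this thread (leading period, key1, record ID, key2, broken bar, optional {!N!}); the names `HEAD_RE` and `parse_head` are hypothetical. The record ID is kept as a `Decimal` so that IDs with decimal fractions sort between their integer neighbours.

```python
import re
from decimal import Decimal

# Layout assumed from the examples: .{#key1#}ID{#key2#}¦ {!hom!}
HEAD_RE = re.compile(
    r"^\.\{#(?P<key1>[^#]*)#\}"      # SLP1 form of the headword
    r"(?P<L>\d+(?:\.\d+)?)"          # Cologne record ID, possibly fractional
    r"\{#(?P<key2>[^#]*)#\}¦"        # original form of the headword
    r"(?:\s*\{!(?P<hom>\d+)!\})?"    # optional PWK homonym number
)

def parse_head(line: str):
    """Return the headword fields of a record line, or None if it does not match."""
    m = HEAD_RE.match(line)
    if m is None:
        return None
    hom = m.group("hom")
    return {
        "key1": m.group("key1"),
        "L": Decimal(m.group("L")),
        "key2": m.group("key2"),
        "pwk_hom": int(hom) if hom is not None else None,
    }

print(parse_head(".{#akaraRa#}38{#akaraṇa#}¦ {!2!}  auch: mit keiner religiösen Handlung verbunden ,"))
print(parse_head(".{#a#}1.1{#a#}¦ {!4!}  m."))
```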

Recognizing Sanskrit words

Sanskrit words (or word fragments) appear in identifiable spots in sch.txt:

replacement of Page-breaks

In the old form, page breaks [PageX] were placed somewhere in the line of each entry. This random placement makes program processing more difficult. The new form of sch.txt has each page break on a separate line. Although entries in the new digitization may be spread over more than one line, the overall structure is easier to work with.
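A minimal Python sketch of the idea follows. It assumes that a page-break marker is any bracketed text starting with "Page" (the exact form of X is not spelled out here, and the marker in the usage line is made up), and it simply cuts the line around each marker; `isolate_page_breaks` is a hypothetical name, not the program actually used.

```python
import re

# Assumption: a page break looks like [Page...] with no nested brackets.
PAGE_RE = re.compile(r"(\[Page[^\]]*\])")

def isolate_page_breaks(line: str) -> list:
    """Split a line so that each page-break marker ends up on its own line."""
    # re.split keeps the markers because the pattern has a capturing group.
    parts = PAGE_RE.split(line)
    return [p.strip() for p in parts if p.strip()]

# Hypothetical input line with an embedded marker.
for out in isolate_page_breaks("first part of entry [Page1-001] rest of entry"):
    print(out)
```

A line without a marker comes back as a single-element list, so the function can be applied to every line of the file.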

recoding of [Schµ40] €3 section

Each old record ends with one of these exotic phrases. Such a phrase actually codes 4 pieces of information; for our example record the new coding is {part=,seq=40,type=,n=3}. The meaning of these is

EM-DASH

The original coding of sch.txt uses a double-dash to represent a long dash; this has been recoded as a Unicode EM-DASH character. This occurs in about 5% of the entries. Generally, it represents a division of the entry, and we have made that assumption in the derivation of sch.xml. However, there are some false positives, in the sense that the em-dash has some other significance. For example:

     .{#aMSa#}2{#aṃśa#}¦  1. {%kenāṃśena%} so v.a. in welchem Stücke? Daśak. 51 , 7. — 8. 
       Nenner eines Bruches. {part=,seq=3,type=,n=3}
In such instances, it would be better to use some other coding (maybe the original '--' or a single '-'),
so that the EM-DASH instances that remain always represent divisions.

replace ellipsis character with space

Example:

old:
.{#a#}100{#a#}^4¦ m. º= {%sarvajn5o…'rhan%} , S I , 53 , 3. 
 -- Vis2n2u , H 31 , 9; Va1s. 113 , 1. [Schµ2]* €2
new:
.{#a#}1.1{#a#}¦ {!4!}  m. º= {%sarvajño 'rhan%} , S I , 53 , 3. 
— Viṣṇu , H 31 , 9; Vās. 113 , 1. {part=,seq=2,type=*,n=2}

In Malten's first digitization, that of the Monier-Williams dictionary, the ellipsis character is used as 'glue' to connect words that form a logical unit. I called such sections 'chunks' for want of a better term. This ellipsis artifice was used in several of the earlier digitizations. However, this usage serves no purpose currently, and is confusing. Thus, the ellipsis has been replaced by the humble space character.
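The two punctuation recodings described so far (double-dash to EM-DASH, ellipsis to space) amount to plain string replacements. A minimal Python sketch, with the hypothetical function name `recode_punctuation`, assuming that every '--' in a record is a long dash and every ellipsis character is 'glue':

```python
EM_DASH = "\u2014"   # Unicode EM DASH
ELLIPSIS = "\u2026"  # Unicode HORIZONTAL ELLIPSIS

def recode_punctuation(text: str) -> str:
    """Replace the 'glue' ellipsis with a space and '--' with an em-dash."""
    return text.replace(ELLIPSIS, " ").replace("--", EM_DASH)

old = "m. º= {%sarvajn5o\u2026'rhan%} , S I , 53 , 3. -- Vis2n2u , H 31 , 9"
print(recode_punctuation(old))
```

This does not attempt to detect the false-positive dashes mentioned earlier; those would still need to be reviewed by hand.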

This ends the description of changes to the sch.txt digitization.

funderburkjim commented 7 years ago

changes to the xml form, sch.xml

Some of the changes to the sch.txt digitization carry over into changes to the xml form of the digitization, sch.xml.

Example: 2nd entry of headword 'a'

old:
<H1><h><key1>a</key1><key2>a</key2><hom>4</hom></h>
  <body>m. º= 
<i>sarvajn5o…'rhan</i> , S I , 53 , 3. -- Vis2n2u , H 31 , 9; Va1s. 113 , 1. [Schµ2]* €2</body>
<tail><L>1</L><pc>001-1</pc></tail></H1>

new: 
<H1><h><key1>a</key1><key2>a</key2></h>
<body><hom n="pwk">4</hom>  m. º= 
<i>sarvajño 'rhan</i> , S I , 53 , 3. <div>— Viṣṇu , H 31 , 9; Vās. 113 , 1. </div>
<type>*</type></body>
<tail><info seq="2" n="2"/><L>1.1</L><pc>001-1</pc></tail></H1>

<info> and <type>

These two new elements code the information in the {part=N,seq=2,type=*,n=2} section.
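The braces section at the end of a sch.txt record can be picked apart with a regular expression. This Python sketch assumes the four fields always appear in the order part, seq, type, n, as they do in the examples in this thread; `INFO_RE` and `parse_info` are hypothetical names, and all values are returned as strings since empty values occur.

```python
import re

# Assumption: fields are always in this order and values contain no ',' or '}'.
INFO_RE = re.compile(
    r"\{part=(?P<part>[^,}]*),seq=(?P<seq>[^,}]*),"
    r"type=(?P<type>[^,}]*),n=(?P<n>[^,}]*)\}\s*$"
)

def parse_info(record: str):
    """Return the four fields of the trailing {part=..,seq=..,type=..,n=..} section."""
    m = INFO_RE.search(record)
    return m.groupdict() if m else None

print(parse_info("Āpast. Śr. 4 , 1 , 3.  {part=,seq=40,type=,n=3}"))
print(parse_info("Vās. 113 , 1. {part=,seq=2,type=*,n=2}"))
```

In the xml example above, the seq and n values appear as attributes of <info> and the type value as the content of <type>.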

Homonym <hom> tag

This is always coded with attribute n="pwk", as a reminder that this number refers to a particular homonym in the PWK dictionary.
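For readers who want to check these elements programmatically, here is a small Python sketch that parses the 'new' example record with the standard library. The element paths are taken from that one example only; the helper name `summarize` is hypothetical, and sch.dtd remains the authority on what is allowed.

```python
import xml.etree.ElementTree as ET

# The 'new' example record from this comment, joined into one string.
RECORD = (
    '<H1><h><key1>a</key1><key2>a</key2></h>'
    '<body><hom n="pwk">4</hom>  m. º= '
    "<i>sarvajño 'rhan</i> , S I , 53 , 3. "
    '<div>\u2014 Viṣṇu , H 31 , 9; Vās. 113 , 1. </div>'
    '<type>*</type></body>'
    '<tail><info seq="2" n="2"/><L>1.1</L><pc>001-1</pc></tail></H1>'
)

def summarize(record_xml: str) -> dict:
    """Collect the headword, record ID, homonym, <info> and <type> of one record."""
    root = ET.fromstring(record_xml)
    hom = root.find("body/hom")
    info = root.find("tail/info")
    return {
        "key1": root.findtext("h/key1"),
        "L": root.findtext("tail/L"),
        "hom": None if hom is None else (hom.get("n"), hom.text),
        "info": {} if info is None else dict(info.attrib),
        "type": root.findtext("body/type"),
    }

print(summarize(RECORD))
```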

<div>

<div> elements begin with an EM-DASH and continue to the next EM-DASH (or the end of the record). However, if the <type> element is present, the last <div> ends before the <type> element, as in the example above.
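The <div> rule can be sketched in a few lines of Python. This is only an illustration under the assumption that every EM-DASH marks a division; it does no XML escaping, and `build_body` is a hypothetical name rather than the program that actually generates sch.xml.

```python
EM_DASH = "\u2014"

def build_body(text: str, type_: str = "") -> str:
    """Wrap each em-dash-initiated segment in <div>, then append <type> if present."""
    head, *segments = text.split(EM_DASH)
    body = head + "".join(f"<div>{EM_DASH}{seg}</div>" for seg in segments)
    if type_:
        # The last <div> is already closed, so <type> follows it.
        body += f"<type>{type_}</type>"
    return body

text = "m. º= x , S I , 53 , 3. " + EM_DASH + " Viṣṇu , H 31 , 9; Vās. 113 , 1. "
print(build_body(text, "*"))
```

Text before the first EM-DASH is left outside any <div>, which matches the example record.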

other differences

As the example shows, the other changes (notably IAST instead of AS) made to sch.txt flow through to sch.xml.

This ends the description of changes to the xml form.