funderburkjim / testing

For testing various features of github. Nothing important here.

alternate headwords for acc #30

Open funderburkjim opened 7 years ago

funderburkjim commented 7 years ago

We have previously added alternate headwords for dictionaries AP90 and SKD.

We now want to add alternate headwords for ACC and other dictionaries.

ACC.txt has been enhanced to contain a meta line with L-codes. This will make the process different from that used for AP90 and SKD.

funderburkjim commented 7 years ago

parseheadline.py

This is a nice routine Dhaval developed to parse strings of the form <key1>val1...<keyn>valn.

It returns a dictionary d, with d['key1']=val1, etc.

The routine is quite general.

It is used to parse the meta line of acc.txt.

I think it can also be used below to parse acc_hwextra.txt.
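To make the string format concrete, here is a minimal sketch of such a parser. It is not Dhaval's parseheadline.py; the function name `parse_headline` and the regular expression are assumptions chosen only for illustration.

```python
import re

def parse_headline(line):
    """Parse '<key1>val1...<keyn>valn' into a dict (illustrative sketch only)."""
    # Each field is a '<key>' tag followed by the text up to the next '<'.
    return dict(re.findall(r'<([^<>]+)>([^<]*)', line))

# A meta line as it appears in acc.txt
meta = parse_headline('<L>1<pc>1-001,1<k1>aMSadaSA<k2>aMSadaSA')
print(meta['L'], meta['pc'], meta['k1'])

# The same routine applied to a line in the acc_hwextra.txt style
extra = parse_headline(
    '<LP>12<key1P>akzapAda<L>12.1<key1>akzacaraRa<key2>akzacaraRa<type>alt')
print(extra['LP'], extra['L'], extra['key1'], extra['type'])
```

This sketch assumes that values never contain the character '<'; the real routine may handle more cases.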

funderburkjim commented 7 years ago

L-numbers identify entries

Our primary understanding of dictionary structure is that a dictionary is composed of a sequence of entries, and each entry has as its primary identity a headword.

The headword in general does not completely specify an entry, because there may be multiple entries with the same headword. This fact is the reason that it is necessary to have a separate entry identifier (which we have called the L number). The L-number uniquely specifies an entry. The L-numbers for a dictionary should have further properties:

funderburkjim commented 7 years ago

alternate headwords per entry

In many dictionaries, a given entry will have two or more associated headwords. For example, in the SKD dictionary we see kube(ve)raH, whose printed version starts out as: [image]

Our interpretation is that there are two alternate spellings for the headword of this entry: kuberaH and kuveraH.

In the digitization skd.txt there is only one entry.

We think of the 'kuberaH' spelling as the primary headword spelling for this entry.

In order to represent the fact that this entry should be accessible with the other spelling, our approach is to represent this alternate within the skdhw2.txt file associated with the digitization:

2-144:kuberaH:71062,71072:8094
2-144:kuveraH:71062,71072:8094.01:alt

In this case, the L-number associated with the primary headword spelling is 8094.

The 'synthetic' L-number associated with the alternate headword spelling is 8094.01, and we have qualified this with the :alt property to indicate that this 'extra' headword is of the type 'alternate'.

The pair of numbers 71062,71072 represents the range of lines in the digitization skd.txt that represents the entry. Notice:

funderburkjim commented 7 years ago

subheadwords: extra headwords which are not alternates of the primary headword

In many dictionaries, some entries have sub-headwords. Think of PWG, which has the prefixed-verb forms nestled within the entry for a non-prefixed root. Or think of STC or other dictionaries where there are compounds of the primary headword (e.g., look up 'deva' in STC, where compounds like 'deva-karman', 'devarzi', etc. appear as subheadwords).

We have not yet attempted to represent these subheadwords. For instance, you can't directly access prefixed root avagam in PWG, since it appears as a subheadword (-- ava) under gam.

The above scheme used for representing alternate headwords in the xxxhw2.txt file is sufficiently general to be able to represent subheadwords as alternate pointers to the entry under which the subheadword appears. For example, pwghw2.txt could contain:

2-0666:gam:46226,46359:21814
2-0666:avagam:46226,46359:21814.15;sub    <<< .15 is just a guess as to which subheadword this is.

This representation would be weak in that it does not specify where in the 'gam' entry the prefixed form avagam occurs; but it would at least represent that avagam is mentioned somewhere in the gam entry.

funderburkjim commented 7 years ago

data sources for acchw2.txt

To implement alternate headwords for a dictionary in the manner suggested above, we need for the xxxhw2.txt file to include the alternates, intermingled with the non-alternates.

For acc, the non-alternates are derived directly from the meta lines of the acc.txt digitization.

acchw2.txt is to be constructed from two data sources:

funderburkjim commented 7 years ago

fields of acchw2.txt

There is an optional additional field, reserved for the 'type' code of an extra headword.
The fields are:

funderburkjim commented 7 years ago

example of acchw2.txt for primary entry

First few lines of acc.txt. Note: we show the acc.txt line numbers here; they are not part of acc.txt.

000001 [Page1-001-a+ 36]
000002 <H>CATALOGUS CATALOGORUM. 
000003 <L>1<pc>1-001,1<k1>aMSadaSA<k2>aMSadaSA
000004 {#aMSadaSA#}¦ jy. Rice 28.
000005 <LEND>

First primary record of acchw2.txt as derived from lines 3-5 of acc.txt:


1-001,1:aMSadaSA:3,5:1

pagecol = 1-001,1   (copied from `<pc>` field of meta line)
key1 = aMSadaSA  (copied from `<k1>` field of meta line)
line1,line2 = 3,5  (from line numbers of entry within acc.txt)
> Note:  this includes the opening meta line (line3) and the closing meta line (line5)
L = 1   (copied from `<L>` field of meta line)
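
The derivation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `parse_meta` and `primary_records` are hypothetical, and the sketch assumes every entry opens with a line starting with `<L>` and closes with a `<LEND>` line, as in the sample.

```python
import re

def parse_meta(line):
    """Parse an '<L>..<pc>..<k1>..<k2>..' meta line into a dict (sketch)."""
    return dict(re.findall(r'<([^<>]+)>([^<]*)', line))

def primary_records(lines):
    """Yield one 'pagecol:key1:line1,line2:L' record per entry.

    `lines` is the text of the digitization, one string per line;
    line numbers are counted from 1, as in the listing above.
    """
    start, meta = None, None
    for lnum, line in enumerate(lines, start=1):
        if line.startswith('<L>'):
            start, meta = lnum, parse_meta(line)   # opening meta line
        elif line.startswith('<LEND>') and meta is not None:
            # The range includes both the opening and closing meta lines.
            yield '%s:%s:%d,%d:%s' % (meta['pc'], meta['k1'], start, lnum, meta['L'])
            meta = None

acc_lines = [
    '[Page1-001-a+ 36]',
    '<H>CATALOGUS CATALOGORUM. ',
    '<L>1<pc>1-001,1<k1>aMSadaSA<k2>aMSadaSA',
    '{#aMSadaSA#}¦ jy. Rice 28.',
    '<LEND>',
]
records = list(primary_records(acc_lines))
print(records)
```
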
funderburkjim commented 7 years ago

example of acchw2.txt for an alternate headword

Although we have not yet developed lists of alternate headwords for acc, an examination of the first few lines of acc.txt provides an example of a likely alternate headword:

000039 <L>12<pc>1-001,1<k1>akzapAda<k2>akzapAda
000040 {#akzapAda#}¦ or {#akzacaraRa,#} a name of Gautama, the philo-
000041 <>sopher, Hall p. 20.
000042 <LEND>

First, the acchw2.txt entry for the primary headword will be:

1-001,1:akzapAda:39,42:12

The alternate headword is akzacaraRa.

To make the acchw2.txt entry for the alternate headword, we assume that file acc_hwextra.txt has this line (which we format similarly to the meta lines):

<LP>12<key1P>akzapAda<L>12.1<key1>akzacaraRa<key2>akzacaraRa<type>alt

The LP field value (12) is matched against the primary L-numbers, so we know the primary acchw2 record is 1-001,1:akzapAda:39,42:12 as shown above.

Thus the acchw2 record for the alternate will have fields:

Putting these fields together gives the acchw2.txt record:

1-001,1:akzacaraRa:39,42:12.1:alt

As for the position of this alternate record within the sequence of records of acchw2, it will go in numerical order of L-number:

1-001,1:akzapAda:39,42:12
1-001,1:akzacaraRa:39,42:12.1:alt
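
As a rough sketch of how the alternate record could be built and positioned: the pagecol and line range are taken from the primary record matched by LP, and the key, L-number and type come from the acc_hwextra.txt line. The helper names (`parse_meta`, `alternate_record`, `l_number`) are hypothetical, and treating L-numbers as decimal numbers is an assumption about what "numerical order" means here.

```python
import re
from decimal import Decimal

def parse_meta(line):
    """Parse '<key>val<key>val...' into a dict (sketch)."""
    return dict(re.findall(r'<([^<>]+)>([^<]*)', line))

def alternate_record(extra_line, primary_by_L):
    """Build the hw2 record for one acc_hwextra.txt line."""
    x = parse_meta(extra_line)
    # Page/column and line range are inherited from the primary record.
    pagecol, _key1, linerange, _L = primary_by_L[x['LP']].split(':')
    return ':'.join([pagecol, x['key1'], linerange, x['L'], x['type']])

def l_number(record):
    """Sort key: the L field (4th colon-separated field) as a decimal number."""
    return Decimal(record.split(':')[3])

primaries = [
    '1-001,1:akzapAda:39,42:12',
    '1-001,1:akzamAlApratizWA:43,45:13',
]
primary_by_L = {rec.split(':')[3]: rec for rec in primaries}
extra = '<LP>12<key1P>akzapAda<L>12.1<key1>akzacaraRa<key2>akzacaraRa<type>alt'

merged = sorted(primaries + [alternate_record(extra, primary_by_L)], key=l_number)
for rec in merged:
    print(rec)
```
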
1-001,1:akzamAlApratizWA:43,45:13
drdhaval2785 commented 7 years ago

1-001,1:aMSadaSA:3,5:1

Would it not make sense to exclude meta lines from the startline and endline? The data is in the 4th line only; lines 3 and 5 are metadata. If this is kept as below, downstream programs will need only minimal changes.

1-001,1:aMSadaSA:4,4:1

drdhaval2785 commented 7 years ago

@funderburkjim, the rest is all as apt as possible.

gasyoun commented 7 years ago

> previously added alternate headwords for dictionaries AP90 and SKD

Hmm, none for VCP?

> The L-numbers should be unchanging with respect to corrections in the dictionary.

That is true as of the latest update; it was not always so.

> This representation would be weak in that it does not specify where in the 'gam' entry the prefixed form avagam occurs

What exactly do you mean? Just as we order by L-numbers, we can order here, and the inherent order will always remain. Or do you mean not the order, but the grammar data or a subentry for a subheadword?

Where and how do we note that there is a correction entry? For example, in PWG, 118042 and 118043 are additions / corrections to 872. I would add markup for that as well at this stage, so we do not have to return to it again. See aja. So I would not limit to

> type code: the type of this extra headword
> pri = primary
> alt = alternate headword
> sub (?) = sub headword

but add

cor = correction / addition

funderburkjim commented 7 years ago

> Would it not make sense to exclude meta lines from the startline and endline?

I think it makes more sense to INCLUDE the meta lines in the startline-endline range.

The main programming reason is that some downstream programs will need to use the <L> meta line. A case in point is make_xml.py. It reads each line from acchw2.txt, and extracts lines from startline to endline (inclusive) of acc.txt. Then from these extracted lines it creates an xml record for acc.xml. In this construction, make_xml.py definitely needs the <L> meta line, as it uses all the fields of that line to construct various parts of the <head> and <tail> of the xml record. Then, it uses the non-meta lines from the acc.txt entry to construct the <body> of the xml record.

A second, lesser reason is one of conceptual simplicity: the startline-endline range should point to the entire scope of lines in acc.txt that pertain to the given 'L' number of the acchw2.txt record.

There might, as the question suggests, be some downstream programs that have no interest in the meta line. A parsing routine might serve as an intermediary to make life easy for all downstream programs, whether they are interested in the meta line or not.

Maybe we name the function 'parsedig' (for parsedigitization).
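
A minimal sketch of what such an intermediary might look like follows. Only the name 'parsedig' comes from the suggestion above; the signature, the return value, and the helper `extract_entry` are assumptions made for illustration, not an existing routine.

```python
import re

def extract_entry(dig_lines, record):
    """Return lines line1..line2 (1-based, inclusive) named by an hw2 record."""
    line1, line2 = (int(n) for n in record.split(':')[2].split(','))
    return dig_lines[line1 - 1:line2]

def parsedig(entry_lines):
    """Split one entry into (meta dict, body lines).

    Assumes the entry starts with an '<L>...' meta line and ends with '<LEND>'.
    """
    if not entry_lines or not entry_lines[0].startswith('<L>'):
        raise ValueError('entry must start with an <L> meta line')
    if entry_lines[-1].strip() != '<LEND>':
        raise ValueError('entry must end with <LEND>')
    meta = dict(re.findall(r'<([^<>]+)>([^<]*)', entry_lines[0]))
    body = entry_lines[1:-1]          # the non-meta lines
    return meta, body

acc_lines = [
    '[Page1-001-a+ 36]',
    '<H>CATALOGUS CATALOGORUM. ',
    '<L>1<pc>1-001,1<k1>aMSadaSA<k2>aMSadaSA',
    '{#aMSadaSA#}¦ jy. Rice 28.',
    '<LEND>',
]
meta, body = parsedig(extract_entry(acc_lines, '1-001,1:aMSadaSA:3,5:1'))
print(meta)
print(body)
```

A program that only wants the text of the entry can ignore `meta`, while a program like make_xml.py can use both parts.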

funderburkjim commented 7 years ago

> none for VCP?

We've done some preparatory work on identifying alternate headwords, but none of this has been installed thus far.

@gasyoun BTW: Did you ever show the VCP alternate headword UI to Radha? ref

funderburkjim commented 7 years ago

also have 'cor'?

details of the example

Current pwghw2.txt records for aja (revised to show L-num)

1-0066:aja:1833,1836:872    << mentioned in comment above
1-0066:aja:1837,1838:873
5-0956:aja:134966,134967:62799
5-0956:aja:134968,134969:62800
7-1689:aja:254891,254892:118042  <<
7-1689:aja:254893,254894:118043  <<

The suggested revision to pwghw2.txt:

1-0066:aja:1833,1836:872    << no change
7-1689:aja:254891,254892:118042:cor
7-1689:aja:254893,254894:118043:cor

The suggestion is to add the metadata 'cor' to those last two records. To be useful, we would need a reference to the record being corrected, maybe via L-number.

1-0066:aja:1833,1836:872  
7-1689:aja:254891,254892:118042:cor,872
7-1689:aja:254893,254894:118043:cor,872
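
To illustrate how such a reference could be used, here is a small sketch that collects, for each corrected L-number, the L-numbers of the records that correct it. The `cor,872` form is only the suggestion made above, and the function name `corrections_by_target` is hypothetical.

```python
from collections import defaultdict

def corrections_by_target(records):
    """Map a corrected L-number to the L-numbers of its correction records."""
    targets = defaultdict(list)
    for rec in records:
        fields = rec.strip().split(':')
        if len(fields) < 5:
            continue                      # no optional type field
        kind, _, target = fields[4].partition(',')
        if kind == 'cor' and target:
            targets[target].append(fields[3])
    return dict(targets)

pwg_records = [
    '1-0066:aja:1833,1836:872',
    '7-1689:aja:254891,254892:118042:cor,872',
    '7-1689:aja:254893,254894:118043:cor,872',
]
cor_map = corrections_by_target(pwg_records)
print(cor_map)
```
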

Another place to put such meta information would be within the pwg.txt entries for those two records. Current form:

254891 <H1>000{aja}1{aja}^1¦ ³1) ²d) ¯{¤SU10RJAS. 2, 45. 13, 11.¤} -- ²f) ¯{¤SA10MAVIDH. BR. 1, 1, 17; 
              vgl. R2V. ANUKR.¤}
254892 
254893 <H1>000{aja}1{aja}^2¦ ³3) ²b) = {#avidyA#} (Comm.) ¯{¤BHA10G. P. 3, 7, 5.¤}
254894 

Possible form (using pseudo xml)

254891 <H1>000{aja}1{aja}^1¦ ³1) ²d) ¯{¤SU10RJAS. 2, 45. 13, 11.¤} -- ²f) ¯{¤SA10MAVIDH. BR. 1, 1, 17; 
              vgl. R2V. ANUKR.¤} <cor n=872>
254892 
254893 <H1>000{aja}1{aja}^2¦ ³3) ²b) = {#avidyA#} (Comm.) ¯{¤BHA10G. P. 3, 7, 5.¤} <cor n=872>
254894 

Such meta information would have potential utility. Another type of meta information might be the 'fehlerhafter' type.

My intuition is that 'cor'-type meta data might be better embedded in pwg.txt than in pwghw2.txt.