Open funderburkjim opened 7 years ago
This is a nice routine Dhaval developed to parse strings of the form <key1>val1...<keyn>valn.
It returns a dictionary d, with d['key1']=val1, etc.
The routine is quite general.
It is used to parse the meta line of acc.txt.
I think it can also be used below to parse acc_hwextra.txt.
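For readers who have not seen the routine, here is a minimal sketch of what such a parser could look like. It is not Dhaval's actual code: the function name `parse_keyval` and the regular expression are assumptions made for illustration, and the sketch assumes that values never contain a `<` character.

```python
import re

def parse_keyval(text):
    """Parse '<key1>val1...<keyn>valn' into a dict {key1: val1, ...}.

    Hypothetical helper; assumes no value contains '<'.
    """
    # Each match is a tag name plus everything up to the next '<'.
    return dict(re.findall(r'<([^<>]+)>([^<]*)', text))

# Meta line taken from acc.txt
meta = parse_keyval('<L>1<pc>1-001,1<k1>aMSadaSA<k2>aMSadaSA')
print(meta['L'], meta['pc'], meta['k1'])  # prints: 1 1-001,1 aMSadaSA
```

The same call would also handle a line in the acc_hwextra.txt format, since that line uses the same `<key>val` layout.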
Our primary understanding of dictionary structure is that a dictionary is composed of a sequence of entries, and each entry has as its primary identity a headword.
The headword in general does not completely specify an entry, because there may be multiple entries
with the same headword. This fact is the reason that it is necessary to have a separate entry
identifier (which we have called the L-number). The L-number uniquely specifies an entry. The
L-numbers for a dictionary should have further properties:
In many dictionaries, a given entry will have two or more associated headwords.
For example, in the SKD dictionary we see kube(ve)raH, whose printed version starts out as:
Our interpretation is that there are two alternate spellings for the headword of this entry: kuberaH and kuveraH.
In the digitization skd.txt there is only one entry.
We think of the 'kuberaH' spelling as the primary headword spelling for this entry.
In order to represent the fact that this entry should be accessible with the other spelling, our approach is to represent this alternate within the skdhw2.txt file associated with the digitization:
2-144:kuberaH:71062,71072:8094
2-144:kuveraH:71062,71072:8094.01:alt
In this case, the L-number associated with the primary headword spelling is 8094.
The 'synthetic' L-number associated with the alternate headword spelling is 8094.01, and we have qualified this with the :alt property to indicate that this 'extra' headword is of the type 'alternate'.
The pair of numbers 71062,71072 represents the range of lines in the digitization skd.txt that
represents the entry. Notice:
In many dictionaries, some entries contain sub-headwords. Think of PWG, which has the prefixed-verb forms nestled within the entry for a non-prefixed root. Or think of STC or other dictionaries where there are compounds of the primary headword (e.g., look up 'deva' in STC, where compounds like 'deva-karman', 'devarzi', etc. appear as subheadwords).
We have not yet attempted to represent these subheadwords. For instance, you can't directly access the prefixed root avagam in PWG, since it appears as a subheadword (-- ava) under gam.
The above scheme used for representing alternate headwords in the xxxhw2.txt file is sufficiently general to represent subheadwords as alternate pointers to the entry under which the subheadword appears; i.e., pwghw2.txt could contain:
2-0666:gam:46226,46359:21814
2-0666:avagam:46226,46359:28184.15;sub <<< .15 is just a guess as to which subheadword this is.
This representation would be weak in that it does not specify where in the 'gam' entry the prefixed form avagam occurs; but it would at least represent that avagam is mentioned somewhere in the gam entry.
To implement alternate headwords for a dictionary in the manner suggested above, we need for the xxxhw2.txt file to include the alternates, intermingled with the non-alternates.
For acc, the non-alternates are derived directly from the meta line of the acc.txt digitization.
acchw2.txt is to be constructed from two data sources:
There is an optional additional field, reserved for the 'type' code of an extra headword.
The fields are:
First few lines of acc.txt. Note: we show the acc.txt line numbers; they are not part of acc.txt.
000001 [Page1-001-a+ 36]
000002 <H>CATALOGUS CATALOGORUM.
000003 <L>1<pc>1-001,1<k1>aMSadaSA<k2>aMSadaSA
000004 {#aMSadaSA#}¦ jy. Rice 28.
000005 <LEND>
First primary record of acchw2.txt as derived from lines 3-5 of acc.txt:
1-001,1:aMSadaSA:3,5:1
pagecol = 1-001,1 (copied from `<pc>` field of meta line)
key1 = aMSadaSA (copied from `<k1>` field of meta line)
line1,line2 = 3,5 (from line numbers of entry within acc.txt)
> Note: this includes the opening meta line (line3) and the closing meta line (line5)
L = 1 (copied from `<L>` field of meta line)
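The derivation of primary records described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `primary_records` is hypothetical, and the sketch assumes that every entry opens with a line starting with `<L>` and closes with a line starting with `<LEND>`, as in the sample.

```python
import re

def primary_records(lines):
    """Yield 'pagecol:key1:line1,line2:L' for each entry of a digitization.

    `lines` holds the lines of acc.txt; line numbers are 1-based and the
    range includes both the opening meta line and the closing <LEND> line.
    """
    start = None
    meta = None
    for lnum, line in enumerate(lines, start=1):
        if line.startswith('<L>'):
            # Opening meta line: remember where the entry starts.
            start = lnum
            meta = dict(re.findall(r'<([^<>]+)>([^<]*)', line))
        elif line.startswith('<LEND>') and meta is not None:
            # Closing meta line: the entry is complete.
            yield '%s:%s:%d,%d:%s' % (meta['pc'], meta['k1'], start, lnum, meta['L'])
            meta = None

sample = [
    '[Page1-001-a+ 36]',
    '<H>CATALOGUS CATALOGORUM.',
    '<L>1<pc>1-001,1<k1>aMSadaSA<k2>aMSadaSA',
    '{#aMSadaSA#}¦ jy. Rice 28.',
    '<LEND>',
]
print(list(primary_records(sample)))  # prints: ['1-001,1:aMSadaSA:3,5:1']
```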
Although we have not yet developed lists of alternate headwords for acc, an examination of the first few lines of acc.txt provides an example of a likely alternate headword:
000039 <L>12<pc>1-001,1<k1>akzapAda<k2>akzapAda
000040 {#akzapAda#}¦ or {#akzacaraRa,#} a name of Gautama, the philo-
000041 <>sopher, Hall p. 20.
000042 <LEND>
First, the acchw2.txt entry for the primary headword will be:
1-001,1:akzapAda:39,42:12
The alternate headword is akzacaraRa.
To make the acchw2.txt entry for the alternate headword, we assume that file acc_hwextra.txt has this line (which we format similarly to the meta lines):
<LP>12<key1P>akzapAda<L>12.1<key1>akzacaraRa<key2>akzacaraRa<type>alt
The LP field value (12) is matched against the primary L-numbers, so we know the primary acchw2 record is 1-001,1:akzapAda:39,42:12 as shown above.
Thus the acchw2 record for the alternate will have fields:
Putting these fields together gives the acchw2.txt record:
1-001,1:akzacaraRa:39,42:12.1:alt
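A small sketch of this matching step, assuming (as the resulting record suggests) that pagecol and the line range are inherited from the primary record while key1, L and type come from the acc_hwextra.txt line. The function name `alternate_record` and the lookup dictionary `primary_by_L` are hypothetical.

```python
import re

def alternate_record(extra_line, primary_by_L):
    """Build the hw2 record for an extra headword.

    extra_line: a line of acc_hwextra.txt in '<key>val' format.
    primary_by_L: maps a primary L-number (string) to its hw2 record.
    """
    extra = dict(re.findall(r'<([^<>]+)>([^<]*)', extra_line))
    # Locate the primary record through the LP field.
    pagecol, _key1, linerange, _L = primary_by_L[extra['LP']].split(':')
    # Assumption: pagecol and line range are copied from the primary record;
    # key1, L and type are copied from the extra line.
    return ':'.join([pagecol, extra['key1'], linerange, extra['L'], extra['type']])

primary_by_L = {'12': '1-001,1:akzapAda:39,42:12'}
extra_line = '<LP>12<key1P>akzapAda<L>12.1<key1>akzacaraRa<key2>akzacaraRa<type>alt'
print(alternate_record(extra_line, primary_by_L))
# prints: 1-001,1:akzacaraRa:39,42:12.1:alt
```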
In terms of positioning of this alternate record within the sequence of records of acchw2, it will go in numerical order of L-number:
1-001,1:akzapAda:39,42:12
1-001,1:akzacaraRa:39,42:12.1:alt
1-001,1:akzamAlApratizWA:43,45:13
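One way to obtain this ordering is to sort on the L-number field as a decimal number. The sketch below is an assumption about how the merge could be done, not a description of the existing programs; the helper name `l_number` is hypothetical, and it assumes the type code is separated from the L-number by a colon.

```python
from decimal import Decimal

def l_number(record):
    """Return the L-number of a 'pagecol:key1:line1,line2:L[:type]' record."""
    # Decimal compares 12 < 12.1 < 13 exactly, without float rounding.
    return Decimal(record.split(':')[3])

records = [
    '1-001,1:akzamAlApratizWA:43,45:13',
    '1-001,1:akzacaraRa:39,42:12.1:alt',
    '1-001,1:akzapAda:39,42:12',
]
records.sort(key=l_number)
for rec in records:
    print(rec)
```

A plain string sort would not work here, since '13' sorts before '2' as text.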
1-001,1:aMSadaSA:3,5:1
Would it not make sense to exclude meta lines from the startline and endline? The data is in the 4th line only; lines 3 and 5 are metadata. If the record is kept as below, downstream programs will need only minimal changes.
1-001,1:aMSadaSA:4,4:1
@funderburkjim, rest all is as apt as possible.
previously added alternate headwords for dictionaries AP90 and SKD
Hmm, none for VCP?
The L-numbers should be unchanging with respect to corrections in the dictionary.
Latest update, was not always so.
This representation would be weak in that it does not specify where in the 'gam' entry the prefixed form avagam occurs
What exactly do you mean? Just as we order by L-numbers, we can order here, and the inherent order will always remain. Or do you mean not the order, but the grammar data or a subentry for a subheadword?
Where and how do we note that there is a correction entry? For example, in PWG, 118042 and 118043 are additions / corrections to 872. I would add a markup for that as well at this stage, so we do not have to return to it again. See aja. So I would not limit to
type code: the type of this extra headword
pri = primary
alt = alternate headword
sub (?) = sub headword
but add
cor = correction / addition
Would it not make sense to exclude meta lines from the startline and endline ?
I think it makes more sense to INCLUDE the meta lines in the startline-endline range.
The main programming reason is that some downstream programs will need to use the <L> meta line.
A case in point is make_xml.py. It reads each line from acchw2.txt, and extracts lines from startline to endline (inclusive) of acc.txt. Then from these extracted lines it creates an xml record for acc.xml. In this construction, make_xml.py definitely needs the <L> meta line, as it uses all the fields of that line to construct various parts of the <head> and <tail> of the xml record. Then, it uses the non-meta lines from the acc.txt entry to construct the <body> of the xml record.
A second, lesser reason is one of conceptual simplicity: the startline-endline range should point to the entire scope of lines in acc.txt that pertain to the given 'L' number of the acchw2.txt record.
There might, as the question suggests, be some downstream programs that have no interest in the meta line. A parsing routine might serve as an intermediary to make life easy for all downstream programs, whether they are interested in the meta line or not.
Maybe we name the function 'parsedig' (for parsedigitization)
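As a rough idea of what such an intermediary could look like, here is a sketch. Only the name 'parsedig' comes from the suggestion above; the signature and return value are assumptions, and the sketch assumes the entry lines run from the `<L>` meta line to the `<LEND>` line inclusive.

```python
import re

def parsedig(entry_lines):
    """Split the lines of one entry into (meta, body).

    entry_lines: lines from startline to endline inclusive, i.e. the
    '<L>' meta line, the text lines, and the closing '<LEND>' line.
    """
    # Fields of the opening meta line, e.g. {'L': '12', 'pc': '1-001,1', ...}
    meta = dict(re.findall(r'<([^<>]+)>([^<]*)', entry_lines[0]))
    # Everything between the opening and closing meta lines is the text.
    body = entry_lines[1:-1]
    return meta, body

entry = [
    '<L>12<pc>1-001,1<k1>akzapAda<k2>akzapAda',
    '{#akzapAda#}¦ or {#akzacaraRa,#} a name of Gautama, the philo-',
    '<>sopher, Hall p. 20.',
    '<LEND>',
]
meta, body = parsedig(entry)
print(meta['L'], len(body))  # prints: 12 2
```

With this shape, a program like make_xml.py could use `meta`, while a program that only cares about the text could use `body` and ignore the rest.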
none for VCP?
We've done some preparatory work on identifying alternate headwords, but none of this has been installed thus far.
@gasyoun BTW: Did you ever show the VCP alternate headword UI to Radha? ref
also have 'cor' ?
Current pwghw2.txt records for aja (revised to show L-num)
1-0066:aja:1833,1836:872 << mentioned in comment above
1-0066:aja:1837,1838:873
5-0956:aja:134966,134967:62799
5-0956:aja:134968,134969:62800
7-1689:aja:254891,254892:118042 <<
7-1689:aja:254893,254894:118043 <<
The suggested revision to pwghw2.txt:
1-0066:aja:1833,1836:872 << no change
7-1689:aja:254891,254892:118042:cor
7-1689:aja:254893,254894:118043:cor
The suggestion is to add the metadata 'cor' to those last two records. To be useful, we would need a reference to the record being corrected, maybe via L-number.
1-0066:aja:1833,1836:872
7-1689:aja:254891,254892:118042:cor,872
7-1689:aja:254893,254894:118043:cor,872
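To show how a downstream program might read such a record, here is a sketch of a parser for the proposed layout. The function name `parse_hw2` and the dictionary keys are hypothetical, and treating a missing type field as 'pri' is an assumption.

```python
def parse_hw2(record):
    """Parse 'pagecol:key1:line1,line2:L[:type[,ref]]' into a dict."""
    fields = record.split(':')
    pagecol, key1, linerange, L = fields[:4]
    line1, line2 = (int(n) for n in linerange.split(','))
    # Assumption: a record without a type field is a primary headword.
    rectype, ref = 'pri', None
    if len(fields) > 4:
        # 'cor,872' -> type 'cor', reference L-number '872'
        rectype, _, ref = fields[4].partition(',')
        ref = ref or None
    return {'pagecol': pagecol, 'key1': key1, 'line1': line1,
            'line2': line2, 'L': L, 'type': rectype, 'ref': ref}

rec = parse_hw2('7-1689:aja:254891,254892:118042:cor,872')
print(rec['type'], rec['ref'])  # prints: cor 872
```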
Another place to put such meta information would be within the pwg.txt entries for those two records. Current form.
254891 <H1>000{aja}1{aja}^1¦ ³1) ²d) ¯{¤SU10RJAS. 2, 45. 13, 11.¤} -- ²f) ¯{¤SA10MAVIDH. BR. 1, 1, 17;
vgl. R2V. ANUKR.¤}
254892
254893 <H1>000{aja}1{aja}^2¦ ³3) ²b) = {#avidyA#} (Comm.) ¯{¤BHA10G. P. 3, 7, 5.¤}
254894
Possible form (using pseudo xml)
254891 <H1>000{aja}1{aja}^1¦ ³1) ²d) ¯{¤SU10RJAS. 2, 45. 13, 11.¤} -- ²f) ¯{¤SA10MAVIDH. BR. 1, 1, 17;
vgl. R2V. ANUKR.¤} <cor n=872>
254892
254893 <H1>000{aja}1{aja}^2¦ ³3) ²b) = {#avidyA#} (Comm.) ¯{¤BHA10G. P. 3, 7, 5.¤} <cor n=872>
254894
Such meta information would have potential utility. Another type of meta information might be the 'fehlerhafter' type.
My intuition is 'cor'-type meta data might be better embedded in pwg.txt than in pwghw2.txt.
We have previously added alternate headwords for dictionaries AP90 and SKD.
We now want to add alternate headwords for ACC and other dictionaries.
acc.txt has been enhanced to contain a meta line with L-codes. This will make the process different from that used for AP90 and SKD.