Open funderburkjim opened 8 years ago
The statistics above summarize the cases presented in analysis2.txt, which is computed by the analysis2.py program. Each line of the analysis2 file corresponds to a record of the mw.xml Monier-Williams digitization at Cologne; and this line has 8 tab-delimited fields:
<key2>
field of mw.xml.
<lex> field of mw.xml; it contains the genders and occasional hints as to spelling of such things as feminine forms of adjectives.
Here is a brief description of the Reason codes. It is a paraphrase of the analysis2.py program, which is quite complex. Keep in mind that some of the derivations may seem 'weak' upon close examination. That's partly because of current incompleteness in understanding MW's system. The aim of the analysis is to perfect this understanding. These descriptions embody the current understanding. With these caveats, the descriptions will begin. I'll put each of the 17 cases into a separate comment, and will list them in the same order that the analysis2 program uses. In another issue, I'll indicate where I think we can apply energy to improve the system.
The analysis2 program initializes the output from the all.txt file. There are currently 220,265 records in these files. Note: There are 286812 records in mw.xml; all.txt skips records that are considered 'duplicates' in the context of this analysis2 investigation; these duplicates may be characterized as records which give alternate senses of other records.
all.txt has five fields, which are the same as the first 5 fields of analysis2.txt, as described above (H code, Lnum, key1, normalized key2, 'type').
The remaining 3 fields (derivation, status, reason) of analysis2 are initialized as follows.
The derivation field is set to the empty string, and the reason note is set to init.
If the 'type code' indicates a gender (m,f,n) or an indeclinable, OR
if the word is a special word (cardinal number or pronoun; type code starts with LEXID), OR
if the word is an inflected word (like agnAmarutO), where the type code starts with INFLECTID, OR
if the word is a loan word (currently only 3 of these; type code is LOAN),
then the status code is set to TODO.
OTHERWISE, the status code is set to NTD (Nothing to do). The type codes for these NTD cases indicate that the word is either a VERB, an ICF (in-compound form of a word), a SEE (purely referential word), or NONE (no classification currently available).
There are currently 22,100 NTD records.
Within the NTD group are 6851 NONE words; in some later stage of this work, these need to be examined and judicious enhancements made to MW to augment MW's omissions where possible.
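The initialization rule above can be sketched in a few lines of Python. This is only an illustration, not the code in analysis2.py: the function name `initial_status` is hypothetical, and it assumes that a gender or indeclinable type code looks like the ones in the sample records (`m`, `m:f:n`, `m:f#I:n`, `ind`).

```python
def initial_status(type_code: str) -> str:
    """Return 'TODO' or 'NTD' for a record, following the rule described above.

    Assumption: the first ':'-separated component, with any '#...' hint
    removed, identifies a gender (m, f, n) or an indeclinable (ind).
    """
    first = type_code.split(":")[0].split("#")[0]
    if first in {"m", "f", "n", "ind"}:
        return "TODO"
    if type_code.startswith(("LEXID", "INFLECTID", "LOAN")):
        return "TODO"
    # VERB, ICF, SEE and NONE all end up here.
    return "NTD"
```

For example, `initial_status("m:f:n")` gives `TODO`, while `initial_status("VERB:K")` gives `NTD`.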
The analysis now proceeds to analyze each record identified as TODO by the initialization step.
It tries the following analytical methods in order (functions analysis_all, analysis_rec):
noparts,wsfx,cpd1,gender,
cpd_nan,cpd3,inflected,pfx1,
cpd1a,pfx2,cpd4,srs2,pfxderiv,
cpd5
If any analysis succeeds, the status code is changed from TODO to DONE, and the derivation and reason fields are filled in. The reason field is filled in with the method code (noparts, wsfx, etc.) with possible auxiliary details, as will be described below.
If none of the analyses succeeds, slight variants of the analyses are tried, as will be discussed later.
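The "try each method in order, first success wins" logic can be pictured roughly as follows. This is a simplified sketch, not the actual analysis_all/analysis_rec code: the record is a plain dict, `try_methods` is a hypothetical name, and the `noparts` function shown here assumes that the three special characters are '-', '~' and '@'.

```python
def noparts(rec):
    """Succeed for LOAN words and for key2 values without '-', '~' or '@'."""
    if rec["type"].startswith("LOAN") or not any(c in rec["key2"] for c in "-~@"):
        return rec["key2"], "noparts"
    return None


def try_methods(rec, methods):
    """Apply (name, function) pairs in order; stop at the first success."""
    if rec["status"] != "TODO":
        return rec  # NTD records are left alone
    for _name, method in methods:
        result = method(rec)
        if result is not None:
            rec["derivation"], rec["reason"] = result
            rec["status"] = "DONE"
            break
    return rec
```

Because the loop stops at the first success, the order of the list is part of the analysis; a word that two methods could explain is always credited to the earlier one.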
noparts If the word is a LOAN word, or if the normalized key2 field has no parts to analyze, then the analysis succeeds, and the derivation is set to the key2 field, and the reason is noparts.
The presence of so-called parts within the key2 field is indicated by the presence of one of three special characters (these examples have parts; they are not noparts cases):
devendra is deve@ndra, which later analysis (srs2) will resolve as deva+indra.
'~' I'm not sure what to call this. It occurs in relatively few (16,000) cases in the current MW digitization. It represents an ellipsis character ° in the scans. Here is an example of one type:
1 133381 pratisaMcar pratisaMcar VERB:K
3 133382 pratisaMcara prati~saMcara m
wsfx (Whitney suffix)
Consider the first example occurring in analysis2.txt, aMSavat :
1 10 aMSa aMSa m aMSa DONE noparts
3 28 aMSavat aMSa-vat m aMSa-vat DONE wsfx:vat:w1233
aMSa is the parent of aMSa-vat. This notion of parent is central to the reasoning of the analysis2 program. It is based on (a) the sequential ordering of the dictionary entries and (b) the assigned H-codes. In the example, the H-code of aMSavat is 3, so the parent is, by definition, the previous record whose H-code is either 1 or 2. In our case that previous record is the one shown for aMSa, so that record is the parent of the aMSa-vat record.
The wsfx logic now looks at the key2 structure of aMSa-vat, and analyzes it, if possible, into two parts X-Y, where Y is the last part (vat) and X is the rest (aMSa). Now, if X is the parent headword (it is) AND if Y is in a list of secondary suffixes derived from Whitney's Grammar (it is), then the analysis of aMSavat as being a Whitney suffix of its parent succeeds, and the status field is marked as DONE, and the reason field is filled in as shown (wsfx:vat:w1233). The '1233' means that 'vat' is mentioned as a secondary suffix in section 1233 of Whitney's Grammar.
NOTE: This example also illustrates that the analysis is sensitive to the order in which the various analytical approaches are tried. It just so happens that 'vat' shows up as a word in the dictionary (not all the suffixes do so show up, I think). So, if we had done the compound analysis (cpd1 method) before the wsfx method, then aMSavat would have been classified as a compound (if it looks like a duck and walks like a duck, it must be a duck). Recognition of this as a source of misclassification led to the choice to do the wsfx method first.
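A rough sketch of the parent lookup and the wsfx test may help here. The names `find_parent` and `wsfx` below are illustrative. The suffix table holds only the two suffixes that appear in the examples of this comment, whereas the real list derived from Whitney is longer. The parent rule is generalized on the assumption that an H4 record looks back to the nearest plain H1, H2 or H3 (this matches the aMSumat / aMSumatPalA example, but the program may differ in details).

```python
# Only the two suffixes seen in the examples; values are Whitney section numbers.
WHITNEY_SUFFIXES = {"vat": "1233", "mat": "1235"}


def find_parent(records, i):
    """Nearest preceding record with a plain numeric H-code lower than record i's."""
    level = int(records[i]["h"][0])
    for j in range(i - 1, -1, -1):
        h = records[j]["h"]
        if h.isdigit() and int(h) < level:
            return records[j]
    return None


def wsfx(rec, parent):
    """Succeed when key2 is X-Y, X spells the parent and Y is a known suffix."""
    x, sep, y = rec["key2"].rpartition("-")
    if sep and parent is not None and x.replace("-", "") == parent["key1"] \
            and y in WHITNEY_SUFFIXES:
        return rec["key2"], f"wsfx:{y}:w{WHITNEY_SUFFIXES[y]}"
    return None
```

With the aMSa and aMSavat records from above, `wsfx` returns the derivation `aMSa-vat` and the reason `wsfx:vat:w1233`.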
cpd1 simple compound based on parent.
Example:
1 10 aMSa aMSa m aMSa DONE noparts
3 20 aMSakaraRa aMSa-karaRa n aMSa-karaRa DONE cpd1
The logic is similar to that for wsfx. We try to decompose key2 as 'X-Y', where X is key1 for the parent and Y is found as a headword in the MW dictionary. Of course key2 already has this particularly simple structure, and karaRa is a headword, so the analysis of aMSakaraRa succeeds as a cpd1.
Note: The analysis is sensitive to the coding of key2. Recall that what is called 'key2' in this analysis is in fact a simplification of the coding of key2 as it appears in mw.xml. The primary simplifications do such things as reduce multiple '-' to a single '-', change <srs/> to @, and change <sr/> to ~.
Suppose, in our example, that the normalized key2 turned out to be aMSa~karaRa (a tilde rather than a hyphen); then the cpd1 classification would have failed. This observation implies that it is possible that some of the remaining 8000 TODO cases may be due to a miscoding of key2.
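The three simplifications named above are easy to picture in code. The function below is a minimal sketch covering only those three; the real normalization does "such things as" these and possibly more, and the name `normalize_key2` is hypothetical.

```python
import re


def normalize_key2(raw: str) -> str:
    """Apply the three key2 simplifications mentioned in the text."""
    s = raw.replace("<srs/>", "@")  # simple replacement sandhi marker
    s = s.replace("<sr/>", "~")     # the degree-sign / ellipsis marker
    s = re.sub(r"-+", "-", s)       # collapse runs of '-' to a single '-'
    return s
```

Under these assumptions a raw form such as `deve<srs/>ndra` becomes `deve@ndra`, which is the form the later methods inspect.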
Consider example:
3 65 aMSumat aMSu-mat m:f:n aMSu-mat DONE wsfx:mat:w1235
4 73 aMSumatPalA aMSu-mat-PalA f aMSumat-PalA DONE cpd1
Here key2 is aMSu-mat-PalA, and the program is smart enough to join the first two parts of key2 to get, in effect, a better key2 'aMSumat-PalA', which passes the cpd1 analysis.
And one more example:
1 252 akAla a-kAla m a-kAla DONE cpd_nan
3 262 akAlameGodaya a-kAla-meGo@daya m akAla-meGodaya DONE cpd1
Here the program has ignored the '@' in the second component of the derivation, since meGodaya is also a headword.
Incidentally, the derivation field (third from last) shows what might be thought of as an 'improved' key2 for cpd1 cases; we may at some point want to include this improved key2 in the mw.xml file, as metadata that could be used to generate additional intra-headword links.
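The cpd1 behaviour shown in these examples (joining leading parts to match the parent, and ignoring '@' inside a part) could be sketched like this. It is an approximation written for this comment, not the program's code; `cpd1` takes the set of headwords as an argument, and the sets used with it here are tiny stand-ins for the full MW headword list.

```python
def strip_marks(s: str) -> str:
    """Drop the '@' and '~' markers, which cpd1 ignores inside a part."""
    return s.replace("@", "").replace("~", "")


def cpd1(key2, parent_key1, headwords):
    """Split key2 at each '-' in turn; succeed if left == parent and right is a headword."""
    parts = key2.split("-")
    for k in range(1, len(parts)):
        x = strip_marks("".join(parts[:k]))
        y = strip_marks("".join(parts[k:]))
        if x == parent_key1 and y in headwords:
            return f"{x}-{y}", "cpd1"
    return None
```

Note that a key2 of `aMSa~karaRa` has no '-' to split on, so this sketch returns None for it, in line with the remark about miscoded key2 values.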
gender
This applies only to cases where the record has H-code ending in 'B'; these occur, by definition, in entries where there are several senses and some senses have different gender information than that which appears in the first sense.
3 65 aMSumat aMSu-mat m:f:n aMSu-mat DONE wsfx:mat:w1235
3B 69 aMSumat aMSu-mat m aMSumat DONE gender:m of aMSumat
3B 71 aMSumatI aMSu-matI f aMSumatI DONE gender:f of aMSumat
We're not looking at parent records here; rather, for the two '3B' cases we are looking at the preceding record whose H-code is a plain '3'. For the sake of discussion, let's call that record (the first one in our example) the gender-parent. In the first 3B case, the spelling of the headword is the same as the spelling of the gender-parent, so the analysis succeeds. In the second 3B case, the spelling is identified as the feminine form of the gender-parent, so again the analysis succeeds.
Note: The identification of gender variants is rather primitive; it may be that some of the TODO cases with H-code of form '1B,2B,3B,4B' would succeed with a more sophisticated analysis.
Programming note: what I called 'gender-parent' above is obtained via the 'parenta' attribute of a headword record.
cpd_nan
This analysis applies only to cases with H-code 1 or 2. It is for identifying nan-tatpuruza compounds.
We see if the key2 form can be analyzed as 'X-Y', where X is either 'a' or 'an' and Y is found as a headword.
1 3741 adfS a-dfS m:f:n a-dfS DONE cpd_nan
Since dfS is found as a headword, the analysis of a-dfS as a cpd_nan succeeds.
Note: The observant reader may complain that 'dfS' is a verb, so surely something is wrong. However, dfS also appears in MW dictionary as a nominal form; so having dfS as a pada in a compound is meaningful.
The example does raise a subtle point. In analyzing nominal forms (or indeclinables) -- which is what this analysis2 program does -- we often have occasion to ask if a given word fragment is a headword. And in this program, we exclude verb-only headwords when responding to this question.
Further, we slightly expand the list of known substantives to include also some implied headwords; for instance, from headword adfQa we include as an additional implied headword the feminine form adfQA, even though this is not an explicit headword in MW.
1 3735 adfQa a-dfQa m:f:n a-dfQa DONE cpd_nan
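To make the "verb-only headwords are excluded" point concrete, here is a small sketch of building the nominal headword set and applying the cpd_nan test. The names `nominal_headwords` and `cpd_nan` are used loosely; the treatment of non-verb type codes such as ICF, SEE or NONE is an assumption (they are simply kept), and implied feminine forms are not modelled.

```python
def nominal_headwords(records):
    """Keep a spelling if at least one record with that key1 is not a VERB."""
    return {r["key1"] for r in records if not r["type"].startswith("VERB")}


def cpd_nan(h, key2, headwords):
    """For H-codes 1 and 2 only: key2 = X-Y with X in {a, an} and Y a headword."""
    if h not in ("1", "2"):
        return None
    x, sep, y = key2.partition("-")
    y = y.replace("-", "").replace("~", "").replace("@", "")
    if sep and x in ("a", "an") and y in headwords:
        return key2, "cpd_nan"
    return None
```

Since dfS occurs both as a verb and as a nominal, it survives the filter, and `a-dfS` is accepted; a root that occurs only as a verb would not be.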
cpd3
This form of analysis is a simple extension of the cpd1 form. The cpd1 analysis succeeds when key2 can be partitioned into two parts, where the first part is the parent and the second part is a headword. By contrast, the cpd3 analysis succeeds when key2 can be partitioned into 3 or more parts, where the first part is the parent, and the other parts are headwords. In both cases, the partitioning is done on the presence of '-' in key2. Also, the analysis ignores the presence of '~' and '@' in key2.
1 226 akARqa a-kARqa m:f:n a-kARqa DONE cpd_nan
3 229 akARqapAtajAta a-kARqa-pAta-jAta m:f:n akARqa-pAta-jAta DONE cpd3
The reason this failed the cpd1 analysis is that pAtajAta is not a headword in MW.
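As an illustration, the cpd3 idea can be sketched as below. This is a guess at the shape of the logic rather than a copy of it: the leading parts are joined until they spell the parent, and every remaining '-' part must itself be a headword. Whether the program also tries to merge the remaining parts is not covered here.

```python
def cpd3(key2, parent_key1, headwords):
    """Parent followed by two or more parts, each of which is a headword."""
    parts = [p.replace("@", "").replace("~", "") for p in key2.split("-")]
    # Leave at least two parts after the parent, so the result has 3+ parts.
    for k in range(1, len(parts) - 1):
        x = "".join(parts[:k])
        rest = parts[k:]
        if x == parent_key1 and all(p in headwords for p in rest):
            return "-".join([x] + rest), "cpd3"
    return None
```

For akARqapAtajAta this yields the derivation `akARqa-pAta-jAta`, assuming pAta and jAta are both in the headword set.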
inflected
This analysis applies only when the H-code ends in 'C'; such entries are inflected forms of the 'parenta' record. (See 'gender' section above for 'parenta').
1 226 akARqa a-kARqa m:f:n a-kARqa DONE cpd_nan
1C 228 akARqe a-kARqe ind akARqe DONE inflected:case 7 of akARqa
The 'C' designation in '1C' H code for akARqe indicates that this headword is an inflected form of its associated 'parenta', namely the '1' H-code record with headword 'akARqa'. In this case, it is recognized that akARqe is the locative (case 7) of akARqa.
A quite limited algorithm is used to determine the inflected form; but the algorithm is sufficient to handle all but 34 of the 'C' designations present in the dictionary.
Here is an example where the classification seems strained:
1C 254.1 akAlatas a-kAla-tas ind akAlatas DONE inflected:case wsfx-tas of akAla
pfx1
In this method of analysis, the key2 form must be of the form X-Y where (a) X is one of a known list of prefixes (in program variable 'known_prefixes') and (b) Y (after ignoring '-,~,@') is determined to be a known word. The 'parent' record is not involved in this method.
1 2918 atikeSara ati-keSara m ati-keSara DONE pfx1:ati
Since 'ati' is a known prefix, and 'keSara' is a headword, the pfx1 analysis succeeds.
There are several subtle points:
When 'ati' is a prefix to a word whose spelling is a vowel, then it takes the form 'aty':
1 3394 atyagni aty-agni m aty-agni DONE pfx1:aty
This is handled by including 'aty' in the list of known prefixes. And similarly for several other prefixes.
The privatives 'a' and 'an' are included in the list of known prefixes.
1 252 akAla a-kAla m a-kAla DONE cpd_nan
3 254.11 akAlaka a-kAlaka n a-kAlaka DONE pfx1:a
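A minimal sketch of pfx1 follows. The prefix set contains only the prefixes actually named in this comment (the program's known_prefixes list is longer), and the headword set is passed in as a parameter.

```python
# Only the prefixes mentioned in the text; the real list has many more.
KNOWN_PREFIXES = {"ati", "aty", "a", "an"}


def pfx1(key2, headwords):
    """key2 = X-Y with X a known prefix and Y (minus '-', '~', '@') a known word."""
    x, sep, y = key2.partition("-")
    y = y.replace("-", "").replace("~", "").replace("@", "")
    if sep and x in KNOWN_PREFIXES and y in headwords:
        return key2, f"pfx1:{x}"
    return None
```

The parent record plays no role, which is why the function takes no parent argument.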
cpd1a
This is similar to cpd1. The key2 form of the word is 'X-Y', where Y is a known word and 'X' is derived from the parent by gender or inflection.
2 88 aMsa aMsa m aMsa DONE noparts
3 101 aMseBAra aMse-BAra m aMse+BAra DONE cpd1a:aMse<-aMsa
1 1622 aNgana aNgana n aNgana DONE noparts
3 1627 aNganAgaRa aNganA-gaRa m aNganA+gaRa DONE cpd1a:aNganA<-aNgana
In the first case, aMse is the locative of parent aMsa; In the second case, aNganA is a feminine form of the parent aNgana.
Seems like a Ph.D. in itself. Good work Jim!
pfx2
This form of analysis brings into focus the subtle distinction between '-' and '~' in key2, and is the first form of analysis that actively uses '~'. The key2 form must be 'X~Y', where X is a known prefix, and Y is a known headword.
1 107824 nikam nikam VERB:K NTD init
3 107826 nikAmana ni~kAmana n ni+kAmana DONE pfx2:ni
Since 'ni' is a known prefix, and kAmana is a known headword, the analysis succeeds.
The 'parent', which is the prefixed verb 'ni-kam', is unused in the analysis.
Observation of the pfx2 cases shows that most of them occur when the parent is a prefixed root, and the child (nikAmana) is an H3. This is quite a different relation between parent and child than the prototypical cpd1 relation mentioned by Monier in his description of 4 lines of words. There is much more to say about this.
@drdhaval2785 Glad you found this. When I finish with this documentation, I'm hoping you will be intrigued to work on this and help bring it to a point of completion. It has a bearing on our correction work; because, if we can know the derivation of a word in terms of other words, then we have a strong indirect affirmation of the spelling of those words.
By the way, I'm not sure if this repository is set up to allow you to participate -- If not, and you want to participate, maybe you can remind me how to add you as a participant.
cpd4
This analysis applies only to H-codes of 1 or 2.
In this case, we split the word into parts via the '-' separator, and check if all the parts are found as headwords. If so, the analysis succeeds.
1 3126 atimanuzyabudDi ati-manuzya-budDi m:f:n ati-manuzya-budDi DONE cpd4
In this case, since all three components are headwords, the analysis succeeds.
Note that the pfx1 analysis failed because that method would require a form X-Y or ati-manuzyabudDi, and would require manuzyabudDi to be a headword, which it is not.
The gloss in MW is 'having a superhuman intellect'; 'ati-manuzya' corresponds to 'superhuman' and 'budDi' to 'intellect', so 'atimanuzya-budDi' would be a more useful derivation; and it would be useful to have a dictionary entry 'atimanuzya' (key2=ati-manuzya) with meaning 'superhuman'.
srs2
'srs' is an acronym for 'Simple Replacement Sandhi' (a term coined by Peter Scharf, I think); it is represented in our normalized coding of key2 by the '@' character. See the devendra example above.
1 95518 deva deva m:f#I:n deva DONE noparts
3 96282 devendra deve@ndra m deva+indra DONE srs2
In the srs2 analysis, the key2 spelling is split on the '@' character; if there is no '@' in key2, the analysis fails. In fact, key2 must split into two parts 'X@Y', where X and the parent P agree in all but the last character: X=Uv and P=Ur (P = deva, X = deve, U = dev, v = e, r=a; Y = ndra).
Now we want to find vowel 'z' so that the vowel sandhi of r+z = v ; typically there are two choices for 'z', since long and short vowels are indistinguishable after vowel sandhi. For our example, 'z' can be 'i' or 'I', since a+i->e, and a+I->e by simple vowel sandhi.
Now, we search for a headword spelled W ='zY' for either choice of 'z' (W='indra' or 'Indra', in our example).
If either (or both) values of W is a headword, the analysis succeeds, and we have a derivation P+W.
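The steps just described can be sketched as follows. The sandhi table is deliberately tiny: it holds only the two combinations that occur in the examples of this comment (a + i/I -> e and a + a/A -> A), and the floating-compound generalization is left out. The function name and signature are illustrative only.

```python
# (final vowel r of parent, final vowel v of X) -> candidate initial vowels z
# Partial table: only the cases that appear in the examples here.
SANDHI = {
    ("a", "e"): ["i", "I"],
    ("a", "A"): ["a", "A"],
}


def srs2(key2, parent, headwords):
    """Split key2 as X@Y and look for a headword W = z+Y with r + z -> v."""
    x, sep, y = key2.partition("@")
    if not sep or "@" in y:
        return None  # need exactly one '@'
    if len(x) != len(parent) or x[:-1] != parent[:-1]:
        return None  # X and the parent must agree in all but the last character
    r, v = parent[-1], x[-1]
    found = [f"{parent}+{z}{y}" for z in SANDHI.get((r, v), []) if z + y in headwords]
    if not found:
        return None
    reason = "srs2" if len(found) == 1 else "srs2?"
    return ",".join(found), reason
```

For deve@ndra with parent deva, only indra is a headword, so the result is `deva+indra` with reason `srs2`; when both vowel choices give headwords, the reason becomes `srs2?`.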
The program actually is slightly more general in analyzing W. Namely, we allow W to be a 'floating compound' (function floating_compounds). This means that we split W into parts on the compound separator character '-', resulting in components W1,...,Wn; then join these in all possible ways that result in headwords. For example, if W = a-b-c, then we search for 'abc', 'a' and 'bc', 'ab' and 'c', and 'a' and 'b' and 'c' as headwords, and return as derivations any that succeed, returning them as 'abc', 'a-bc', 'ab-c', 'a-b-c'.
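One way to enumerate "all possible ways of joining" is to decide, for each '-' in W, whether to keep it or to drop it. The sketch below does exactly that; it shares the name of the program's function for readability, but it is an independent illustration and the order of the returned derivations is not meant to match the program's.

```python
from itertools import product


def floating_compounds(w, headwords):
    """Return every regrouping of the '-' parts of w whose groups are all headwords."""
    parts = w.split("-")
    results = []
    # One boolean per '-': True means glue the part onto the previous group.
    for joins in product([True, False], repeat=len(parts) - 1):
        groups = [parts[0]]
        for part, join in zip(parts[1:], joins):
            if join:
                groups[-1] += part
            else:
                groups.append(part)
        if all(g in headwords for g in groups):
            results.append("-".join(groups))
    return results
```

With n parts there are 2^(n-1) groupings to check, which is small for the part counts that occur in key2.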
In the devendra case, W has only 1 part (either 'indra' or 'Indra'), and only 'indra' is a headword. So the resulting derivation is deva+indra and the srs2 derivation succeeds.
Here is an example where there are 2 possible derivations - the Reason code is marked 'srs2?' to indicate the ambiguity of the derivation. Choosing between the derivations is beyond the scope of the analysis, and indeed I don't know how this may be done in a systematic way.
1 10 aMSa aMSa m aMSa DONE noparts
3 33 aMSAMSa aMSA@MSa m aMSa+aMSa,aMSa+AMSa DONE srs2?
Here is an example where 'W' had two parts:
3 1163 agnIzomIya agnIzomIya m:f:n agnIzomIya DONE noparts
4 1167 agnIzomIyEkAdaSakapAla agnIzomI~yE@kAdaSa-kapAla m agnIzomIya+ekAdaSa-kapAla DONE srs2
pfxderiv
This analysis is based on a set of words (in file auxiliary/pfxderiv.txt) based on two sources:
For example: for ati-kram:
From:
2919 atikram ati-kram kram ati ati+kram
AND
kram kram krama kramin kramya kramaRa kramaRIya kramitavya krAma krAmin krAmya krAmaRa krAmuka krAMti krAMtf caNkrama caNkramaRa krAmayitavya
CONSTRUCT the record of pfxderiv.txt:
2919 atikram ati-kram kram ati ati+kram atikrama atikramin atikramya atikramaRa atikramaRIya atikramitavya atikrAma atikrAmin atikrAmya atikrAmaRa atikrAmuka atikrAMti atikrAMtf aticaNkrama aticaNkramaRa atikrAmayitavya
So the list of words atikrama, atikramin, etc. are considered to be available explanations.
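The construction and the lookup can be sketched as below. This is a simplification with hypothetical names: the prefix is simply concatenated with each derivative (no sandhi at the boundary), and the table is a plain dict keyed by the prefixed verb rather than the actual layout of pfxderiv.txt.

```python
def prefixed_derivatives(prefix, derivatives):
    """Prepend the prefix to each root derivative (plain concatenation)."""
    return [prefix + d for d in derivatives]


def pfxderiv(key1, parent_key1, table):
    """Succeed if key1 is among the prefixed derivatives listed for its parent."""
    entry = table.get(parent_key1)
    if entry is not None and key1 in entry["derivs"]:
        return key1, f"pfxderiv:{entry['prefix']}+{entry['root']}"
    return None


# A two-word excerpt of the kram derivatives shown above.
table = {
    "atikram": {
        "prefix": "ati",
        "root": "kram",
        "derivs": prefixed_derivatives("ati", ["krama", "kramin"]),
    }
}
```

Given this table, `pfxderiv("atikramin", "atikram", table)` returns the derivation `atikramin` with reason `pfxderiv:ati+kram`.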
1 2919 atikram atikram VERB:K NTD init
2 2931 atikramin ati-kramin m:f:n atikramin DONE pfxderiv:ati+kram
The parent of atikramin is the prefixed verb atikram. In the list of prefixed derivatives for atikram we find atikramin. Thus the analysis succeeds and we note 'ati+kram' as a sub-note in the Reason field.
The reason that ati-kramin had not been previously analyzed with the pfx1 method is that 'kramin' is not a separate headword in MW. By contrast, ati-krama, ati-kramaRa and several others under ati-kram are analyzed by the pfx1 method, since krama, kramaRa, etc. are found as separate headwords in MW. So, in effect, in the pfxderiv method, we are extending the headword list of MW to include all the Whitney derivatives, at least insofar as these derivatives appear in prefixed verb derivatives as headwords in MW.
cpd5
This analysis uses the 'floating_compound' method mentioned above in connection with srs2. The parent word is unused. We split the word into parts based on the compound separator '-', and then partition the parts into sub-sequences, and select those partitions which lead to known headwords.
1 116890 parAmfS parAmfS VERB:K NTD init
3 116901 parAmarSana parA-marSana n parA-marSana DONE cpd5
Since both parA and marSana occur as headwords, the cpd5 analysis succeeds.
It is interesting to investigate why other forms of analysis fail.
cpd1 fails since the parent is not 'parA', but rather parAmfS.
pfx1 fails since 'parA' is not included among the known prefixes (Is this omission an error in the list of known prefixes?)
We've now documented all the main analytical methods.
However, in an attempt to explain more entries, several variations to these methods were applied.
While these use the same primary reason codes (analytical methods) as just documented, they apply some pre/post processing; in analysis2.txt these cases may be found by searching for a plus sign '+' in the Reason field (last field). As of this writing, 3024 such cases are involved.
In the analysis2.py program, these variants are identified by one of five option codes:
m, cC, z, fauxcpd, removesfx
The next sections describe these options in turn.
m option code
This applies only to cases where the headword ends in 'm'.
1 252 akAla a-kAla m a-kAla DONE cpd_nan
3 265 akAlahInam a-kAla-hInam ind akAla-hIna DONE cpd1:+m
Here's how the option works. We first make a hypothetical entry by removing the final 'm' from both key1 and key2. (key1=akAlahIna, key2=a-kAla-hIna); then we cycle through the analytical methods in order. In this case, since 'akAla' is the parent, and 'hIna' is a headword, the cpd1 analysis succeeds on the hypothetical entry. We supply the resulting 'derivation' (akAla-hIna) and reason code (cpd1), and then affix the '+m' to the Reason code as a reminder that we had to apply this m-removal for the analysis to succeed.
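In code, the m option amounts to a thin wrapper around the normal analysis. The sketch below is illustrative: `with_m_option` is a hypothetical name, `analyze` stands for "cycle through the analytical methods in order" and is passed in as a function taking (key1, key2), and key2 is assumed to end in 'm' whenever key1 does.

```python
def with_m_option(key1, key2, analyze):
    """Retry the analysis on a hypothetical entry with the final 'm' removed."""
    if not key1.endswith("m") or not key2.endswith("m"):
        return None
    result = analyze(key1[:-1], key2[:-1])
    if result is None:
        return None
    derivation, reason = result
    # Tag the reason so the m-removal stays visible in the output.
    return derivation, f"{reason}:+m"
```

The same wrapper shape (transform the entry, re-run the methods, tag the reason) fits the other option codes described in the following sections; only the transformation and the tag differ.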
cC option code
This option applies only to headwords whose spelling has occasions of 'cC' . In such cases, the 'cC' is changed to 'C' in key1 and key2, and the normal analytical techniques are applied to the resulting hypothetical record. If this hypothetical record has an analysis, then the original record is analyzed identically, and the option '+cC' is affixed to the reason code.
1 1192 agra agra m:f:n agra DONE noparts
3 1219.1 agracCada agra-cCada n agra-Cada DONE cpd1:+cC
Presumably, there is a grammar rule which supports the 'cC' spelling in certain situations of compound formation.
z option code
This option deals with recognition of a few sandhi changes that occur in the course of compound formation.
subcase 1. The first one involves the change of 's' to 'z' . After a compound separator (the '-' character) in key2, a 'z' is changed to 's' (and also 'zW' ->'sT', and 'zw' -> 'st'). This results in a new hypothetical key2; if one of the analytical methods recognizes this hypothetical key2, then the original headword is assumed to have the same analysis; and the indicatory '+z' is appended to the reason code.
1 890 agni agni m agni DONE noparts
3 1091 agnizwut agni-zwut m agni+stut DONE cpd1:+z
So in this case, agnizwut is a compound of the parent 'agni' and the headword 'stut'.
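The key2 rewrite for subcase 1 can be expressed as a single regular-expression substitution. This sketch covers only that rewrite (not the re-analysis that follows it), and the function name `z_variant` is made up for this comment.

```python
import re


def z_variant(key2):
    """Undo the s -> z change right after a '-': zW -> sT, zw -> st, z -> s.

    Returns the hypothetical key2, or None if nothing changed.
    """
    def fix(match):
        return {"zW": "sT", "zw": "st"}.get(match.group(1), "s")

    new = re.sub(r"(?<=-)(zW|zw|z)", fix, key2)
    return new if new != key2 else None
```

For `agni-zwut` this gives `agni-stut`, which cpd1 can then recognize; a 'z' that does not directly follow a '-' is left alone.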
subcase 2. A second possibility changes 'a-r' in key2 to 'a-f'.
2 1928 aja aja m aja DONE noparts
3 1991 ajarzaBa aja-rzaBa m aja+fzaBa DONE cpd1:+a-r
subcase 3 A third possibility changes 'a-R' in key2 to 'a-n'
1 890 agni agni m agni DONE noparts
3 1232 agraRIti agra-RIti f agra+nIti DONE cpd1:+a-R
There are relatively few (36) of these. In general, the n-R sandhi does not cross a pada boundary in compounds (for instance, agranaKa = agra-naKa; the 'n' of 'naKa' is unaffected in the compound by the presence of 'r' in the preceding pada). Probably there are one or more special sandhi rules that are used to explain the exceptions such as agraRIti.
NOTE: It might be the case that there are other sandhis that occur in compounds that have not so far been analyzed (status code = TODO) and for which additional subcases could be developed to provide analysis.
fauxcpd option code
Although the distinction between the '-' and '~' characters in simplified key2 is typographically clear (see the scan snippets in a comment above), the distinction from the point of view of our analytical derivations is not so clear. That is, although our analytical algorithms make specific (and therefore clear) assumptions about the interpretation, it might be that looser assumptions could be useful.
This option deals with a fairly limited case of loosening this assumption. In particular, it only treats cases where there is a single '~' in key2 AND no '-' or '@'; i.e., key2 has the form 'X~Y' and X and Y are composed just of letters. In such a case, an analysis is made of the hypothetical key2 'X-Y' (change the '~' to '-'). If the hypothetical key2 is identified by analysis, then the original word is considered to be analyzed, and an indicatory '+fauxcpd' is appended to the reason code.
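The eligibility test for fauxcpd is narrow enough to write as one pattern. The sketch below assumes that "composed just of letters" means ASCII letters, as in the SLP1 spellings used throughout; it returns the hypothetical key2 to analyze, or None when the option does not apply.

```python
import re


def fauxcpd_key2(key2):
    """If key2 is exactly X~Y (letters only, no '-' or '@'), return X-Y."""
    if re.fullmatch(r"[A-Za-z]+~[A-Za-z]+", key2):
        return key2.replace("~", "-")
    return None
```

A successful analysis of the returned form would then get '+fauxcpd' appended to its reason code.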
3 1163 agnIzomIya agnIzomIya m:f:n agnIzomIya DONE noparts
4 1163.1 agnIzomIyanirvApa agnIzomIya~nirvApa m agnIzomIya-nirvApa DONE cpd1:+fauxcpd
In this example, the scan actually shows a '-', rather than the degree-sign which '~' is supposed to represent; so, we could say that this example is a coding error in MW.
1 28925 indra indra m indra DONE noparts
3 28985 indrajAla indra-jAla n indra-jAla DONE cpd1
3 28993 indrajAlika indra~jAlika m indra-jAlika DONE cpd1:+fauxcpd
In this case, the scan actually does have the degree-sign, so the '~' coding is faithful to the printing. This probably indicates that indrajAlika is considered to be derived, by a secondary suffix 'ika' from indrajAla, rather than by derivation as a compound of indra and jAlika. However, the typography of the text tends to obscure the relation to indrajAla, in my opinion. The H3 coding accurately reflects the typography, but not the derivational relation, in this case.
NOTE: It should be possible to extend the fauxcpd option to such cases as X~Y-Z, X~Y~X, X-Y~Z, etc. There is currently no way of predicting how many new analyses would accrue.
removesfx option code
Note that the indicatory sign here is +wsfx.
This option applies only to those cases where the headword (key1) spelling has the form XY, where
Y is one of the known Whitney suffixes (as used in the wsfx method). If so, then there are two
analytical possibilities:
1. The resulting 'X' is a headword. The Derivation is then X + Y, and the reason code is '+wsfx1:Y'
4 1154 agnIDra agnI@Dra m agnID+ra DONE +wsfx1:ra
Based on Whitney's description of the 'ra' suffix (as a short form of the comparative 'tara'), I think
this derivation of agnIDra should be viewed with scepticism.
2. After removing Y from both key1 and key2, the resulting hypothetical record is successfully
analyzed. Then +wsfx:Y is appended to the reason code.
1 1772 acakzus a-cakzus n a-cakzus DONE cpd_nan
2 1776 acakzuzka a-cakzuzka m:f:n a-cakzuz+ka DONE cpd_nan:+wsfx:ka
This analysis seems plausible. However, although the 'a-cakzuzka' form of key2 does agree with the printed text, wouldn't a better representation be acakzuz-ka (or maybe acakzuz~ka) ?
I think all the analytical methods and optional codes have now been described in sufficient detail for this documentation comment, and will thus call this lengthy issue comment finished.
How could I ever forget this, @funderburkjim ? Only because you've done so many great things that I've lost half of them. Now, after almost 5 years, I must admit that it is of greatest interest still. Recently @Andhrabharati made me aware of http://sanskrit.jnu.ac.in/elearning/apte/shlexicon-lexicon.txt - it contains vyutpatti, derivation data based on Apte's dictionary that could be helpful, as it's not generated, but handmade around 70 years ago for the Sanskrit-Hindi edition of the Sanskrit-English dictionary. At https://groups.google.com/g/bvparishat/c/lMJQu3Zb_Vo I got https://scl.samsaadhanii.in/scl/dhaatupaatha/graphs/BU1.svg and https://scl.samsaadhanii.in/scl/dhaatupaatha/graphs/ that are hand-marked derivations based on Apte's data.
I'm looking at: https://github.com/funderburkjim/MWderivations/blob/master/step4/analysis2.txt
derivation For nouns, this indicates the derivation of the word. Empty for verbs.
For verbs with preverbs we could point out the connections? Because MW as I understand him does not state it in an explicit way and that is one of the reasons why there are no hyperlinks in between them.
TODO Nouns for which no derivation has been found.
5667 lines in 2021.
Keep in mind that some of the derivations may seem 'weak' upon close examination.
What if we compare them with Apte?
currently 220,265 records in these files. Note: There are 286812 records in mw.xml; all.txt skips records that are considered 'duplicates' in the context of this analysis2 investigation
Does 'skips' mean they will not have to be tagged at all? Or could the tags given to the 220k records be scaled to all 286k?
if the word is a special word (cardinal number or pronoun, type code starts with LEXID)
It's like an exclusion? So we have different sublists of exclusions?
ICF (in compound for word)
Without differentiating beginning or end?
Within the NTD group are 6851 NONE words; in some later stage of this work, these need to be examined and judicious enhancements made to MW to augment MW's omissions where possible.
@drdhaval2785 , please come out of the dark.
'~' I'm not sure what to call this. It occurs in a relatively few (16,000) cases in the current MW digitization. It represents an ellipsis character ° in the scans.
We had a discussion with Serge on it lately, remember, Jim?
Recognition of this as a source of misclassification led to the choice to do wsfx method first.
Have I ever told you before that I absolutely love your approach?
This observation implies that it is possible that some of the remaining 8000 TODO cases may be due to a miscoding of key2.
Are those 8000 TODO cases still 8000?
derivation field (third from last) shows what might be thought of as an 'improved' key2 for cpd1 cases; we may at some point want to include this improved key2 in the mw.xml file, as metadata that could be used to generate additional intra-headword links.
What can I do to move it forward? Additional intra-headword links are one of the things I really miss a lot in my everyday routines.
Note: The identification of gender variants is rather primitive; it may be that some of the TODO cases with H-code of form '1B,2B,3B,4B' would succeed with a more sophisticated analysis.
This is where I ask you to take a look at my https://github.com/funderburkjim/MWderivations/issues/11
we include as an additional implied headword the feminine form adfQA, even though this is not an explicit headword in MW.
How many such alternate and non-existing entries were required?
The reason this failed the cpd1 analysis is that pAtajAta is not a headword in MW.
It failed the cpd1 analysis, but succeeded with cpd3, right?
This is handled by including 'aty' in the list of known prefixes
Tried to find the full list, but failed at https://github.com/funderburkjim/MWderivations/blob/e887f3a62c6af1299f3605d3540b8b7f56ebd974/step3/auxiliary/pfxderiv.py#L18
What is the 'right way' to think about the derivation of akAla ? I'm not sure.
@drdhaval2785 , please come out of the dark.
Observation of the pfx2 cases shows that most of them occur when the parent is a prefixed root, and the child (nikAmana) is an H3. This is quite a different relation between parent and child than the prototypical cpd1 relation mentioned by Monier in his description of 4 lines of words. There is much more to say about this.
And this is where the biggest fun actually starts, getting closer to the dhātu and sopasarga dhātu relationship that I'm so interested in.
if we can know the derivation of a word in terms of other words, then we have a strong indirect affirmation of the spelling of those words.
Exactly!
Note that the pfx1 analysis failed because that method would require
Absolutely amazing and so needed documentation of not only why it worked, but why the previous steps did not.
where there are 2 possible derivations - the Reason code is marked 'srs2?' to indicate the ambiguity of the derivation. Choosing between the derivations is beyond the scope of the analysis, and indeed I don't know how this may be done in a systematic way.
@drdhaval2785 , please come out of the dark.
in the pfxderv method, we are extending the headword list of MW to include all the Whitney derivatives
It's a gem.
pfx1 fails since 'parA' is not included among the known prefixes (Is this omission an error in the list of known prefixes?)
@drdhaval2785 , please come out of the dark.
NOTE: It might be the case that there are other sandhis that occur in compounds that have not so far been analyzed (status code = TODO) and for which additional subcases could be developed to provide analysis.
@drdhaval2785 , please come out of the dark.
indrajAlika is considered to be derived, by a secondary suffix 'ika' from indrajAla, rather than by derivation as a compound of indra and jAlika. However, the typography of the text tends to obscure the relation to indrajAla, in my opinion. The H3 coding accurately reflects the typography, but not the derivational relation, in this case.
@drdhaval2785 , please come out of the dark.
NOTE: It should be possible to extend the fauxcpd option to such cases as X~Y-Z, X~Y~X, X-Y~Z
What is missing for that next step?
Based on Whitney's description of the 'ra' suffix (as a short form of the comparative 'tara'), I think this derivation of agniDra should be viewed with scepticism.
@drdhaval2785 , please come out of the dark.
This analysis seems plausible. However, although the 'a-cakzuzka' form of key2 does agree with the printed text, wouldn't a better representation be acakzuz-ka (or maybe acakzuz~ka) ?
@drdhaval2785 , please come out of the dark.
How could I ever forget this, @funderburkjim ? Only because you've done so many great things that I've lost half of them. Now, after almost 5 years, I must admit that it is of greatest interest still. Recently @Andhrabharati made me aware of http://sanskrit.jnu.ac.in/elearning/apte/shlexicon-lexicon.txt - it contains vyutpatti, derivation data based on Apte's dictionary that could be helpful, as it's not generated, but handmade around 70 years ago for the Sanskrit-Hindi edition of the Sanskrit-English dictionary. At https://groups.google.com/g/bvparishat/c/lMJQu3Zb_Vo I got https://scl.samsaadhanii.in/scl/dhaatupaatha/graphs/BU1.svg and https://scl.samsaadhanii.in/scl/dhaatupaatha/graphs/ that are hand-marked derivations based on Apte's data.
This Skt-Hindi dictionary is a translation of Apte1890 Skt-English dictionary, need to see if that is the larger Practical ed. or the somewhat smaller Student's ed., looking closer.
And this has some extra details added (like this vyutpatti, and some annexures) benefitting the students, as the intro pages said.
This work was taken up as a competitor to Apte1957 (Prasad's revised ed.), by MLBD.
Also they (MLBD) had annexed some 10000 entries to Apte90 Practical ed., in their reprint, which I guess could be added to your Apte90 data.
This is the Student's ed. of Apte90, and has the annexure of 10000 new entries to MLBD Apte90 Practical ed., included in the Hindi translation.
Here is part of the log (step4/redo_log.txt) printed by the step4/redo.sh update, run today. I'll explain things later.