Summary of step4 - Githubissues

funderburkjim / MWderivations

Derivations of headwords in the Monier-Williams (1899) dictionary


Summary of step4 #2

Open funderburkjim opened 8 years ago

funderburkjim commented 8 years ago

Here is part of the log (step4/redo_log.txt) printed by the step4/redo.sh update, run today. I'll explain things later.

 22100  NTD init
 85770 DONE cpd1
  1418 DONE cpd1a
  4379 DONE cpd3
  1095 DONE cpd4
  1792 DONE cpd5
  3295 DONE cpd_nan
 12547 DONE gender
  1188 DONE inflected
 42772 DONE noparts
  8666 DONE pfx1
  2583 DONE pfx2
  1576 DONE pfxderiv
 15048 DONE srs2
  6492 DONE wsfx
   651 DONE wsfx1
  8893 TODO init

funderburkjim commented 8 years ago

The statistics above summarize the cases presented in analysis2.txt, which is computed by the analysis2.py program. Each line of the analysis2 file corresponds to a record of the mw.xml Monier-Williams digitization at Cologne; and this line has 8 tab-delimited fields:

H-code The Hierarchical code based on the 4-fold hierarchy of words that MW devised.
L-number The record identifier.
key1 The spelling of the headword for the record; in SLP1 transliteration.
key2 The expanded headword; this is a systematic simplification of the <key2> field of mw.xml.
type For nouns, this is a normalized representation of the <lex> field of mw.xml; it contains the genders and occasional hints as to the spelling of such things as feminine forms of adjectives.
derivation For nouns, this indicates the derivation of the word. Empty for verbs.
status A code indicating the status of derivation attempts:
- DONE a derivation has been found
- NTD Nothing to do; applies to verbs, and to words whose 'type' is NONE
- TODO Nouns for which no derivation has been found.
reason A brief indication of the part of the program that claimed to find a derivation. It starts with one of the codes listed in the summary statistics shown above.
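As a rough illustration of how the summary counts could be reproduced from analysis2.txt, here is a minimal Python sketch. The names `Analysis2Record`, `parse_line` and `tally` are hypothetical (they are not taken from analysis2.py), and the way the leading method code is cut out of the reason field is an assumption based on the reason strings shown in this thread.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Analysis2Record(NamedTuple):
    """One line of analysis2.txt: 8 tab-delimited fields (names are illustrative)."""
    hcode: str
    lnum: str
    key1: str
    key2: str
    type_code: str
    derivation: str
    status: str
    reason: str


def parse_line(line: str) -> Analysis2Record:
    # Strip only the newline: the derivation field may be empty, so tabs matter.
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}")
    return Analysis2Record(*fields)


def tally(lines: Iterable[str]) -> Counter:
    """Count records by (status, leading method code of the reason field)."""
    counts: Counter = Counter()
    for line in lines:
        rec = parse_line(line)
        # Assumption: the method code is the text before the first ':',
        # without any leading '+' or trailing '?'.
        method = rec.reason.split(":")[0].strip("+?")
        counts[(rec.status, method)] += 1
    return counts
```

Run over the whole file, a tally like this should give one count per (status, method) pair, comparable to the lines of the log above.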

funderburkjim commented 8 years ago

Here is a brief description of the Reason codes. It is a paraphrase of the analysis2.py program, which is quite complex. Keep in mind that some of the derivations may seem 'weak' upon close examination. That's partly because of current incompleteness in understanding MW's system. The aim of the analysis is to perfect this understanding. These descriptions embody the current understanding. With these caveats, the descriptions will begin. I'll put each of the 17 cases into a separate comment, and will list them in the same order that the analysis2 program uses. In another issue, I'll indicate where I think we can apply energy to improve the system.

funderburkjim commented 8 years ago

The analysis2 program initializes the output from the all.txt file. There are currently 220,265 records in these files. Note: There are 286812 records in mw.xml; all.txt skips records that are considered 'duplicates' in the context of this analysis2 investigation; these duplicates may be characterized as records which give alternate senses of other records.

all.txt has five fields, which are the same as the first 5 fields of analysis2.txt, as described above (H code, Lnum, key1, normalized key2, 'type').

The remaining 3 fields (derivation, status, reason) of analysis2 are initialized as follows.

The derivation field is set to the empty string, and the reason note is set to init.

If the 'type code' indicates a gender (m,f,n) or an indeclinable, OR if the word is a special word (cardinal number or pronoun, type code starts with LEXID), OR if the word is an inflected word (like agnAmarutO, where type code starts with INFLECTID), OR if the word is a loan word (currently only 3 of these, type code is LOAN), then the status code is set to TODO. OTHERWISE, the status code is set to NTD (Nothing to do). The type codes for these NTD cases indicate that the word is either a VERB, an ICF (in compound for word), a SEE (purely referential word), or NONE (no classification currently available).

There are currently 22,100 NTD records.

Within the NTD group are 6851 NONE words; in some later stage of this work, these need to be examined and judicious enhancements made to MW to augment MW's omissions where possible.
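The initialization rule can be sketched as follows. This is only an illustration: the function name `initial_status` is hypothetical, and the parsing of the type code (colon-separated genders, with spelling hints after '#', and 'ind' for indeclinables) is an assumption inferred from the type values that appear in the examples of this thread.

```python
NOMINAL_CODES = {"m", "f", "n", "ind"}          # genders and indeclinables
SPECIAL_PREFIXES = ("LEXID", "INFLECTID", "LOAN")


def initial_status(type_code: str) -> str:
    """Return 'TODO' for records that need a derivation, else 'NTD'."""
    if type_code.startswith(SPECIAL_PREFIXES):
        return "TODO"
    # Assumption: 'm:f#I:n' means genders m, f, n with a hint 'I' on the feminine.
    parts = [part.split("#")[0] for part in type_code.split(":")]
    if all(part in NOMINAL_CODES for part in parts):
        return "TODO"
    # VERB, ICF, SEE, NONE and anything else: nothing to do.
    return "NTD"
```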

funderburkjim commented 8 years ago

The analysis now proceeds to analyze each record identified as TODO by the initialization step.

It tries the following analytical methods in order (functions analysis_all, analysis_rec):

 noparts,wsfx,cpd1,gender,
 cpd_nan,cpd3,inflected,pfx1,
 cpd1a,pfx2,cpd4,srs2,pfxderiv,
 cpd5

If any analysis succeeds, the status code is changed from TODO to DONE, and the derivation and reason fields are filled in. The reason field is filled in with the method code (noparts, wsfx, etc) with possible auxiliary details, as will be described below.

If none of the analyses succeeds, slight variants of the analyses are tried, as will be discussed later.
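The "first method that succeeds wins" behaviour can be pictured with a small dispatch loop. This is a sketch, not the code of analysis_all or analysis_rec: the record is modelled as a plain dict, and each analyzer is assumed to be a function that returns either None or a (derivation, reason) pair.

```python
METHOD_ORDER = [
    "noparts", "wsfx", "cpd1", "gender",
    "cpd_nan", "cpd3", "inflected", "pfx1",
    "cpd1a", "pfx2", "cpd4", "srs2", "pfxderiv",
    "cpd5",
]


def analyze_record(rec: dict, analyzers: dict) -> dict:
    """Try each method in order; the first success turns TODO into DONE."""
    if rec["status"] != "TODO":
        return rec
    for name in METHOD_ORDER:
        analyzer = analyzers.get(name)
        if analyzer is None:
            continue  # method not supplied in this sketch
        result = analyzer(rec)
        if result is not None:
            derivation, reason = result
            return {**rec, "derivation": derivation,
                    "status": "DONE", "reason": reason}
    return rec  # still TODO
```

Because the loop stops at the first success, a record that two methods could both explain is always credited to the earlier one in the list.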

funderburkjim commented 8 years ago

noparts If the word is a LOAN word, or if the normalized key2 field has no parts to analyze, then the analysis succeeds, and the derivation is set to the key2 field, and the reason is noparts.

The presence of so-called parts within the key2 field is indicated by the presence of one of three special characters (these examples have parts; they are not noparts cases):

'-' indicates a simple compound. For example, under 'deva' we have devasvamin, whose key2 spelling is 'deva-svamin', which is a tatpuruza compound.
'@' indicates a compound with simple replacement sandhi. MW indicates these cases by putting a circumflex diacritic over the vowel which results from the sandhi. For instance, the normalized key2 spelling of devendra is deve@ndra, which later analysis (srs2) will resolve as deva+indra.
'~' I'm not sure what to call this. It occurs in a relatively few (16,000) cases in the current MW digitization. It represents an ellipsis character ° in the scans. Here is an example of one type:
```
1 133381  pratisaMcar pratisaMcar VERB:K
3 133382  pratisaMcara    prati~saMcara   m
```
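A minimal sketch of the noparts test, assuming that "has parts" means nothing more than "contains one of the three marker characters". The function names are hypothetical.

```python
PART_MARKERS = "-@~"


def has_parts(key2: str) -> bool:
    """True if the normalized key2 contains any of the three part markers."""
    return any(marker in key2 for marker in PART_MARKERS)


def analyze_noparts(key2: str, type_code: str):
    """Succeed for LOAN words and for key2 values without parts."""
    if type_code == "LOAN" or not has_parts(key2):
        return key2, "noparts"
    return None
```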

funderburkjim commented 8 years ago

wsfx (Whitney suffix)

Consider the first example occurring in analysis2.txt, aMSavat :

1   10  aMSa    aMSa    m   aMSa    DONE    noparts
3   28  aMSavat aMSa-vat    m   aMSa-vat    DONE    wsfx:vat:w1233

aMSa is the parent of aMSa-vat. This notion of parent is central to the reasoning of the analysis2 program. It is based on (a) the sequential ordering of the dictionary entries and (b) the assigned H-codes. In the example, the H-code of aMSavat is 3, so the parent is, by definition, the previous record whose H-code is either 1 or 2. In our case that previous record is the one shown for aMSa, so that record is the parent of the aMSa-vat record.
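The parent lookup might be sketched like this. The text above only states the rule for H-code 3; the generalization used here (the parent is the nearest earlier record with a smaller plain numeric H-code) is an assumption, and records are simplified to (H-code, key1) pairs.

```python
def find_parent(records, index):
    """Return the index of the parent of records[index], or None.

    records is a list of (hcode, key1) pairs in dictionary order.
    Codes with a letter suffix (such as '3B') are not handled here.
    """
    hcode = records[index][0]
    if not hcode.isdigit():
        return None
    level = int(hcode)
    for j in range(index - 1, -1, -1):
        other = records[j][0]
        # Assumption: any earlier plain code with a smaller number qualifies.
        if other.isdigit() and int(other) < level:
            return j
    return None
```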

The wsfx logic now looks at the key2 structure of aMSa-vat, and analyzes it, if possible, into two parts X-Y, where 'Y' is the last part (vat) and X is the rest (aMSa). Now, if X is the parent headword (it is) AND if Y is in a list of secondary suffixes derived from Whitney's Grammar (it is), then the analysis of aMSavat as being a Whitney suffix of its parent succeeds, and the status field is marked as DONE, and the reason field is filled in as shown (wsfx:vat:w1233). The '1233' means that 'vat' is mentioned as a secondary suffix in section 1233 of Whitney's Grammar.
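In code, the wsfx test could look roughly like the sketch below. The suffix table holds only the two suffixes whose Whitney sections are quoted in this thread; the real list is longer. Comparing X to the parent after dropping any '-' inside X is an assumption.

```python
# Only the two suffixes whose sections appear in this thread; illustrative.
WHITNEY_SUFFIX_SECTIONS = {"vat": "1233", "mat": "1235"}


def analyze_wsfx(key2: str, parent_key1: str):
    """Split key2 as X-Y on the last '-'; succeed if X is the parent and Y a suffix."""
    if "-" not in key2:
        return None
    stem, suffix = key2.rsplit("-", 1)
    # Assumption: hyphens inside X are ignored when comparing with the parent.
    if stem.replace("-", "") != parent_key1:
        return None
    section = WHITNEY_SUFFIX_SECTIONS.get(suffix)
    if section is None:
        return None
    return key2, f"wsfx:{suffix}:w{section}"
```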

NOTE : This example also illustrates that the analysis is sensitive to the order in which the various analytical approaches are tried. It just so happens that 'vat' shows up as a word in the dictionary (not all the suffixes do so show up, I think). So, if we had done the compound analysis (cpd1 method) before the wsfx method, then aMSavat would have been classified as a compound (if it looks like a duck and walks like a duck, it must be a duck). Recognition of this as a source of misclassification led to the choice to do wsfx method first.

funderburkjim commented 8 years ago

cpd1 simple compound based on parent.

Example:

1   10  aMSa    aMSa    m   aMSa    DONE    noparts
3   20  aMSakaraRa  aMSa-karaRa n   aMSa-karaRa DONE    cpd1

The logic is similar to that for wsfx. We try to decompose key2 as 'X-Y', where X is key1 for the parent and Y is found as a headword in the MW dictionary. Of course key2 already has this particularly simple structure, and karaRa is a headword, so the analysis of aMSakaraRa succeeds as a cpd1.

Note: The analysis is sensitive to the coding of key2. Recall that what is called 'key2' in this analysis is in fact a simplification of the coding of key2 as it appears in mw.xml. The primary simplifications do such things as reduce multiple '-' to a single '-', change <srs/> to @, and change <sr/> to ~.
Suppose, in our example, that the (normalized key2) turned out to be aMSa~karaRa (a tilde rather than a hyphen); then, the cpd1 classification would have failed. This observation implies that it is possible that some of the remaining 8000 TODO cases may be due to a miscoding of key2.

Consider example:

3   65  aMSumat aMSu-mat    m:f:n   aMSu-mat    DONE    wsfx:mat:w1235
4   73  aMSumatPalA aMSu-mat-PalA   f   aMSumat-PalA      DONE  cpd1

Here key2 is aMSu-mat-PalA, and the program is smart enough to join the first two parts of key2 to get, in effect, a better key2 'aMSumat-PalA', which passes the cpd1 analysis.

And one more example:

1   252 akAla   a-kAla  m   a-kAla  DONE    cpd_nan
3   262 akAlameGodaya   a-kAla-meGo@daya    m   akAla-meGodaya  DONE    cpd1

Here the program has ignored the '@' in the second component of the derivation, since meGodaya is also a headword.
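The three cpd1 examples can be reproduced with a sketch along these lines. It is an illustration under assumptions: leading parts of key2 are joined until they spell the parent's key1, and the remainder (with '-' and '@' dropped) must be a single headword. The `headwords` set stands in for the real headword list.

```python
def analyze_cpd1(key2: str, parent_key1: str, headwords: set):
    """Decompose key2 as X-Y with X the parent's key1 and Y a headword."""
    parts = key2.split("-")
    for i in range(1, len(parts)):
        if "".join(parts[:i]) != parent_key1:
            continue
        # Join the remaining parts and ignore '@', as in the meGo@daya example.
        rest = "".join(parts[i:]).replace("@", "")
        if rest in headwords:
            return f"{parent_key1}-{rest}", "cpd1"
    return None
```

Note that a key2 coded with '~' instead of '-' never splits, so it fails here, which matches the miscoding remark above.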

Incidentally, the derivation field (third from last) shows what might be thought of as an 'improved' key2 for cpd1 cases; we may at some point want to include this improved key2 in the mw.xml file, as metadata that could be used to generate additional intra-headword links.

funderburkjim commented 8 years ago

gender

This applies only to cases where the record has H-code ending in 'B'; these occur, by definition, in entries where there are several senses and some senses have different gender information than that which appears in the first sense.

3   65  aMSumat aMSu-mat    m:f:n   aMSu-mat    DONE    wsfx:mat:w1235
3B  69  aMSumat aMSu-mat    m   aMSumat DONE    gender:m of aMSumat
3B  71  aMSumatI    aMSu-matI   f   aMSumatI    DONE    gender:f of aMSumat

We're not looking at parent records here, but, for the two '3B' cases, we are looking at the preceding record whose H-code is a plain '3'; for sake of discussion, let's call that record (the first one in our example) the gender-parent. In the first 3B case, the spelling of the headword is the same as the spelling of the gender-parent, so the analysis succeeds. In the second 3B case, the spelling is identified as the feminine form of the gender-parent, so again the analysis succeeds.

Note: The identification of gender variants is rather primitive; it may be that some of the TODO cases with H-code of form '1B,2B,3B,4B' would succeed with a more sophisticated analysis.

Programming note: what I called 'gender-parent' above is obtained via the 'parenta' attribute of a headword record.

funderburkjim commented 8 years ago

cpd_nan

This analysis applies only to cases with H-code 1 or 2. It is for identifying nan-tatpuruza compounds.

We see if the key2 form can be analyzed as 'X-Y', where X is either 'a' or 'an' and Y is found as a headword.

1   3741    adfS    a-dfS   m:f:n   a-dfS   DONE    cpd_nan

Since dfS is found as a headword, the analysis of a-dfS as a cpd_nan succeeds.

Note: The observant reader may complain that 'dfS' is a verb, so surely something is wrong. However, dfS also appears in MW dictionary as a nominal form; so having dfS as a pada in a compound is meaningful.

The example does raise a subtle point. In analyzing nominal forms (or indeclinables) -- which is what this analysis2 program does -- we often have occasion to ask if a given word fragment is a headword. And in this program, we exclude verb-only headwords when responding to this question.

Further, we slightly expand the list of known substantives to include also some implied headwords; for instance from headword adfQa we include as an additional implied headword

1   3735    adfQa   a-dfQa  m:f:n   a-dfQa  DONE    cpd_nan

the feminine form adfQA, even though this is not an explicit headword in MW.
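A sketch of the cpd_nan test, assuming key2 is split at its first '-'. The set passed as `nominal_headwords` is assumed to be already filtered (verb-only headwords removed, implied feminine forms added) as described above; building that set is not shown.

```python
def analyze_cpd_nan(hcode: str, key2: str, nominal_headwords: set):
    """Succeed when key2 is 'a-Y' or 'an-Y', Y a nominal headword, H-code 1 or 2."""
    if hcode not in ("1", "2"):
        return None
    prefix, sep, rest = key2.partition("-")
    if sep and prefix in ("a", "an") and rest in nominal_headwords:
        return key2, "cpd_nan"
    return None
```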

funderburkjim commented 8 years ago

cpd3

This form of analysis is a simple extension of the cpd1 form. Where the cpd1 analysis succeeds when key2 can be partitioned into two parts, where the first part is the parent and the second part is a headword, by contrast the cpd3 analysis succeeds when key2 can be partitioned into 3 or more parts, where the first part is the parent, and the other parts are headwords. In both cases, the partitioning is done on the presence of '-' in key2. Also, the analysis ignores the presence of '~' and '@' in key2.

1   226 akARqa  a-kARqa m:f:n   a-kARqa DONE    cpd_nan
3   229 akARqapAtajAta  a-kARqa-pAta-jAta   m:f:n   akARqa-pAta-jAta    DONE    cpd3

The reason this failed the cpd1 analysis is that pAtajAta is not a headword in MW.
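For comparison with cpd1, here is a hedged sketch of cpd3. As before, leading parts are joined until they spell the parent; what differs is that the remaining parts (at least two) must each be a headword. Dropping '~' and '@' before splitting is one possible reading of "ignores", so treat it as an assumption.

```python
def analyze_cpd3(key2: str, parent_key1: str, headwords: set):
    """Parent followed by two or more parts that are all headwords."""
    parts = key2.replace("~", "").replace("@", "").split("-")
    for i in range(1, len(parts)):
        if "".join(parts[:i]) != parent_key1:
            continue
        rest = parts[i:]
        if len(rest) >= 2 and all(part in headwords for part in rest):
            return "-".join([parent_key1] + rest), "cpd3"
    return None
```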

funderburkjim commented 8 years ago

inflected

This analysis applies only when the H-code ends in 'C'; such entries are inflected forms of the 'parenta' record. (See 'gender' section above for 'parenta').

1   226 akARqa  a-kARqa m:f:n   a-kARqa DONE    cpd_nan
1C  228 akARqe  a-kARqe ind akARqe  DONE    inflected:case 7 of akARqa

The 'C' designation in '1C' H code for akARqe indicates that this headword is an inflected form of its associated 'parenta', namely the '1' H-code record with headword 'akARqa'. In this case, it is recognized that akARqe is the locative (case 7) of akARqa.

A quite limited algorithm is used to determine the inflected form; but the algorithm is sufficient to handle all but 34 of the 'C' designations present in the dictionary.

Here is an example where the classification seems strained:

1C  254.1   akAlatas    a-kAla-tas  ind akAlatas    DONE    inflected:case wsfx-tas of akAla

funderburkjim commented 8 years ago

pfx1

In this method of analysis, the key2 form must be of the form X-Y where (a) X is one of a known list of prefixes (in program variable 'known_prefixes') and (b) Y (after ignoring '-,~,@') is determined to be a known word. The 'parent' record is not involved in this method.

1   2918    atikeSara   ati-keSara  m   ati-keSara  DONE    pfx1:ati

Since 'ati' is a known prefix, and 'keSara' is a headword, the pfx1 analysis succeeds.

There are several subtle points:

When 'ati' is a prefix to a word whose spelling begins with a vowel, then it takes the form 'aty':
```
1   3394    atyagni aty-agni    m   aty-agni    DONE    pfx1:aty
```
This is handled by including 'aty' in the list of known prefixes. And similarly for several other prefixes.
The privatives 'a' and 'an' are included in the list of known prefixes.
```
1 252 akAla   a-kAla  m   a-kAla  DONE    cpd_nan
3 254.11  akAlaka a-kAlaka    n   a-kAlaka    DONE    pfx1:a
```
- Why was 'akAlaka' not analyzed as a 'cpd_nan', since 'kAlaka' is a headword? The reason is that it is an 'H3', whereas cpd_nan analysis is written to handle only H1 and H2 cases.
- Why was 'akAlaka' not analyzed as a 'wsfx' , since it is formed by adding the suffix 'ka' to its parent 'akAla' ? The reason is that the key2 spelling is not given as 'akAla-ka'.
- What is the 'right way' to think about the derivation of akAla ? I'm not sure.
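The pfx1 examples above can be reproduced with a sketch like the following. The prefix set contains only the prefixes named in this comment; the real 'known_prefixes' variable holds more. Splitting key2 at its first '-' is an assumption.

```python
# Only the prefixes mentioned in this comment; the real list is longer.
KNOWN_PREFIXES = {"ati", "aty", "a", "an"}


def analyze_pfx1(key2: str, headwords: set, known_prefixes=KNOWN_PREFIXES):
    """key2 = X-Y with X a known prefix and Y (markers ignored) a known word."""
    prefix, sep, rest = key2.partition("-")
    if not sep or prefix not in known_prefixes:
        return None
    word = rest.replace("-", "").replace("~", "").replace("@", "")
    if word in headwords:
        return key2, f"pfx1:{prefix}"
    return None
```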

funderburkjim commented 8 years ago

cpd1a

This is similar to cpd1. The key2 form of the word is 'X-Y', where Y is a known word and 'X' is derived from the parent by gender or inflection.

2   88  aMsa    aMsa    m   aMsa    DONE    noparts
3   101 aMseBAra    aMse-BAra   m   aMse+BAra   DONE    cpd1a:aMse<-aMsa

1   1622    aNgana  aNgana  n   aNgana  DONE    noparts
3   1627    aNganAgaRa  aNganA-gaRa m   aNganA+gaRa DONE    cpd1a:aNganA<-aNgana

In the first case, aMse is the locative of parent aMsa; In the second case, aNganA is a feminine form of the parent aNgana.

drdhaval2785 commented 8 years ago

Seems like a Ph.D. in itself. Good work Jim!

funderburkjim commented 8 years ago

pfx2

This form of analysis brings into focus the subtle distinction between '-' and '~' in key2, and is the first form of analysis that actively uses '~'. The key2 form must be 'X~Y', where X is a known prefix, and Y is a known headword.

1   107824  nikam   nikam   VERB:K      NTD init
3   107826  nikAmana    ni~kAmana   n   ni+kAmana   DONE    pfx2:ni

Since 'ni' is a known prefix, and kAmana is a known headword, the analysis succeeds.

The 'parent', which is the prefixed verb 'ni-kam', is unused in the analysis.

Observation of the pfx2 cases shows that most of them occur when the parent is a prefixed root, and the child (nikAmana) is an H3. This is quite a different relation between parent and child than the prototypical cpd1 relation mentioned by Monier in his description of 4 lines of words. There is much more to say about this.

funderburkjim commented 8 years ago

@drdhaval2785 Glad you found this. When I finish with this documentation, I'm hoping you will be intrigued to work on this and help bring it to a point of completion. It has a bearing on our correction work; because, if we can know the derivation of a word in terms of other words, then we have a strong indirect affirmation of the spelling of those words.

By the way, I'm not sure if this repository is set up to allow you to participate -- If not, and you want to participate, maybe you can remind me how to add you as a participant.

funderburkjim commented 8 years ago

cpd4

This analysis applies only to H-codes of 1 or 2.

In this case, we split the word into parts via the '-' separator, and check if all the parts are found as headwords. If so, the analysis succeeds.

1   3126    atimanuzyabudDi ati-manuzya-budDi   m:f:n   ati-manuzya-budDi   DONE    cpd4

In this case, since all three components are headwords, the analysis succeeds.

Note that the pfx1 analysis failed because that method would require a form X-Y or ati-manuzyabudDi, and would require manuzyabudDi to be a headword, which it is not.

The gloss in MW is 'having a superhuman intellect'; 'ati-manuzya' corresponds to 'superhuman' and 'budDi' to intellect, so 'atimanuzya-budDi' would be a more useful derivation; and it would be useful to have a dictionary entry 'atimanuzya' (key2=ati-manuzya) with meaning 'superhuman'.
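A minimal sketch of the cpd4 check, assuming the parts are compared to the headword list exactly as they appear between the '-' separators. The function name and the `headwords` set are illustrative.

```python
def analyze_cpd4(hcode: str, key2: str, headwords: set):
    """H-code 1 or 2 only: every '-'-separated part must be a headword."""
    if hcode not in ("1", "2"):
        return None
    parts = key2.split("-")
    if len(parts) >= 2 and all(part in headwords for part in parts):
        return key2, "cpd4"
    return None
```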

funderburkjim commented 8 years ago

srs2

'srs' is an acronym for 'Simple Replacement Sandhi' (a term coined by Peter Scharf, I think); it is represented in our normalized coding of key2 by the '@' character. See devendra example above.

1   95518   deva    deva    m:f#I:n deva    DONE    noparts
3   96282   devendra    deve@ndra   m   deva+indra  DONE    srs2

In the srs2 analysis, the key2 spelling is split on the '@' character; if there is no '@' in key2, the analysis fails. In fact, key2 must split into two parts 'X@Y', where X and the parent P agree in all but the last character: X=Uv and P=Ur (P = deva, X = deve, U = dev, v = e, r=a; Y = ndra).

Now we want to find vowel 'z' so that the vowel sandhi of r+z = v ; typically there are two choices for 'z', since long and short vowels are indistinguishable after vowel sandhi. For our example, 'z' can be 'i' or 'I', since a+i->e, and a+I->e by simple vowel sandhi.

Now, we search for a headword spelled W ='zY' for either choice of 'z' (W='indra' or 'Indra', in our example).

If either (or both) values of W is a headword, the analysis succeeds, and we have a derivation P+W.

The program actually is slightly more general in analyzing W. Namely, we allow W to be a 'floating compound' (function floating_compounds). This means that we split W into parts on the compound separator character '-', resulting in components W1,...,Wn; then join these in all possible ways that result in headwords. For example, if W = a-b-c, then we search for 'abc', 'a' and 'bc', 'ab' and 'c', and 'a' and 'b' and 'c' as headwords and return as derivations any that succeed, returning as 'abc', 'a-bc', 'ab-c', 'a-b-c'.
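The floating-compound idea can be sketched by enumerating, for each '-' in W, whether the separator is kept or the neighbouring parts are joined. This is an illustration of the description above, not the code of the floating_compounds function in analysis2.py; the order of the returned list is arbitrary.

```python
from itertools import product


def floating_compounds(w: str, headwords: set) -> list:
    """Return every regrouping of the '-'-separated parts of w into headwords."""
    parts = w.split("-")
    results = []
    # One boolean per separator: True joins the two neighbours, False keeps '-'.
    for joins in product([True, False], repeat=len(parts) - 1):
        groups = [parts[0]]
        for part, joined in zip(parts[1:], joins):
            if joined:
                groups[-1] += part
            else:
                groups.append(part)
        if all(group in headwords for group in groups):
            results.append("-".join(groups))
    return results
```

With n parts there are 2^(n-1) groupings to test, which stays small for the short compounds found in key2.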

In the devendra case, W has only 1 part (either 'indra' or 'Indra'), and only 'indra' is a headword. So the resulting derivation is deva+indra and the srs2 derivation succeeds.
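The steps above (split on '@', compare X with the parent, undo the vowel sandhi, look up W) can be sketched as follows. This is a simplified illustration: it handles only parents ending in 'a' or 'A', treats W as a single word (no floating compounds), and uses a small sandhi table written for this sketch rather than taken from analysis2.py.

```python
# Resulting vowel v -> possible second vowels z, for a first vowel a/A (SLP1).
SANDHI_SECOND_VOWELS = {
    "A": ("a", "A"),
    "e": ("i", "I"),
    "o": ("u", "U"),
    "E": ("e", "E"),
    "O": ("o", "O"),
}


def analyze_srs2(key2: str, parent_key1: str, headwords: set):
    """Resolve key2 = X@Y against the parent P by undoing simple vowel sandhi."""
    if key2.count("@") != 1 or not parent_key1:
        return None
    x, y = key2.split("@")
    x = x.replace("~", "").replace("-", "")
    # X and P must agree in all but the last character.
    if len(x) != len(parent_key1) or x[:-1] != parent_key1[:-1]:
        return None
    r, v = parent_key1[-1], x[-1]
    if r not in ("a", "A"):
        return None  # other first vowels are not covered in this sketch
    found = [z + y for z in SANDHI_SECOND_VOWELS.get(v, ()) if z + y in headwords]
    if not found:
        return None
    derivation = ",".join(f"{parent_key1}+{w}" for w in found)
    reason = "srs2" if len(found) == 1 else "srs2?"
    return derivation, reason
```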

Here is an example where there are 2 possible derivations - the Reason code is marked 'srs2?' to indicate the ambiguity of the derivation. Choosing between the derivations is beyond the scope of the analysis, and indeed I don't know how this may be done in a systematic way.

1   10  aMSa    aMSa    m   aMSa    DONE    noparts
3   33  aMSAMSa aMSA@MSa    m   aMSa+aMSa,aMSa+AMSa DONE    srs2?

Here is an example where 'W' had two parts:

3   1163    agnIzomIya  agnIzomIya  m:f:n   agnIzomIya  DONE    noparts
4   1167    agnIzomIyEkAdaSakapAla  agnIzomI~yE@kAdaSa-kapAla   m   agnIzomIya+ekAdaSa-kapAla   DONE    srs2

funderburkjim commented 8 years ago

pfxderiv

This analysis is based on a set of words (in file auxiliary/pfxderiv.txt) based on two sources:

deriv.txt, a list of derivative forms from the book 'Roots, verb-forms and primary derivatives of the Sanskrit language' by Whitney, as digitized by Peter Scharf et al., and
verb-prep4-gati2-complete.out, an analysis of prefixed verbs in MW.

For example: for ati-kram:

From:
2919    atikram ati-kram    kram    ati ati+kram
AND
kram    kram    krama kramin kramya kramaRa kramaRIya kramitavya krAma krAmin krAmya krAmaRa krAmuka krAMti krAMtf caNkrama caNkramaRa krAmayitavya
CONSTRUCT the record of pfxderiv.txt:
2919    atikram ati-kram    kram    ati ati+kram    atikrama atikramin atikramya atikramaRa atikramaRIya atikramitavya atikrAma atikrAmin atikrAmya atikrAmaRa atikrAmuka atikrAMti atikrAMtf aticaNkrama aticaNkramaRa atikrAmayitavya

So the list of words atikrama, atikramin, etc. are considered to be available explanations.

1   2919    atikram atikram VERB:K      NTD init
2   2931    atikramin   ati-kramin  m:f:n   atikramin   DONE    pfxderiv:ati+kram

The parent of atikramin is the prefixed verb atikram. In the list of prefixed derivatives for atikram we find atikramin. Thus the analysis succeeds and we note 'ati+kram' as a sub-note in the Reason field.
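The construction of a pfxderiv.txt record and the lookup that uses it might be sketched as below. The dict layout and both function names are hypothetical; the plain concatenation of prefix and derivative follows the ati-kram example above, where no sandhi is involved.

```python
def build_pfxderiv_entry(prefix: str, root: str, root_derivatives: list) -> dict:
    """Attach the prefix to every Whitney derivative of the root."""
    return {
        "prefix": prefix,
        "root": root,
        "derivatives": {prefix + d for d in root_derivatives},
    }


def analyze_pfxderiv(key1: str, parent_key1: str, table: dict):
    """Succeed if key1 is a prefixed derivative listed under its parent verb."""
    entry = table.get(parent_key1)
    if entry is None or key1 not in entry["derivatives"]:
        return None
    return key1, f"pfxderiv:{entry['prefix']}+{entry['root']}"
```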

The reason that ati-kramin had not been previously analyzed with the pfx1 method is that 'kramin' is not a separate headword in MW. By contrast, ati-krama, ati-kramaRa and several others under ati-kram are analyzed by the pfx1 method, since krama, kramaRa, etc. are found as separate headwords in MW. So, in effect, in the pfxderiv method, we are extending the headword list of MW to include all the Whitney derivatives, at least insofar as these derivatives appear in prefixed verb derivatives as headwords in MW.

funderburkjim commented 8 years ago

cpd5

This analysis uses the 'floating_compound' method mentioned above in connection with srs2. The parent word is unused. We split the word into parts based on the compound separator '-', and then partition the parts into sub-sequences, and select those partitions which lead to known headwords.

1   116890  parAmfS parAmfS VERB:K      NTD init
3   116901  parAmarSana parA-marSana    n   parA-marSana    DONE    cpd5

Since both parA and marSana occur as headwords, the cpd5 analysis succeeds.

It is interesting to investigate why other forms of analysis fail.
cpd1 fails since the parent is not 'parA', but rather parAmfS.

pfx1 fails since 'parA' is not included among the known prefixes (Is this omission an error in the list of known prefixes?)

pfxderiv fails due to an error in the digitization deriv.txt (just discovered in the course of these notes). Namely, under root 'mfS' (to touch), the derivative 'marSana' is misspelled as 'maSana'.

funderburkjim commented 8 years ago

We've now documented all the main analytical methods.

However, in an attempt to explain more entries, several variations to these methods were applied.

While these use the same primary reason codes (analytical methods) as just documented, they apply some pre/post processing; in analysis2.txt these cases may be found by searching for a plus sign '+' in the Reason field (last field). As of this writing, 3024 such cases are involved.

In the analysis2.py program, these variants are identified by one of five option codes:

m, cC, z, fauxcpd, removesfx

The next sections describe these options in turn.

funderburkjim commented 8 years ago

m option code

This applies only to cases where the headword ends in 'm'.

1   252 akAla   a-kAla  m   a-kAla  DONE    cpd_nan
3   265 akAlahInam  a-kAla-hInam    ind akAla-hIna  DONE    cpd1:+m

Here's how the option works. We first make a hypothetical entry by removing the final 'm' from both key1 and key2. (key1=akAlahIna, key2=a-kAla-hIna); then we cycle through the analytical methods in order. In this case, since 'akAla' is the parent, and 'hIna' is a headword, the cpd1 analysis succeeds on the hypothetical entry. We supply the resulting 'derivation' (akAla-hIna) and reason code (cpd1), and then affix the '+m' to the Reason code as a reminder that we had to apply this m-removal for the analysis to succeed.
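The m option is essentially a wrapper around the normal analysis. In the sketch below, `analyze` is a stand-in for "cycle through the analytical methods in order": it is assumed to take (key1, key2) and return None or a (derivation, reason) pair.

```python
def analyze_with_m_option(key1: str, key2: str, analyze):
    """Retry the analysis on a hypothetical entry with the final 'm' removed."""
    if not (key1.endswith("m") and key2.endswith("m")):
        return None
    result = analyze(key1[:-1], key2[:-1])
    if result is None:
        return None
    derivation, reason = result
    # Record that the m-removal was needed for the analysis to succeed.
    return derivation, f"{reason}:+m"
```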

funderburkjim commented 8 years ago

cC option code

This option applies only to headwords whose spelling contains occurrences of 'cC'. In such cases, the 'cC' is changed to 'C' in key1 and key2, and the normal analytical techniques are applied to the resulting hypothetical record. If this hypothetical record has an analysis, then the original record is analyzed identically, and the option '+cC' is affixed to the reason code.

1   1192    agra    agra    m:f:n   agra    DONE    noparts
3   1219.1  agracCada   agra-cCada  n   agra-Cada   DONE    cpd1:+cC

Presumably, there is a grammar rule which supports the 'cC' spelling in certain situations of compound formation.

funderburkjim commented 8 years ago

z option code

This option deals with recognition of a few sandhi changes that occur in the course of compound formation.

subcase 1. The first one involves the change of 's' to 'z' . After a compound separator (the '-' character) in key2, a 'z' is changed to 's' (and also 'zW' ->'sT', and 'zw' -> 'st'). This results in a new hypothetical key2; if one of the analytical methods recognizes this hypothetical key2, then the original headword is assumed to have the same analysis; and the indicatory '+z' is appended to the reason code.

1   890 agni    agni    m   agni    DONE    noparts
3   1091    agnizwut    agni-zwut   m   agni+stut   DONE    cpd1:+z

So in this case, agnizwut is a compound of the parent 'agni' and the headword 'stut'.

subcase 2 A second possibility changes 'a-r' in key2 to 'a-f'.

2   1928    aja aja m   aja DONE    noparts
3   1991    ajarzaBa    aja-rzaBa   m   aja+fzaBa   DONE    cpd1:+a-r

subcase 3 A third possibility changes 'a-R' in key2 to 'a-n'

1   1192    agra    agra    m:f:n   agra    DONE    noparts
3   1232    agraRIti    agra-RIti   f   agra+nIti   DONE    cpd1:+a-R

There are relatively few (36) of these. In general, the n-R sandhi does not cross a pada boundary in compounds (for instance, agranaKa = agra-naKa, the 'n' of 'naKa' is unaffected in the compound by the presence of 'r' in the preceding pada). Probably there are one or more special sandhi rules that are used to explain the exceptions such as agraRIti.

NOTE: It might be the case that there are other sandhis that occur in compounds that have not so far been analyzed (status code = TODO) and for which additional subcases could be developed to provide analysis.
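The three subcases amount to rewriting key2 and retrying. A sketch of the rewriting step only is given below; the function name is hypothetical, and applying each replacement to every occurrence in key2 is an assumption. Each returned pair is a hypothetical key2 together with the tag that would be appended to the reason code if the retry succeeds.

```python
import re

Z_REPLACEMENTS = {"zW": "sT", "zw": "st", "z": "s"}


def z_option_variants(key2: str) -> list:
    """Return (hypothetical key2, reason tag) pairs for the three subcases."""
    variants = []
    # Subcase 1: undo s -> z (and sT -> zW, st -> zw) right after a '-'.
    z_fixed = re.sub(r"-(zW|zw|z)",
                     lambda m: "-" + Z_REPLACEMENTS[m.group(1)], key2)
    if z_fixed != key2:
        variants.append((z_fixed, "+z"))
    # Subcase 2: 'a-r' -> 'a-f'.
    if "a-r" in key2:
        variants.append((key2.replace("a-r", "a-f"), "+a-r"))
    # Subcase 3: 'a-R' -> 'a-n'.
    if "a-R" in key2:
        variants.append((key2.replace("a-R", "a-n"), "+a-R"))
    return variants
```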

funderburkjim commented 8 years ago

fauxcpd option code

Although the distinction between the '-' and '~' characters in simplified key2 is typographically clear (see the scan snippets in a comment above), the distinction from the point of view of our analytical derivations is not so clear. That is, although our analytical algorithms make specific (and therefore clear) assumptions about the interpretation, it might be that looser assumptions could be useful.

This option deals with a fairly limited case of loosening this assumption. In particular, it only treats cases where there is a single '~' in key2 AND no '-' or '@'; i.e., key2 has the form 'X~Y' and X and Y are composed just of letters. In such a case, an analysis is made of the hypothetical key2 'X-Y' (change the '~' to '-'). If the hypothetical key2 is identified by analysis, then the original word is considered to be analyzed, and an indicatory '+fauxcpd' is appended to the reason code.
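The precondition for the fauxcpd option is easy to state in code. The sketch below only builds the hypothetical key2; the retry through the analytical methods and the appending of '+fauxcpd' are left out. The function name is illustrative.

```python
def fauxcpd_key2(key2: str):
    """Return the hypothetical 'X-Y' for a key2 of the form 'X~Y', else None."""
    if key2.count("~") != 1 or "-" in key2 or "@" in key2:
        return None
    x, y = key2.split("~")
    # X and Y must be composed just of letters.
    if not (x.isalpha() and y.isalpha()):
        return None
    return f"{x}-{y}"
```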

3   1163    agnIzomIya  agnIzomIya  m:f:n   agnIzomIya  DONE    noparts
4   1163.1  agnIzomIyanirvApa   agnIzomIya~nirvApa  m   agnIzomIya-nirvApa  DONE    cpd1:+fauxcpd

In this example, the scan actually shows a '-', rather than the degree-sign which '~' is supposed to represent; so, we could say that this example is a coding error in MW.

1   28925   indra   indra   m   indra   DONE    noparts
3   28985   indrajAla   indra-jAla  n   indra-jAla  DONE    cpd1
3   28993   indrajAlika indra~jAlika    m   indra-jAlika    DONE    cpd1:+fauxcpd

In this case, the scan actually does have the degree-sign, so the '~' coding is faithful to the printing. This probably indicates that indrajAlika is considered to be derived, by a secondary suffix 'ika' from indrajAla, rather than by derivation as a compound of indra and jAlika. However, the typography of the text tends to obscure the relation to indrajAla, in my opinion. The H3 coding accurately reflects the typography, but not the derivational relation, in this case.

NOTE: It should be possible to extend the fauxcpd option to such cases as X~Y-Z, X~Y~Z, X-Y~Z, etc. There is currently no way of predicting how many new analyses would accrue.

funderburkjim commented 8 years ago

removesfx option code

Note that the indicatory sign here is +wsfx.

This option applies only to those cases where the headword (key1) spelling has the form XY, where Y is one of the known Whitney suffixes (as used in the wsfx method). If so, then there are two analytical possibilities:

1. The resulting 'X' is a headword. The derivation is then X + Y, and the reason code is '+wsfx1:Y'.

4   1154    agnIDra agnI@Dra    m   agnID+ra    DONE    +wsfx1:ra

Based on Whitney's description of the 'ra' suffix (as a short form of the comparative 'tara'), I think this derivation of agnIDra should be viewed with scepticism.

2. After removing Y from both key1 and key2, the resulting hypothetical record is successfully analyzed. Then +wsfx:Y is appended to the reason code.

1   1772    acakzus a-cakzus    n   a-cakzus    DONE    cpd_nan
2   1776    acakzuzka   a-cakzuzka  m:f:n   a-cakzuz+ka DONE    cpd_nan:+wsfx:ka

This analysis seems plausible. However, although the 'a-cakzuzka' form of key2 does agree with the printed text, wouldn't a better representation be acakzuz-ka (or maybe acakzuz~ka) ?

funderburkjim commented 8 years ago

I think all the analytical methods and optional codes have now been described in sufficient detail for this documentation comment, and will thus call this lengthy issue comment finished.

gasyoun commented 3 years ago

How could I ever forget this, @funderburkjim ? Only because you've done so many great things that I've lost half of them. Now, after almost 5 years, I must admit that it is of greatest interest still. Recently @Andhrabharati made me aware of http://sanskrit.jnu.ac.in/elearning/apte/shlexicon-lexicon.txt - it contains vyutpatti, derivation data based on Apte's dictionary that could be helpful, as it's not generated, but handmade around 70 years ago for the Sanskrit-Hindi edition of the Sanskrit-English dictionary. At https://groups.google.com/g/bvparishat/c/lMJQu3Zb_Vo I got https://scl.samsaadhanii.in/scl/dhaatupaatha/graphs/BU1.svg and https://scl.samsaadhanii.in/scl/dhaatupaatha/graphs/ that are hand-marked derivations based on Apte's data.

gasyoun commented 3 years ago

I'm looking at: https://github.com/funderburkjim/MWderivations/blob/master/step4/analysis2.txt

derivation For nouns, this indicates the derivation of the word. Empty for verbs.

For verbs with preverbs we could point out the connections? Because MW as I understand him does not state it in an explicit way and that is one of the reasons why there are no hyperlinks in between them.

TODO Nouns for which no derivation has been found.

5667 lines in 2021.
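A count like this can be rechecked with a few lines of Python. The sketch below assumes the 8 tab-delimited fields described at the top of this issue, with the status code in the seventh field; the function name is made up for the example, and lines with a different field count are simply skipped.

```python
from collections import Counter


def count_status(lines):
    """Tally the status code (7th tab-delimited field) of analysis2 records."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 8:
            continue  # not a well-formed record under the assumed layout
        counts[fields[6]] += 1
    return counts


# Two records quoted earlier in this issue, rebuilt with tab separators.
sample = [
    "\t".join(["1", "1772", "acakzus", "a-cakzus", "n", "a-cakzus", "DONE", "cpd_nan"]),
    "\t".join(["2", "1776", "acakzuzka", "a-cakzuzka", "m:f:n", "a-cakzuz+ka", "DONE", "cpd_nan:+wsfx:ka"]),
]
print(count_status(sample))
```

To run it on the real file, pass an open file object for step4/analysis2.txt instead of the sample list; the file's encoding is not stated here, so that would need checking.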

Keep in mind that some of the derivations may seem 'weak' upon close examination.

What if we compare them with Apte?

currently 220,265 records in these files. Note: There are 286812 records in mw.xml; all.txt skips records that are considered 'duplicates' in the context of this analysis2 investigation

Does 'skips' mean they will not have to be tagged at all? Or could the tags given to the 220k records be scaled to all 286k?

if the word is a special word (cardinal number or pronoun, type code starts with LEXID)

It's like an exclusion? So we have different sublists of exclusions?

ICF (in compound for word)

Without differentiating beginning or end?

Within the NTD group are 6851 NONE words; in some later stage of this work, these need to be examined and judicious enhancements made to MW to augment MW's omissions where possible.

@drdhaval2785 , please come out of the dark.

'~' I'm not sure what to call this. It occurs in a relatively few (16,000) cases in the current MW digitization. It represents an ellipsis character ° in the scans.

We had a discussion with Serge on it lately, remember, Jim?

Recognition of this as a source of misclassification led to the choice to do wsfx method first.

Have I ever told before that I absolutely love your approach?

This observation implies that it is possible that some of the remaining 8000 TODO cases may be due to a miscoding of key2.

Are those 8000 TODO cases still at 8000?

derivation field (third from last) shows what might be thought of as an 'improved' key2 for cpd1 cases; we may at some point want to include this improved key2 in the mw.xml file, as metadata that could be used to generate additional intra-headword links.

What can I do to move it forward? Additional intra-headword links are one of the things I really miss a lot in my everyday routines.
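As a very rough idea of how the derivation field might feed such links, here is a sketch. It assumes the parts of a derivation are separated by '-', '+' or '~' (the separators seen in the examples in this issue) and only links parts that match a headword exactly; the function name and the toy headword set are invented for the example, and sandhi forms are not resolved.

```python
import re


def link_targets(derivation, headwords):
    """Return the parts of a derivation that are themselves headwords."""
    parts = [p for p in re.split(r"[-+~]", derivation) if p]
    return [p for p in parts if p in headwords]


# Toy headword set, assumed for illustration only.
toy_headwords = {"a", "cakzus", "agnID"}
print(link_targets("agnID+ra", toy_headwords))
print(link_targets("a-cakzuz+ka", toy_headwords))
```

In the second call 'cakzuz' is not linked, because only the sandhi-free spelling 'cakzus' is in the toy set; a real implementation would need to map such forms back to their headwords.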

Note: The identification of gender variants is rather primitive; it may be that some of the TODO cases with H-code of form '1B,2B,3B,4B' would succeed with a more sophisticated analysis.

This is where I ask you to take a look at my https://github.com/funderburkjim/MWderivations/issues/11

we include as an additional implied headword the feminine form adfQA, even though this is not an explicit headword in MW.

How many such alternate and non-existing entries were required?

The reason this failed the cpd1 analysis is that pAtajAta is not a headword in MW.

It failed the cpd1 analysis but succeeded with cpd3, right?

This is handled by including 'aty' in the list of known prefixes

Tried to find the full list, but failed at https://github.com/funderburkjim/MWderivations/blob/e887f3a62c6af1299f3605d3540b8b7f56ebd974/step3/auxiliary/pfxderiv.py#L18

What is the 'right way' to think about the derivation of akAla ? I'm not sure.

@drdhaval2785 , please come out of the dark.

Observation of the pfx2 cases shows that most of them occur when the parent is a prefixed root, and the child (nikAmana) is an H3. This is quite a different relation between parent and child than the prototypical cpd1 relation mentioned by Monier in his description of 4 lines of words. There is much more to say about this.

And this is where the biggest fun actually starts, getting closer to the dhātu and sopasarga dhātu relationship that I'm so interested in.

if we can know the derivation of a word in terms of other words, then we have a strong indirect affirmation of the spelling of those words.

Exactly!

Note that the pfx1 analysis failed because that method would require

Absolutely amazing and so needed documentation of not only why it worked, but why the previous steps did not.

where there are 2 possible derivations - the Reason code is marked 'srs2?' to indicate the ambiguity of the derivation. Choosing between the derivations is beyond the scope of the analysis, and indeed I don't know how this may be done in a systematic way.

@drdhaval2785 , please come out of the dark.

in the pfxderv method, we are extending the headword list of MW to include all the Whitney derivatives

It's a gem.

pfx1 fails since 'parA' is not included among the known prefixes (Is this omission an error in the list of known prefixes?)

@drdhaval2785 , please come out of the dark.

NOTE: It might be the case that there are other sandhis that occur in compounds that have not so far been analyzed (status code = TODO) and for which additional subcases could be developed to provide analysis.

@drdhaval2785 , please come out of the dark.

indrajAlika is considered to be derived, by a secondary suffix 'ika' from indrajAla, rather than by derivation as a compound of indra and jAlika. However, the typography of the text tends to obscure the relation to indrajAla, in my opinion. The H3 coding accurately reflects the typography, but not the derivational relation, in this case.

@drdhaval2785 , please come out of the dark.

NOTE: It should be possible to extend the fauxcpd option to such cases as X~Y-Z, X~Y~X, X-Y~Z

What is missing for that next step?

Based on Whitney's description of the 'ra' suffix (as a short form of the comparative 'tara'), I think this derivation of agniDra should be viewed with scepticism.

@drdhaval2785 , please come out of the dark.

This analysis seems plausible. However, although the 'a-cakzuzka' form of key2 does agree with the printed text, wouldn't a better representation be acakzuz-ka (or maybe acakzuz~ka) ?

@drdhaval2785 , please come out of the dark.

Andhrabharati commented 3 years ago

How could I ever forget this, @funderburkjim ? Only because you've done so many great things that I've lost half of them. Now, after almost 5 years, I must admit that it is of greatest interest still. Recently @Andhrabharati made me aware of http://sanskrit.jnu.ac.in/elearning/apte/shlexicon-lexicon.txt - it contains vyutpatti, derivation data based on Apte's dictionary that could be helpful, as it's not generated, but handmade around 70 years ago for the Sanskrit-Hindi edition of the Sanskrit-English dictionary. At https://groups.google.com/g/bvparishat/c/lMJQu3Zb_Vo I got https://scl.samsaadhanii.in/scl/dhaatupaatha/graphs/BU1.svg and https://scl.samsaadhanii.in/scl/dhaatupaatha/graphs/ that are hand-marked derivations based on Apte's data.

This Skt-Hindi dictionary is a translation of the Apte1890 Skt-English dictionary. I need to look closer to see whether it is the larger Practical ed. or the somewhat smaller Student's ed.

And this has some extra details added (like this vyutpatti, and some annexures) benefitting the students, as the intro pages said.

This work was taken up as a competitor to Apte1957 (Prasad's revised ed.), by MLBD.

Also they (MLBD) had annexed some 10000 entries to Apte90 Practical ed., in their reprint, which I guess could be added to your Apte90 data.

Andhrabharati commented 3 years ago

This is the Student's ed. of Apte90, and has the annexure of 10000 new entries to MLBD Apte90 Practical ed., included in the Hindi translation.