drdhaval2785 / samasasplitter

Repository to split samAsa (compounds) in Sanskrit

Huet's lessons #1

Open gasyoun opened 8 years ago

gasyoun commented 8 years ago

First of all, I wanted to salute you for https://github.com/drdhaval2785/samasasplitter. How quick, and it can already handle sandhi! Huet and @mbykov have done a lot in this field lately, and I hope they will comment. @funderburkjim is out of the game, but that is no reason not to listen to what he thinks of it.

An 80k word-frequency list is included in https://github.com/gasyoun/SanskritLexicography/blob/0fb80a8de652e80eb5514d930289c0cc0588d85b/DCS_statistical_evaluation.htm It is parsed from http://kjc-fs-cluster.kjc.uni-heidelberg.de/dcs/index.php?contents=corpus and contains part of MW. Of less interest, but still possibly useful, is https://github.com/gasyoun/SanskritLexicography/blob/0fb80a8de652e80eb5514d930289c0cc0588d85b/DCS-Moniers-roots-w-references.html Please also see https://docs.google.com/document/d/11Z1snnew9a0eY96W5o-ZQ71Zve1WRjcOqfOFgagndy4/edit#heading=h.k0dxemsx30hk (questions I had after reading Gérard's emails):

Gérard Huet 08.02.14:

I have currently two kinds of suffix entries in my lexicon. 
Some are phonemic affixes used to indicate morphology, such as -na (even when it undergoes retroflexion when affixing).
Others are Paninian technical terms referring to generative morphology parameters, themselves often having little overlap with the final phonemic increment,
such as -cvi for inchoative compounds, -ktva for the -tva taddhita suffix under context condition k for constructing neuter abstracts of quality.
My goal is to replace progressively the approximate suffixes by more precise etymology indication, stating unambiguously the affixing operation.
I did this for k.rdanta constructions, at least completely for participles. This allowed me to replace eg
for \word{samucita} the approximate \desf{samuc}{-ita} by the precise \ppde{samuc} where my keyword ppde means (in French!) 
"passive past participle of". Thus I could give a unique scheme for all pp's, in -ta, -ita, -na, or whatever.
I also want to separate k.rt and taddhita suffixes. The latter are very numerous, and their productivity is unclear and non uniform.
I have worked out lately how to extend my machinery for automatic recognition of certain taddhita forms in order to parse long navya-nyaaya compounds.
Actually, an hour ago Arjuna, a student at UoH, just presented at SALA the result of joint research on this experiment, so I am very much into suffixes these days. 
If you are interested, you may play with the new "experimental" mode in my just released new V2.80 engine. 
In the reader page, set "Experiment" for Parser strength and "Word" for text, and you'll be able to parse compounds such as
hewuwAvacCexakAvacCinnahewvaXikaraNawAprawiyogikahewuwAvacCexakasambanXAvacCinnAXeyawAnirUpiwaviSeRaNawAviSeRasambanXena (in WX input).

Gérard Huet 21.03.14:

What should the entries of a dictionary be ?
In French, it is easy. Entries are all bare stems of words, which are assumed to be finite in number, plus a few exotic inflected forms, such as "yeux", the plural of "œil". For verbs, conjugated forms are not listed. There is a special book for conjugation, the Bescherelle, that lists all conjugation schemes and the verbs that belong to each class. Thus such a notion as "the longest word in the French language" makes sense; amazingly enough, it is "anticonstitutionnellement".
You have to work out for yourself that it is the adverb in -ment corresponding to the adjective "anticonstitutionnel", obtained by prefixing the opposite prefix anti-
to the adjective "constitutionnel", itself the adjective in -el giving the quality of substantive "constitution", itself the verbal action in -tion corresponding to
verb "constituer" (itself being obtained from pre-verb con- in front of an ancient "stituer" coming from Latin). 
This point of view is just ignoring the productive nature of morphology. For instance, a few years ago, a political figure used the word "bravitude" instead of
correct "bravoure" (braveness), and she was mocked as ignorant, even though "bravitude" is morphologically correct.
Now in Sanskrit we have to take care of productive morphology. For compounds, of course, but also for simple words, obtainable by complex morphological processes. This makes sense, since the grammar is very explicit about morphological formation, albeit in a specially complex way, using phonological
processes such as gu.na/v.rddhi and sandhi. The problem is how to reflect this information in a lexicon. Should we list pratyaayas? Does it make sense
to list such pratyaayas in lexicographic order, in reverse order, in frequency order, in whatever order ? Look at "kvasu", which is discussed in the
Harkare-PratyayaKosha Issues that you pointed out to me. It is not an aadeza, at least not of the kind that make alternate the bhuu/as roots.
It is a k.rt pratyaaya. It is used for forming the stem of the perfect participle, such as vidvas from root vid. This is stated here. :-)
Now if you look at the etymology of my entry vidvas, you see:  विद्वस् vidvas [ppft. vid_1] 
and not [k.rd(vid_1,kvasu)]. Note that here you need the whole grammar to tell you that ultimately vid-kvasu will compute into vidvas.
The "k" is not phonetic material, as part of some hypothetical morpheme "kvasu". It is a control argument to the computational process. 
Thus I indicate entry "kvasu" just as help for someone who wants to understand what this notion stands for in Paa.nini's grammar, but I keep implicit
in my etymological indication that ppft stem computation corresponds to k.rt kvasu; this is only needed for grammar specialists. Indeed even in my computer code
I do not use "kvasu", and the ppft stems are computed by cascades of morpho-phonetical processes which are not easily encodable into a simple notation.
Indeed often my participles are an abstraction over several pratyaaya affixes. You may look at the appendix of my COLING paper that tells in painful detail
how my computation of the future participle stem accounts for the set of suffixes {yat,kyap,.nyat}. This issue is complex. Paa.nini's grammar is a whole,
it is not a simply-connected set of modules. And it cannot be used stand-alone, you need the appropriate dhatupatha, and the ga.napatha as well. 
I have not studied Harkare's book, but it appears to me as a specialised Vyaakara.na work, assuming fine knowledge of all this grammar material,
and I do not see how to simply interleave it with a lexicon.
Take for instance :  काठकः indicated as mysterious in your Harkare issues pages. Harkare mentions suutra IV-2-46. If you look at this suutra, you find:
"After names of Vedic schools, (the suffixes that are valid to designate a collection of objects) are the same as the ones that denote a rule (relative to the relevant school, as an extension to suutras IV,3,126 ff.)" where I put in parens what is implicit from the anuv.rtti. 
Now it should be obvious that this is simply an example, corresponding to the school Ka.tha and stipulating that ka.thaka, denoting a rule of this school,
denotes also the adepts of the school of sage Ka.tha, author of Kaṭhopaniṣad. 
Personally, I would not venture in this Harkare book without the help of a pandit or at least of a scholar who has completely mastered Paninian processes and their nomenclature. It is like trying to understand a contemporary mathematics article without the appropriate training. 
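
To make concrete the point in the email above that the "k" of kvasu is a control argument rather than phonetic material, here is a minimal Python sketch. It is not Huet's code (the email says his engine does not use "kvasu" at all). The `Affix` class and `naive_attach` function are hypothetical names invented for this illustration. The split of kvasu into the markers k and u plus the residue "vas" follows the usual Paninian reading of marker letters. Plain concatenation happens to give the right stem for vid only because this form needs no further sound change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Affix:
    """Hypothetical record separating a technical affix name from its sound."""
    technical_name: str   # Paninian label, e.g. "kvasu"
    markers: frozenset    # control letters that steer the derivation, never pronounced
    phonemic_form: str    # the material that is actually attached


# kvasu: "k" and "u" are markers, "vas" is the phonemic residue.
KVASU = Affix("kvasu", frozenset({"k", "u"}), "vas")


def naive_attach(root: str, affix: Affix) -> str:
    """Concatenate root and residue. No sandhi, gu.na/v.rddhi or reduplication
    is applied, so this is only a toy, not a stem generator."""
    return root + affix.phonemic_form


print(naive_attach("vid", KVASU))
```

For most roots the perfect participle stem involves reduplication and other changes, so a function like this breaks down immediately, which is the email's point that the whole grammar is needed to compute the final form.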

@drdhaval2785 I would go for:
3.1 Frequency
3.2 sanhw2 occurrence
3.3 Word length (DEC)
3.4 Alphabetic order
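
One possible reading of the four criteria above is a cascade of tie-breakers, applied in the order listed. The Python sketch below assumes that reading, and also assumes that "DEC" means descending length. The `Candidate` type, its field names and the sample numbers are all invented for illustration; none of them come from samasasplitter or from the DCS data.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    word: str         # headword in SLP1 (invented sample data)
    frequency: int    # corpus frequency, e.g. from a word-frequency list
    in_sanhw2: bool   # whether the word occurs in sanhw2


def rank(candidates):
    """Sort by frequency (highest first), then sanhw2 occurrence (present
    first), then word length (longest first), then alphabetic order."""
    return sorted(
        candidates,
        key=lambda c: (-c.frequency, not c.in_sanhw2, -len(c.word), c.word),
    )


sample = [
    Candidate("deva", 120, True),
    Candidate("rAja", 120, True),
    Candidate("rAjan", 120, True),
    Candidate("devatA", 120, False),
    Candidate("nara", 300, False),
]
ranked = [c.word for c in rank(sample)]
print(ranked)  # ['nara', 'rAjan', 'deva', 'rAja', 'devatA']
```

Putting all four criteria into one tuple key keeps the priority order visible in a single place, so reordering them only means reordering the tuple.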

gasyoun commented 7 years ago

@drdhaval2785 are there any new lessons learned in this regard? Please let me know.

gasyoun commented 5 years ago

@drdhaval2785 it's dead, I understand. But should it be left so?

drdhaval2785 commented 5 years ago

I have felt the need for it in a current project. So hopefully it will not remain dead for long.

gasyoun commented 3 years ago

> So hopefully it will not remain dead for long.

Please take a look at https://github.com/funderburkjim/MWderivations/issues/14 I need to present a paper on it and need help on this 4000 word list, thanks.

drdhaval2785 commented 3 years ago

python split.py batchprocess/input.txt MW batchprocess/output.txt

From the code, it looks like this should work.