funderburkjim / MWderivations

Derivations of headwords in the Monier-Williams (1899) dictionary

Longest samāsa in MW #12

Open gasyoun opened 3 years ago

gasyoun commented 3 years ago

Is daśa—śata—kara-dhārin mfn. thousand-rayed (the moon), the longest samāsa?

How could I count it based on https://github.com/funderburkjim/MWderivations/blob/master/compounds/compounds.txt, @funderburkjim, or is this data lost at such a level?

funderburkjim commented 3 years ago

If we use the 'k2' field in mw_iast.txt, we may get samAsas with more padas. For instance, there appear to be 28 with 6 (or more) padas:

28 matches for "k2>.*[—-].*[—-].*[—-].*[—-].*[—-].*[—-]" in buffer: mw5_iast.txt
 241578:<L>71110.1<pc>1326,3<k1>catuḥṣaṣṭyupacāramānasapūjāstotrastotra<k2>catuḥ—ṣaṣṭy-upacāra-mānasa-pūjā-stotra-stotra<e>3
 263720:<L>77908<pc>415,1<k1>jaladharagarjitaghoṣasusvaranakṣatrarājasaṃkusumitābhijña<k2>jalá—dhara—garjita-ghoṣa-susvara-nakṣatra-rāja-saṃkusumitābhijña<e>4
 339722:<L>100997<pc>515,2<k1>dhāraṇīmukhasarvajagatpraṇidhisaṃdhāraṇagarbha<k2>dhāraṇī—mukha-sarva-jagat-praṇidhi-saṃdhāraṇa-garbha<e>3
 376723:<L>111830<pc>567,3<k1>nṛganṛpatipāṣāṇayajñayūpapraśasti<k2>nṛ́—ga—nṛpati-pāṣāṇa-yajña-yūpa-praśasti<e>4
 394367:<L>116900.28<pc>590,3<k1>parāmarśapūrvapakṣagranthadīdhitiṭīkā<k2>parā-marśa—pūrva-pakṣa-grantha—dīdhiti-ṭīkā<e>4
 471798:<L>140161.46<pc>708,2<k1>prāyaścittaśatadvayīśatadvayīprāyaścitta<k2>prāyaś—citta—śata-dvayī—śata-dvayī-prāyaścitta<e>4
 502934:<L>149622.40<pc>751,3<k1>bhāgavatapurāṇabhāvārthadīpikāprakaraṇakramasaṃgraha<k2>bhāgavata—purāṇa—bhāvārtha-dīpikā-prakaraṇa-krama-saṃgraha<e>4
 523740:<L>156063<pc>780,2<k1>madhuvanavrajavāsigosvāmiguṇaleśāṣṭaka<k2>mádhu—vana—vraja-vāsi-go-svāmi-guṇa-leśāṣṭaka<e>4
 536895:<L>160055<pc>797,3<k1>mahāpuruṣavidyāyāṃviṣṇurahasyekṣetrakāṇḍejagannāthamāhātmya<k2>mahā́—puruṣa—vidyāyāṃ viṣṇu-rahasye kṣetra-kāṇḍe jagan-nātha-māhātmya<e>4
 599587:<L>179090<pc>886,1<k1>rūpakavirājagosvāmiguṇaleśasūcakāṣṭaka<k2>rūpá—kavi-rāja-go-svāmi-guṇa-leśa-sūcakāṣṭaka<e>3
 599614:<L>179099<pc>886,1<k1>rūpagosvāmiguṇaleśasūcakanāmadaśaka<k2>rūpá—go-svāmi-guṇa-leśa-sūcaka-nāma-daśaka<e>4
 627784:<L>187922.1<pc>927,1<k1>varṣartumāsapakṣāhovelādeśapradeśavat<k2>varṣá—rtu—māsa-pakṣāho-velā-deśa-pradeśa-vat<e>4
 663084:<L>198692<pc>980,1<k1>vimalaprabhāsaśrītejorājagarbha<k2>vi-mala—prabhāsa-śrī-tejo-rāja-garbha<e>3
 674539:<L>202059<pc>997,2<k1>viṣayalaukikapratyakṣakāryakāraṇabhāvarahasya<k2>viṣaya—laukika-pratyakṣa-kārya-kāraṇa-bhāva-rahasya<e>3
 693853:<L>207959<pc>1027,1<k1>vaiśvānarapathikṛtapūrvakadarśasthālīpākaprayoga<k2>vaiśvānará—pathi-kṛta-pūrvaka-darśa-sthālī-pāka-prayoga<e>3
 742167:<L>223065.2<pc>1099,2<k1>śrīnivāsabrahmatantraparakālasvāmyaṣṭottaraśata<k2>śrī—nivāsa—brahma-tantra-para-kāla-svāmy-aṣṭottara-śata<e>4
 742716:<L>223232<pc>1100,1<k1>śrīvatsamuktikanandyāvartalakṣitapāṇipādatalatā<k2>śrī—vatsa—muktika-nandy-āvarta-lakṣita-pāṇi-pāda-tala-tā<e>4
 755282:<L>226988<pc>1120,2<k1>saṃsarpaddhvajinīvimardavilasaddhūlīmaya<k2>saṃ-sarpad-dhvajinī-vimarda-vilasad-dhūlī-maya<e>4
 758619:<L>227991<pc>1125,3<k1>saṃkaṣṭaharacaturthīvratakālanirṇaya<k2>saṃ-kaṣṭa—hara-caturthī-vrata-kāla-nirṇaya<e>3
 764153:<L>229615.32<pc>1134,3<k1>satpratipakṣapūrvapakṣagranthadīdhitiṭīkā<k2>sát—pratipakṣa—pūrva-pakṣa-grantha-dīdhiti-ṭīkā<e>4
 771283:<L>231718<pc>1145,1<k1>saṃdhivigrahayānadvaidhībhāvasamāśrayagrantha<k2>saṃ-dhí—vigraha—yāna-dvaidhībhāva-samāśraya-grantha<e>4
 779682:<L>234183<pc>1160,1<k1>samādhiyogarddhitapovidyāviraktimat<k2>sam-ādhi—yoga-rddhi-tapo-vidyā-virakti-mat<e>3
 789136:<L>237001<pc>1179,2<k1>sambhūtabhūrigajavājipadātisainya<k2>sam-bhūta—bhūri-gaja-vāji-padāti-sainya<e>3
 793439:<L>238311.05<pc>1185,3<k1>sarvatathāgatadharmavāṅniṣprapañcajñānamudrā<k2>sárva—tathāgata—dharma-vāṅ-niṣprapañca-jñāna-mudrā<e>4
 794006:<L>238449<pc>1186,2<k1>sarvapāparogaharaśatamānadāna<k2>sárva—pāpa-roga-hara-śata-māna-dāna<e>3
 798196:<L>239649.25<pc>1191,2<k1>savyabhicārapūrvapakṣagranthadīdhitiṭīkā<k2>sa—vyabhicāra—pūrva-pakṣa-grantha-dīdhiti-ṭīkā<e>4
 840757:<L>252810<pc>1250,1<k1>somadevaśrīkaralālabhairavapurapati<k2>sóma—deva—śrī-kara-lāla-bhairava-pura-pati<e>4
 857023:<L>257854.1<pc>1275,2<k1>svacchandabhaṭṭārakabṛhatpūjāpattrikāvidhi<k2>svá—cchanda—bhaṭṭā-raka-bṛhat-pūjā-pattrikā-vidhi<e>4
gasyoun commented 3 years ago

28 with 6 (or more) padas

Thanks. One day I want to add stats for each dictionary, such as how many entries in MW are:

  1. 1 pada entries
  2. 2 pada entries
  3. 3 pada entries
  4. 4 pada entries
  5. 5 pada entries
  6. 6 pada entries
  7. 7 pada entries - no such?
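A tally like the one listed above could be sketched in Python as follows. This is only a sketch under assumptions: it takes the 'k2' field to run from `<k2>` to the next `<`, and it counts the pieces between long dashes and hyphens as "padas", which overcounts wherever a hyphen only sets off a prefix (as in vi-mala). The names `k2_segments` and `tally_segments` are made up for illustration.

```python
import re
from collections import Counter

# Assumption: the k2 field runs from "<k2>" up to the next "<".
K2_FIELD = re.compile("<k2>([^<]*)")
# Separators used in k2: long dash (U+2014) and plain hyphen.
SEPARATOR = re.compile("[\u2014-]")


def k2_segments(line):
    """Return the number of separator-delimited pieces in the k2 field,
    or None if the line has no k2 field."""
    m = K2_FIELD.search(line)
    if m is None:
        return None
    return len(SEPARATOR.findall(m.group(1))) + 1


def tally_segments(lines):
    """Map each exact segment count to the number of lines having it."""
    counts = Counter()
    for line in lines:
        n = k2_segments(line)
        if n is not None:
            counts[n] += 1
    return counts


if __name__ == "__main__":
    # Two header lines quoted in this thread, plus a line without k2.
    sample = [
        "<L>238449<pc>1186,2<k1>sarvapāparogaharaśatamānadāna"
        "<k2>sárva—pāpa-roga-hara-śata-māna-dāna<e>3",
        "<L>198692<pc>980,1<k1>vimalaprabhāsaśrītejorājagarbha"
        "<k2>vi-mala—prabhāsa-śrī-tejo-rāja-garbha<e>3",
        "not a header line",
    ]
    for n, count in sorted(tally_segments(sample).items()):
        print(n, "segments:", count)
```

Unlike a regex search for "at least n separators", this counts each line once under its exact number of segments, so the buckets do not overlap.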

k2>.*[—-].*[—-].*[—-].*[—-].*[—-].*[—-]

Elisp Regex syntax? Could not emulate in EmEditor.

funderburkjim commented 3 years ago

Elisp Regex syntax?

No, AFAIK, this is a 'normal' regex syntax.

Here is a Python program that does the same thing:

#-*- coding:utf-8 -*-
"""filter.py

Write to the output file every line of the input file whose text after
'k2>' contains at least six separators (long dash or hyphen).
"""
from __future__ import print_function
import sys, re, codecs

if __name__ == "__main__":
    filein = sys.argv[1]   # xxx.txt (path to digitization of xxx)
    fileout = sys.argv[2]  # results of filter
    # The u prefix keeps the long dash a single character under Python 2 too
    regexraw = u"k2>.*[—-].*[—-].*[—-].*[—-].*[—-].*[—-]"
    regex = re.compile(regexraw)
    matches = []
    # read the input and keep the matching lines
    with codecs.open(filein, "r", "utf-8") as f:
        for line in f:
            line = line.rstrip('\r\n')
            if regex.search(line) is not None:
                matches.append(line)
    # write the matches
    with codecs.open(fileout, "w", "utf-8") as fout:
        for line in matches:
            fout.write(line + '\n')
    print(len(matches), "matches found")

Run the program in a terminal: python filter.py mw.txt filter.txt

You get 28 matches found, and filter.txt contains the examples.

Give it a try!

Note: This program could serve as a model for many similar investigations, just by changing 'regexraw'.
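As one illustration of changing 'regexraw', the pattern can be generated for any number of separators instead of being typed out by hand. The helper name `separator_regex` is hypothetical; for n = 6 it builds the same pattern text that filter.py uses.

```python
import re


def separator_regex(n):
    """Build the pattern 'k2>' followed by n copies of '.*[—-]'.

    Used with search(), the pattern matches lines whose text after
    'k2>' contains at least n separators (long dash or hyphen); it
    does not require exactly n.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return "k2>" + ".*[—-]" * n


if __name__ == "__main__":
    # A header line quoted earlier in this thread (6 separators in k2).
    line = ("<L>238449<pc>1186,2<k1>sarvapāparogaharaśatamānadāna"
            "<k2>sárva—pāpa-roga-hara-śata-māna-dāna<e>3")
    for n in (6, 7):
        found = re.search(separator_regex(n), line) is not None
        print(n, "separators or more:", found)
```

Note that `.*` is not confined to the k2 field: it can run on to the end of the line, so a hyphen in a later field would also be counted.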

gasyoun commented 3 years ago

No, AFAIK, this is a 'normal' regex syntax.

It failed because I searched in mw.xml, which I believed to be nearly identical to mw.txt; in this regard it is not. It now works well with mw.txt in EmEditor (my default text editor) and Notepad++. Thanks for the hint.

regex                                                      entries  padas
k2>.*[—-].*[—-].*[—-].*[—-].*[—-].*[—-].*[—-].*[—-].*[—-]        1  10
k2>.*[—-].*[—-].*[—-].*[—-].*[—-].*[—-].*[—-].*[—-]              1  9
k2>.*[—-].*[—-].*[—-].*[—-].*[—-].*[—-].*[—-]                   11  8
k2>.*[—-].*[—-].*[—-].*[—-].*[—-].*[—-]                         28  7
k2>.*[—-].*[—-].*[—-].*[—-].*[—-]                              122  6
k2>.*[—-].*[—-].*[—-].*[—-]                                    555  5
k2>.*[—-].*[—-].*[—-]                                         3532  4
k2>.*[—-].*[—-]                                              30591  3
k2>.*[—-]                                                   182449  2
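
Because each pattern only requires at least that many separators, the entry counts in this table appear to be cumulative: the 28 lines found for 7 padas would include the 11 found for 8, and so on. Assuming that reading, exact counts follow by subtracting from each row the count of the next-longer pattern. The sketch below does the subtraction on the numbers from the table; `exact_counts` is a hypothetical helper name.

```python
# Counts copied from the table: padas -> lines matching "at least" that many.
AT_LEAST = {
    2: 182449,
    3: 30591,
    4: 3532,
    5: 555,
    6: 122,
    7: 28,
    8: 11,
    9: 1,
    10: 1,
}


def exact_counts(at_least):
    """Turn 'at least n' counts into 'exactly n' counts by differencing."""
    exact = {}
    for n in sorted(at_least):
        # Lines with at least n+1 are also counted under n; remove them.
        exact[n] = at_least[n] - at_least.get(n + 1, 0)
    return exact


if __name__ == "__main__":
    for n, count in sorted(exact_counts(AT_LEAST).items()):
        print(n, "padas:", count)
```

Under this reading, 17 entries have exactly 7 segments (28 - 11), and the single match for 9 is the same line as the single match for 10.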

MW-2-pada-samasa-list-182449-entries.txt
MW-3-pada-samasa-list-30591-entries.txt
MW-4-pada-samasa-list-3532-entries.txt
MW-5-pada-samasa-list-555-entries.txt
MW-6-pada-samasa-list-122-entries.txt
MW-7-pada-samasa-list-28-entries.txt
MW-8-pada-samasa-list-11-entries.txt
MW-9-pada-samasa-list-1-entry.txt
MW-10-pada-samasa-list-1-entry.txt

funderburkjim commented 3 years ago

These files have some funny characters in them that are not in mw.txt.

The funny characters appear to be related to the 'long dash' character in mw.txt.

Could this be an encoding problem (maybe utf-8 not in force in editor) ?

gasyoun commented 3 years ago

Could this be an encoding problem (maybe utf-8 not in force in editor) ?

Looks like it.
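
One way to test the encoding hypothesis is sketched below. It is an assumption, not something confirmed in this thread, that the funny characters come from UTF-8 bytes being read as a one-byte Windows encoding; cp1252 is used only as a plausible example, and the function names are hypothetical. The sketch shows what the long dash turns into under that misreading and checks whether raw bytes are valid UTF-8.

```python
LONG_DASH = "\u2014"  # the 'long dash' separator used in k2


def misread(text, wrong_encoding="cp1252"):
    """Encode text as UTF-8, then decode the bytes with another encoding,
    imitating an editor that does not treat the file as UTF-8.
    Bytes undefined in the wrong encoding become U+FFFD."""
    return text.encode("utf-8").decode(wrong_encoding, errors="replace")


def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    garbled = misread(LONG_DASH)
    # One long dash becomes three characters under the wrong encoding.
    print(len(LONG_DASH), "->", len(garbled), repr(garbled))
    # The same character saved as UTF-8 and as cp1252 bytes.
    print(is_valid_utf8(LONG_DASH.encode("utf-8")))
    print(is_valid_utf8(LONG_DASH.encode("cp1252")))
```

To check one of the attached lists, the file could be opened in binary mode and its bytes passed to `is_valid_utf8`.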