Open gasyoun opened 3 years ago
If we use the 'k2' field in mw_iast.txt, we may get samAsas with more padas. For instance, there appear to be 28 with 6 (or more) separators, that is, 7 (or more) padas:
28 matches for "k2>.*[—-].*[—-].*[—-].*[—-].*[—-].*[—-]" in buffer: mw5_iast.txt
241578:<L>71110.1<pc>1326,3<k1>catuḥṣaṣṭyupacāramānasapūjāstotrastotra<k2>catuḥ—ṣaṣṭy-upacāra-mānasa-pūjā-stotra-stotra<e>3
263720:<L>77908<pc>415,1<k1>jaladharagarjitaghoṣasusvaranakṣatrarājasaṃkusumitābhijña<k2>jalá—dhara—garjita-ghoṣa-susvara-nakṣatra-rāja-saṃkusumitābhijña<e>4
339722:<L>100997<pc>515,2<k1>dhāraṇīmukhasarvajagatpraṇidhisaṃdhāraṇagarbha<k2>dhāraṇī—mukha-sarva-jagat-praṇidhi-saṃdhāraṇa-garbha<e>3
376723:<L>111830<pc>567,3<k1>nṛganṛpatipāṣāṇayajñayūpapraśasti<k2>nṛ́—ga—nṛpati-pāṣāṇa-yajña-yūpa-praśasti<e>4
394367:<L>116900.28<pc>590,3<k1>parāmarśapūrvapakṣagranthadīdhitiṭīkā<k2>parā-marśa—pūrva-pakṣa-grantha—dīdhiti-ṭīkā<e>4
471798:<L>140161.46<pc>708,2<k1>prāyaścittaśatadvayīśatadvayīprāyaścitta<k2>prāyaś—citta—śata-dvayī—śata-dvayī-prāyaścitta<e>4
502934:<L>149622.40<pc>751,3<k1>bhāgavatapurāṇabhāvārthadīpikāprakaraṇakramasaṃgraha<k2>bhāgavata—purāṇa—bhāvārtha-dīpikā-prakaraṇa-krama-saṃgraha<e>4
523740:<L>156063<pc>780,2<k1>madhuvanavrajavāsigosvāmiguṇaleśāṣṭaka<k2>mádhu—vana—vraja-vāsi-go-svāmi-guṇa-leśāṣṭaka<e>4
536895:<L>160055<pc>797,3<k1>mahāpuruṣavidyāyāṃviṣṇurahasyekṣetrakāṇḍejagannāthamāhātmya<k2>mahā́—puruṣa—vidyāyāṃ viṣṇu-rahasye kṣetra-kāṇḍe jagan-nātha-māhātmya<e>4
599587:<L>179090<pc>886,1<k1>rūpakavirājagosvāmiguṇaleśasūcakāṣṭaka<k2>rūpá—kavi-rāja-go-svāmi-guṇa-leśa-sūcakāṣṭaka<e>3
599614:<L>179099<pc>886,1<k1>rūpagosvāmiguṇaleśasūcakanāmadaśaka<k2>rūpá—go-svāmi-guṇa-leśa-sūcaka-nāma-daśaka<e>4
627784:<L>187922.1<pc>927,1<k1>varṣartumāsapakṣāhovelādeśapradeśavat<k2>varṣá—rtu—māsa-pakṣāho-velā-deśa-pradeśa-vat<e>4
663084:<L>198692<pc>980,1<k1>vimalaprabhāsaśrītejorājagarbha<k2>vi-mala—prabhāsa-śrī-tejo-rāja-garbha<e>3
674539:<L>202059<pc>997,2<k1>viṣayalaukikapratyakṣakāryakāraṇabhāvarahasya<k2>viṣaya—laukika-pratyakṣa-kārya-kāraṇa-bhāva-rahasya<e>3
693853:<L>207959<pc>1027,1<k1>vaiśvānarapathikṛtapūrvakadarśasthālīpākaprayoga<k2>vaiśvānará—pathi-kṛta-pūrvaka-darśa-sthālī-pāka-prayoga<e>3
742167:<L>223065.2<pc>1099,2<k1>śrīnivāsabrahmatantraparakālasvāmyaṣṭottaraśata<k2>śrī—nivāsa—brahma-tantra-para-kāla-svāmy-aṣṭottara-śata<e>4
742716:<L>223232<pc>1100,1<k1>śrīvatsamuktikanandyāvartalakṣitapāṇipādatalatā<k2>śrī—vatsa—muktika-nandy-āvarta-lakṣita-pāṇi-pāda-tala-tā<e>4
755282:<L>226988<pc>1120,2<k1>saṃsarpaddhvajinīvimardavilasaddhūlīmaya<k2>saṃ-sarpad-dhvajinī-vimarda-vilasad-dhūlī-maya<e>4
758619:<L>227991<pc>1125,3<k1>saṃkaṣṭaharacaturthīvratakālanirṇaya<k2>saṃ-kaṣṭa—hara-caturthī-vrata-kāla-nirṇaya<e>3
764153:<L>229615.32<pc>1134,3<k1>satpratipakṣapūrvapakṣagranthadīdhitiṭīkā<k2>sát—pratipakṣa—pūrva-pakṣa-grantha-dīdhiti-ṭīkā<e>4
771283:<L>231718<pc>1145,1<k1>saṃdhivigrahayānadvaidhībhāvasamāśrayagrantha<k2>saṃ-dhí—vigraha—yāna-dvaidhībhāva-samāśraya-grantha<e>4
779682:<L>234183<pc>1160,1<k1>samādhiyogarddhitapovidyāviraktimat<k2>sam-ādhi—yoga-rddhi-tapo-vidyā-virakti-mat<e>3
789136:<L>237001<pc>1179,2<k1>sambhūtabhūrigajavājipadātisainya<k2>sam-bhūta—bhūri-gaja-vāji-padāti-sainya<e>3
793439:<L>238311.05<pc>1185,3<k1>sarvatathāgatadharmavāṅniṣprapañcajñānamudrā<k2>sárva—tathāgata—dharma-vāṅ-niṣprapañca-jñāna-mudrā<e>4
794006:<L>238449<pc>1186,2<k1>sarvapāparogaharaśatamānadāna<k2>sárva—pāpa-roga-hara-śata-māna-dāna<e>3
798196:<L>239649.25<pc>1191,2<k1>savyabhicārapūrvapakṣagranthadīdhitiṭīkā<k2>sa—vyabhicāra—pūrva-pakṣa-grantha-dīdhiti-ṭīkā<e>4
840757:<L>252810<pc>1250,1<k1>somadevaśrīkaralālabhairavapurapati<k2>sóma—deva—śrī-kara-lāla-bhairava-pura-pati<e>4
857023:<L>257854.1<pc>1275,2<k1>svacchandabhaṭṭārakabṛhatpūjāpattrikāvidhi<k2>svá—cchanda—bhaṭṭā-raka-bṛhat-pūjā-pattrikā-vidhi<e>4
Thanks, I want to add stats for each dictionary one day. Like how many in MW are:
k2>.*[—-].*[—-].*[—-].*[—-].*[—-].*[—-]
Elisp Regex syntax? Could not emulate in EmEditor.
No, AFAIK, this is a 'normal' regex syntax.
Here is a Python program which does the same thing:
#-*- coding:utf-8 -*-
"""filter.py
Write every line of a digitization that matches a regex to an output file.
"""
from __future__ import print_function
import sys, re, codecs

if __name__ == "__main__":
    filein = sys.argv[1]   # xxx.txt (path to digitization of xxx)
    fileout = sys.argv[2]  # results of filter
    # u"" keeps the long dash a single character under Python 2 as well
    regexraw = u"k2>.*[—-].*[—-].*[—-].*[—-].*[—-].*[—-]"
    regex = re.compile(regexraw)
    matches = []
    with codecs.open(filein, "r", "utf-8") as f:
        for line in f:
            line = line.rstrip('\r\n')
            if regex.search(line) is not None:
                matches.append(line)
    # write the matches
    with codecs.open(fileout, "w", "utf-8") as fout:
        for line in matches:
            fout.write(line + '\n')
    print(len(matches), "matches found")
Run the program in a terminal:
python filter.py mw.txt filter.txt
You get 28 matches found, and filter.txt contains the examples.
Give it a try!
Note: This program could be a model for many similar investigations, just by changing 'regexraw'.
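As a rough sketch of what "just changing regexraw" could look like, the pattern can be built from the number of separators wanted. The helper names `build_regex` and `count_matches` and the three sample lines are made up for illustration; they are not part of filter.py or of mw.txt.

```python
import re

SEPARATOR_CLASS = "[—-]"  # long dash or hyphen, as in the regex above


def build_regex(n_separators):
    """Compile a regex matching a k2 field with at least n_separators separators."""
    return re.compile("k2>" + (".*" + SEPARATOR_CLASS) * n_separators)


def count_matches(lines, n_separators):
    """Count the lines matched by the regex for n_separators."""
    regex = build_regex(n_separators)
    return sum(1 for line in lines if regex.search(line))


# Hypothetical metalines, only to exercise the functions.
sample = [
    "<L>1<pc>1,1<k1>abc<k2>a—b-c<e>3",
    "<L>2<pc>1,1<k1>ab<k2>a—b<e>3",
    "<L>3<pc>1,1<k1>a<k2>a<e>1",
]

for n in (1, 2, 3):
    print(n, "or more separators:", count_matches(sample, n))
```

With `build_regex(6)` the pattern is the same string as 'regexraw' in filter.py, so the loop in that program could be reused unchanged.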
The search failed because I ran it on mw.xml, which I believed to be practically identical to mw.txt; in this regard it is not. It now works quite well on mw.txt in EmEditor (my default text editor) and in Notepad++, thanks for the hint.
Each regex matches a k2 field with at least that many separators, so every count also includes the longer entries:
k2>.*[—-].*[—-].*[—-].*[—-].*[—-].*[—-].*[—-].*[—-].*[—-] 1 entry, 10 or more padas
k2>.*[—-].*[—-].*[—-].*[—-].*[—-].*[—-].*[—-].*[—-] 1 entry, 9 or more padas
k2>.*[—-].*[—-].*[—-].*[—-].*[—-].*[—-].*[—-] 11 entries, 8 or more padas
k2>.*[—-].*[—-].*[—-].*[—-].*[—-].*[—-] 28 entries, 7 or more padas
k2>.*[—-].*[—-].*[—-].*[—-].*[—-] 122 entries, 6 or more padas
k2>.*[—-].*[—-].*[—-].*[—-] 555 entries, 5 or more padas
k2>.*[—-].*[—-].*[—-] 3532 entries, 4 or more padas
k2>.*[—-].*[—-] 30591 entries, 3 or more padas
k2>.*[—-] 182449 entries, 2 or more padas
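Because each of these regexes matches "at least N separators", the counts are cumulative. A possible way to get exact counts is to pull out the k2 field and count its separators directly, as in this sketch. It assumes the k2 value runs up to the next `<`, which holds for the metalines quoted in this thread; the function names are invented here. The three sample lines are copied from the 28 matches listed earlier.

```python
import re
from collections import Counter

K2_FIELD = re.compile(r"<k2>([^<]*)")  # assumption: k2 ends at the next tag
SEPARATOR = re.compile("[—-]")         # long dash or hyphen


def pada_histogram(lines):
    """Map 'number of separator-delimited parts' to 'number of entries'."""
    hist = Counter()
    for line in lines:
        m = K2_FIELD.search(line)
        if m:
            hist[len(SEPARATOR.findall(m.group(1))) + 1] += 1
    return hist


def at_least(hist, n):
    """Cumulative count: entries with n or more parts."""
    return sum(count for padas, count in hist.items() if padas >= n)


sample = [
    "<L>226988<pc>1120,2<k1>saṃsarpaddhvajinīvimardavilasaddhūlīmaya<k2>saṃ-sarpad-dhvajinī-vimarda-vilasad-dhūlī-maya<e>4",
    "<L>223065.2<pc>1099,2<k1>śrīnivāsabrahmatantraparakālasvāmyaṣṭottaraśata<k2>śrī—nivāsa—brahma-tantra-para-kāla-svāmy-aṣṭottara-śata<e>4",
    "<L>223232<pc>1100,1<k1>śrīvatsamuktikanandyāvartalakṣitapāṇipādatalatā<k2>śrī—vatsa—muktika-nandy-āvarta-lakṣita-pāṇi-pāda-tala-tā<e>4",
]

hist = pada_histogram(sample)
for padas in sorted(hist):
    print(padas, "exactly:", hist[padas], "| or more:", at_least(hist, padas))
```

Note that a hyphen after a prefix (as in saṃ-sarpad) is counted like any other separator, so these numbers only approximate the number of padas.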
MW-2-pada-samasa-list-182449-entries.txt
MW-3-pada-samasa-list-30591-entries.txt
MW-4-pada-samasa-list-3532-entries.txt
MW-5-pada-samasa-list-555-entries.txt
MW-6-pada-samasa-list-122-entries.txt
MW-7-pada-samasa-list-28-entries.txt
MW-8-pada-samasa-list-11-entries.txt
MW-9-pada-samasa-list-1-entry.txt
MW-10-pada-samasa-list-1-entry.txt
These files contain some funny characters that are not in mw.txt.
The funny characters appear to be related to the 'long dash' character in mw.txt.
Could this be an encoding problem (maybe UTF-8 was not in force in the editor)?
Looks like.
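To illustrate the kind of damage an encoding mismatch does to the long dash, here is a small sketch. It assumes, without evidence from the attached files, that the culprit was UTF-8 text being read as Windows-1252; the helpers `write_utf8` and `read_utf8` are names invented for this example.

```python
import os
import tempfile

EM_DASH = "\u2014"  # the 'long dash' used in the k2 field

# UTF-8 bytes of the long dash, wrongly decoded as Windows-1252 (assumed culprit).
garbled = EM_DASH.encode("utf-8").decode("cp1252")
print(repr(garbled))

# The damage is reversible as long as the garbled text was not edited further.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == EM_DASH)


def write_utf8(path, lines):
    """Write lines with an explicit encoding instead of the platform default."""
    with open(path, "w", encoding="utf-8") as fout:
        for line in lines:
            fout.write(line + "\n")


def read_utf8(path):
    """Read lines back with the same explicit encoding."""
    with open(path, "r", encoding="utf-8") as fin:
        return [line.rstrip("\r\n") for line in fin]


path = os.path.join(tempfile.mkdtemp(), "filter.txt")
write_utf8(path, ["daśa—śata—kara-dhārin"])
print(read_utf8(path) == ["daśa—śata—kara-dhārin"])
```

If the list files were saved from an editor, choosing UTF-8 explicitly in its save dialog should have the same effect as the `encoding="utf-8"` argument here.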
Is
daśa—śata—kara-dhārin mfn. thousand-rayed (the moon),
the longest samāsa? Based on https://github.com/funderburkjim/MWderivations/blob/master/compounds/compounds.txt how could I count it, @funderburkjim, or is this data lost at such a level?
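One rough way to approach the "longest" question is to compare k2 values by the number of parts between separators. The sketch below does not read compounds.txt, whose layout is not described in this thread; it works on k2 strings only, and `segments` and `longest` are names invented here. Two of the candidates are taken from the 28 matches listed at the top.

```python
import re

SEPARATOR = re.compile("[—-]")  # long dash or hyphen


def segments(k2):
    """Split a k2 value at long dashes and hyphens, dropping empty parts."""
    return [part for part in SEPARATOR.split(k2) if part]


def longest(k2_values):
    """Return the k2 value with the most separator-delimited parts."""
    return max(k2_values, key=lambda k2: len(segments(k2)))


candidates = [
    "daśa—śata—kara-dhārin",
    "śrī—nivāsa—brahma-tantra-para-kāla-svāmy-aṣṭottara-śata",
    "śrī—vatsa—muktika-nandy-āvarta-lakṣita-pāṇi-pāda-tala-tā",
]

for k2 in candidates:
    print(len(segments(k2)), k2)
print("longest:", longest(candidates))
```

By this measure daśa—śata—kara-dhārin has 4 parts, while the śrī—vatsa entry has 10. Whether every hyphen really marks a pada boundary is a separate question that this count cannot answer.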