prefix and suffix data in MW key2

drdhaval2785 commented 8 years ago

Per https://github.com/drdhaval2785/samasasplitter/issues/2#issuecomment-166070828 @gasyoun wants a list of the prefixes and suffixes in MW for his purposes. I will try to write a small script for this. It may come in useful for the splitter too.

For example, a or A would be prefixoids in most cases. Right now we ignore single-letter parts, but I guess we can allow them as prefixes.

Code modification is not that easy.

drdhaval2785 commented 8 years ago

Per Gasyoun

Not quite. a- is a prefix (= preverb), a real one. Prefixoid: (linguistics) a word-initial segment that does not have all the characteristics of a prefix. It is a pseudoprefix, never a real one. /aMSa from the above list is a prefixoid: many samasas start with it. The same goes for akzi, aDara, aneka and hundreds more.

gasyoun commented 8 years ago

Let me explain, @drdhaval2785, and let me know, @funderburkjim, if I'm clear enough. There are prefixes and prefixoids. I would love to know the stats: which prefixes are more popular. For example, vi (2450) is used about 5 times less often than sam (12820).

vinAma  vi  nAma        
vinAmaka    vi  nAmaka      
vinAmikA    vi  nAmikA      
vinAyaka    vi  nAyaka      
vinAyakacaturTI vi  nAyaka  caturTI 
vinAyakacaturTIvrata    vi  nAyaka  caturTI vrata
vinAyakacarita  vi  nAyaka  carita  

samunnamana sam unnamana    
samunnaya   sam unnaya  
samunnasa   sam unnasa  
samunnAha   sam unnAha  
samunnidra  sam unnidra 
samunmajj   sam un  majj
samunmiSra  sam unmiSra 

niryUza nir yUza
niryUha nir yUha
niryogakzema    nir yoga
niryola nir yola
nirlakzaRa  nir lakzaRa
nirlakzya   nir lakzya
nirlajja    nir lajja
nirlajjatA  nir lajja
nirlayanI   nir layanI
nirlavaRa   nir lavaRa

upasfjya    upa sfjya   
upasfta upa sfta    
upasftavat  upa sfta    vat
upasftya    upa sftya   
upasfpta    upa sfpta
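
The prefix statistics asked for above could be gathered roughly as follows. This is only a minimal sketch: it assumes each row is whitespace-separated with the headword first and its parts after it, as in the samples, and the names `parse_line` and `first_part_counts` are made up for illustration, not taken from the samasasplitter code.

```python
from collections import Counter


def parse_line(line):
    """Split one row into (headword, parts); return None for blank or partless rows."""
    fields = line.split()
    if len(fields) < 2:
        return None
    return fields[0], fields[1:]


def first_part_counts(lines):
    """Count how often each first part occurs across all rows."""
    counts = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        _, parts = parsed
        counts[parts[0]] += 1
    return counts


# A few rows copied from the samples above (assumed input format).
sample = """\
vinAma  vi  nAma
vinAyaka    vi  nAyaka
samunnaya   sam unnaya
samunmajj   sam un  majj
niryUha nir yUha
upasfta upa sfta
upasftavat  upa sfta    vat
"""

counts = first_part_counts(sample.splitlines())
for part, n in counts.most_common():
    print(part, n)
```

Run over the whole file, the same counter should give the kind of totals quoted for vi and sam.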

What I want is everything that is left after we filter out prefixed words (words that have an upasarga: प्र, परा, अप, सम्‌, अनु, अव, निस्‌, निर्‌, दुस्‌, दुर्‌, वि, आ (आङ्‌), नि, अधि, अपि, अति, सु, उत् / उद्‌, अभि, प्रति, परि and उप, or one of its variations) and that is longer than 1 element (I am not sure what a word like yuga is doing in this samasas file, but we do not need such words either).

So

cittaBU citta   BU  
cittaBUmi   citta   BUmi    
cittaBeda   citta   Beda    
cittaBrama  citta   Brama   
cittaBramacikitsA   citta   Brama   cikitsA
cittaBrAnti citta   BrAnti  

citrakuRqala    citra   kuRqala 
citrakuzWa  citra   kuzWa   
citrakUwa   citra   kUwa    
citrakUwamAhAtmya   citra   kUwa    mAhAtmya
citrakUwayAtrA  citra   kUwa    yAtrA

These have citta and citra as prefixoids; many words are built on such a model.
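
The filter described above might look something like this. It is a sketch under assumptions: the upasarga list is my SLP1 transliteration of the Devanagari list, sandhi variations are not handled at all, and the one-element `yuga` row is a hypothetical stand-in for how such entries appear in the file. `is_prefixoid_entry` is an illustrative name.

```python
# The upasargas listed above, transliterated to SLP1.
# Assumption: "variations" (sandhi forms) are NOT covered here.
UPASARGAS = {
    "pra", "parA", "apa", "sam", "anu", "ava", "nis", "nir",
    "dus", "dur", "vi", "A", "ni", "aDi", "api", "ati", "su",
    "ut", "ud", "aBi", "prati", "pari", "upa",
}


def is_prefixoid_entry(parts):
    """Keep entries with more than one element whose first part is not an upasarga."""
    return len(parts) > 1 and parts[0] not in UPASARGAS


# (headword, parts) pairs; the yuga row is a hypothetical one-element entry.
rows = [
    ("vinAyaka", ["vi", "nAyaka"]),
    ("yuga", ["yuga"]),
    ("cittaBU", ["citta", "BU"]),
    ("citrakUwa", ["citra", "kUwa"]),
    ("samunmajj", ["sam", "un", "majj"]),
]

kept = [(head, parts) for head, parts in rows if is_prefixoid_entry(parts)]
for head, parts in kept:
    print(head, parts[0])
```

Only the citta and citra rows survive here; the vi and sam rows are dropped as prefixed words and yuga as a single element.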

vrAtamaya   vrAta   maya
vrIhimaya   vrIhi   maya
Sakamaya    Saka    maya
Saktimaya   Sakti   maya
SaNkAmaya   SaNkA   maya

ekaSilA eka SilA
ekaSIla eka SIla
ekAntaSIla  ekAnta  SIla
evaMSIla    evaM    SIla
evaMSIlasamAcAra    evaM    SIla
kamalaSIla  kamala  SIla
karmaSIla   karma   SIla
kAkaSilA    kAka    SilA

These have maya and SIla as suffixoids.

Such a list of building blocks (with up to 3 samples per case) would be an appendix to the reverse dictionary that linguists could use.
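
A list like that, with at most 3 samples per building block, could be put together along these lines. Again only a sketch: it assumes the rows have already been parsed into (headword, parts) pairs and filtered for upasargas, and `collect_samples` with its `position` and `max_samples` parameters is a hypothetical helper, not existing code.

```python
from collections import defaultdict


def collect_samples(rows, position, max_samples=3):
    """Group headwords by the part at `position` (0 = first, -1 = last),
    keeping at most `max_samples` headwords per building block."""
    samples = defaultdict(list)
    for head, parts in rows:
        if len(parts) < 2:
            continue  # single-element entries are not needed
        block = parts[position]
        if len(samples[block]) < max_samples:
            samples[block].append(head)
    return dict(samples)


# Rows copied from the samples above.
rows = [
    ("vrAtamaya", ["vrAta", "maya"]),
    ("vrIhimaya", ["vrIhi", "maya"]),
    ("Sakamaya", ["Saka", "maya"]),
    ("Saktimaya", ["Sakti", "maya"]),
    ("ekaSIla", ["eka", "SIla"]),
    ("karmaSIla", ["karma", "SIla"]),
]

suffixoids = collect_samples(rows, position=-1)
prefixoids = collect_samples(rows, position=0)
for block, heads in suffixoids.items():
    print(block, ", ".join(heads))
```

Taking the last part gives suffixoid candidates and the first part gives prefixoid candidates; a frequency threshold would still be needed to separate real building blocks from one-off parts.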

drdhaval2785 / samasasplitter

prefix and suffix data in MW key2 #4