drdhaval2785 / SanskritSorting

Codes written by Dr. Dhaval Patel for Sanskrit Natural Language Programming

#asūy sorted above #as #42

Closed gasyoun closed 9 years ago

gasyoun commented 9 years ago

@drdhaval2785 please help. http://gasyoun.github.io/296.txt is the IAST source file; based on it, I wanted to sort http://gasyoun.github.io/296-SLP1.txt

Not only is the a group split again because of some unknown, invisible spaces, but, even stranger, #asūy is sorted above #as.

| a |
#ac — 137#
#añj — 79#
#aṇ — 118, 144#
#atī — 138, 238#
#apekṣ — 172#

|  |
#abhī — 230#

| a |
#arc — 42#
#arj — 53, 59, 170#
#arth — 156, 187#
#arṣ — 138#
#arh — 95, 199#
#ar — 42, 53, 59, 95, 122, 138, 156, 170, 187, 199#
#avekṣ — 80, 208, 218#
#aś — 140#
#asūy — 225#
#as — 60, 71, 125, 139, 143, 200, 210, 225#
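
The invisible spaces mentioned above can be hunted down mechanically. The following is only a minimal sketch: the helper name `invisible_chars` is hypothetical, and the choice of Unicode categories (`Zs`, `Cf`, `Cc`) is an assumption about what counts as "unseen" here, not something taken from the repository.

```python
import unicodedata

def invisible_chars(line):
    """List (index, codepoint, name) for characters that render as blank
    or as nothing at all, ignoring the ordinary ASCII space."""
    hits = []
    for i, ch in enumerate(line):
        # Zs = space separators (e.g. NO-BREAK SPACE), Cf = format characters
        # (e.g. ZERO WIDTH SPACE, BOM), Cc = control characters (e.g. TAB).
        if ch != " " and unicodedata.category(ch) in ("Zs", "Cf", "Cc"):
            hits.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED")))
    return hits

# A made-up test line: 'aBI' with a leading no-break space.
print(invisible_chars("\u00a0aBI"))
```

Running this over every line of the input file and printing only the non-empty results would show which entries (possibly the one that ends up under the empty group header) carry a stray character.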

gasyoun commented 9 years ago

Same with

kam
kar
kart
karś
karṣay
karṣ
kalpay
kalp
kāṅkṣ
kās
kḷp
kram
krīḍ
klid
kliś
kṣi
kṣip
kṣubh

Full list:

ac 
aYj 
aR 
an 
atI 
apekz 
aBI 
arc 
arj 
arT 
arz 
arh 
ar 
avekz 
aS 
asUy 
as 

Ap 
As 

inD 
iz 
i 

Ikz 
Ir 
IS 

ukz 
upapre 
upAs 
upeta 
upe 

Uh 

ej 
e 

f 

kam 
kart 
karS 
karzay 
karz 
kar 
kalpay 
kalp 
kANkz 
kAs 
kxp 
kram 
krIq 
klid 
kliS 
kzip 
kzi 
kzuB 

KyA 

gaR 
gad 
gam 
garD 
gar 
gAh 
gA 
gup 
guh 
gras 
grah 
glA 
glE 

Garz 
GAtaya 
Guz 

cakz 
car 
cal 
cit 
cint 
ci 
cud 
cezw 
coday 
cyu 

Cad 
Cid 
Cri 

jan 
jar 
jAgar 
ji 
jIv 
juz 
jYA 
jval 

takz 
tan 
tapasya 
tap 
tark 
tard 
tarp 
tfp 
tF 
tar 
tij 
tuz 
tras 
tvar 

TA 

dam 
darS 
dah 
dA 
diS 
dIp 
duHK 
du 
dyut 
dru 
dviz 

Dam 
DmA 
Dar 
DAv 
DA 

nad 
nand 
nam 
naS 
nind 
niveday 
nI 

paw 
paW 
pat 
pad 
parI 
pF 
par 
palAy 
pAr 
pAl 
pA 
piz 
pIq 
pU 
pyE 
pracC 
pratIkz 
pratIz 
pratyujjIv 
prI 
pretya 
prer 
prez 
pre 

banD 
bAD 
buD 
brU 

Bakz 
Baj 
BaYj 
Bar 
BAvay 
BAz 
BA 
Bid 
BI 
Buj 
BUz 
BU 
BraNS 

majj 
maRq 
maT 
mad 
mantray 
mantr 
manT 
man 
marj 
marS 
mar 
mahIya 
mAnay 
mA 
mil 
miz 
muc 
mud 
muh 

yaj 
yam 
yA 
yuj 
yu 

rakz 
rac 
raB 
ram 
rah 
ric 
riz 
ruc 
ruj 
rud 
ruD 
ruh 
ru 

lakzay 
lakz 
lag 
lap 
laB 
lamb 
laz 
liK 
lI 
lup 
lokay 
lok 
loc 

vac 
vaYc 
vad 
vaD 
vart 
varD 
varz 
var 
vas 
vAYC 
vA 
vid 
vind 
viS 
viDA 
vI 
vep 
ve 
vyaT 
vyaD 
vyAdiS 
vraj 
vraSc 
vrIq 

Saṁs 
Sak 
SaNs 
SabdAy 
Sabd 
Sam 
SAs 
Siz 
SI 
Suc 
SuD 
SumB 
Suz 
Sozay 
Sram 
Sri 
Sliz 
Svas 

zad 
zic 
ziD 
zo 
zWA 
zvaj 

sac 
sad 
saparya 
samutTA 
sarj 
sarp 
sar 
sah 
sAD 
sAntvay 
sA 
siD 
sic 
su 
sU 
sf 
so 
sev 
star 
sTA 
snA 
snih 
sparS 
spfS 
smar 
syand 
sru 
svad 
svid 
svaj 

han 
hary 
harz 
hf 
har 
has 
hA 
hu 
hvar 
hvA 

drdhaval2785 commented 9 years ago

This is not an issue. Closing. Noting the 'abhI' issue separately.

gasyoun commented 9 years ago

Why is this not an issue? Please explain. Is it because of https://github.com/drdhaval2785/SanskritSorting/issues/20#issuecomment-59347959, that is, because "We are not sorting by k, K, g etc. We sort according to ka Ka ga"?

drdhaval2785 commented 9 years ago

http://www.sanskrit-lexicon.uni-koeln.de/scans/MWScan/2014/web/webtc/servepdf.php?page=0002 shows the following ordering.

[screenshot]

Only after every possible vowel following 'ak' has been exhausted do they start 'akk'.

This was the original reason why we chose to handle the vowels and consonants separately.

But the present discussion has brought up an interesting ramification. In the same dictionary, on http://www.sanskrit-lexicon.uni-koeln.de/scans/MWScan/2014/web/webtc/servepdf.php?page=0001, aMS precedes aMSa.

[screenshot]

Hence the need to reopen this issue.

Prima facie, it looks like words ending with a halanta (aMS) precede those continuing with halanta+vowel+... (aMSa etc.), but words having a halanta in the middle follow words having the consonant plus a vowel (akkA follows akOSala).

gasyoun commented 9 years ago

aMS should come before aMSa; I have no reason to think otherwise, nor have I seen it ordered differently in dictionaries. "After every possible vowel after 'ak' is over, they start 'akk'" simply means sorting letter by letter. Can we have that?
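
To make "letter by letter" concrete, here is a minimal sketch of such a sort over SLP1 strings. It is not the repository's code: the names `SLP1_ORDER` and `slp1_key` are made up for illustration, and the alphabet order used (vowels, then M and H, then consonants) is an assumption based on the usual dictionary order.

```python
# Assumed SLP1 alphabet order: vowels, anusvara (M), visarga (H), consonants.
SLP1_ORDER = "aAiIuUfFxXeEoOMHkKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh"
RANK = {ch: i for i, ch in enumerate(SLP1_ORDER)}

def slp1_key(word):
    """Sort key with one rank per SLP1 letter; unknown characters sort last."""
    return [RANK.get(ch, len(SLP1_ORDER)) for ch in word]

words = ["asUy", "as", "aMSa", "aMS", "akkA", "akOSala", "kart", "kar"]
print(sorted(words, key=slp1_key))
```

Under these assumptions the result is aMS, aMSa, akOSala, akkA, as, asUy, kar, kart. Because every vowel ranks below every consonant, 'akO...' comes before 'akk...', and because a shorter key sorts before any longer key it is a prefix of, aMS comes before aMSa. Both orderings quoted from the dictionary pages are therefore consistent with a plain letter-by-letter comparison in SLP1.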

drdhaval2785 commented 9 years ago

$c[$i] = preg_replace('/(\\\u[0-9abcdef]{4})(\\\u094d)$/','!$1$2',$c[$i]); is the regex added to overcome this hurdle. I will have to look for potential side effects.
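
For readers who want to see what this regex does, here is a rough Python port. It is a sketch under assumptions: it supposes the sort keys are JSON-style `\uXXXX` escapes of the Devanagari spelling and that keys are compared as plain strings; `sort_key` and `FINAL_HALANTA` are illustrative names, not identifiers from the repository.

```python
import json
import re

# Port of the pattern above: a \uXXXX escape followed by an escaped
# virama (\u094d) at the very end of the key.
FINAL_HALANTA = re.compile(r"(\\u[0-9a-f]{4})(\\u094d)$")

def escape(devanagari):
    """Assumption: keys are JSON-style escapes, e.g. \\u0905\\u0902..."""
    return json.dumps(devanagari)[1:-1]

def sort_key(devanagari):
    """Put '!' before a word-final consonant+virama so it sorts early."""
    return FINAL_HALANTA.sub(r"!\1\2", escape(devanagari))

aMS = "\u0905\u0902\u0936\u094d"   # aMS, ends in consonant + virama
aMSa = "\u0905\u0902\u0936"        # aMSa, ends in the inherent vowel

print(sorted([aMS, aMSa], key=escape) == [aMSa, aMS])    # without the marker
print(sorted([aMS, aMSa], key=sort_key) == [aMS, aMSa])  # with the marker
```

Since '!' has a lower character code than the backslash that starts every escape, the marked key for aMS now compares lower than the key for aMSa, whereas without the marker aMSa (a prefix of aMS in this encoding) would come first.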

gasyoun commented 9 years ago

So it's ready for testing?

drdhaval2785 commented 9 years ago

Yes

gasyoun commented 9 years ago

Some strange side-effects occur.

ac
aṇ
an
ar
aś
as
añj
atī
apekṣ
abhī
arc
arj
arth
arṣ
arh
avekṣ
asūy

Now as is far away from asūy, which makes even less sense.
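
This side effect can be reproduced with a small model of the marker regex. The sketch below is an assumption-laden reconstruction, not the repository's code: it supposes keys are JSON-style escapes of the Devanagari spelling compared as plain strings, and the `ROOTS` table (SLP1 label mapped to Devanagari) is typed in by hand for illustration.

```python
import json
import re

FINAL_HALANTA = re.compile(r"(\\u[0-9a-f]{4})(\\u094d)$")

def sort_key(devanagari):
    """Escape the word, then mark a word-final consonant+virama with '!'."""
    return FINAL_HALANTA.sub(r"!\1\2", json.dumps(devanagari)[1:-1])

# SLP1 label -> Devanagari spelling (hand-typed, illustrative only).
ROOTS = {
    "ac": "\u0905\u091a\u094d",
    "aYj": "\u0905\u091e\u094d\u091c\u094d",
    "aR": "\u0905\u0923\u094d",
    "an": "\u0905\u0928\u094d",
    "atI": "\u0905\u0924\u0940",
    "apekz": "\u0905\u092a\u0947\u0915\u094d\u0937\u094d",
    "aBI": "\u0905\u092d\u0940",
    "arc": "\u0905\u0930\u094d\u091a\u094d",
    "arj": "\u0905\u0930\u094d\u091c\u094d",
    "arT": "\u0905\u0930\u094d\u0925\u094d",
    "arz": "\u0905\u0930\u094d\u0937\u094d",
    "arh": "\u0905\u0930\u094d\u0939\u094d",
    "ar": "\u0905\u0930\u094d",
    "avekz": "\u0905\u0935\u0947\u0915\u094d\u0937\u094d",
    "aS": "\u0905\u0936\u094d",
    "asUy": "\u0905\u0938\u0942\u092f\u094d",
    "as": "\u0905\u0938\u094d",
}

ordered = sorted(ROOTS, key=lambda name: sort_key(ROOTS[name]))
print(ordered)
```

In this model the '!' lands directly after the initial vowel for every root of the shape vowel+consonant (ac, aR, an, ar, aS, as), so all of them jump ahead of longer roots such as aYj or asUy, whose marker sits further to the right. That matches the order reported above, which suggests the marker should take effect only when two keys are otherwise equal up to the final consonant.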

drdhaval2785 commented 9 years ago

Tried in latest commit.

a- (6),
| a |
#aṁś#
#aṁśa#
#aṁśya#
#as#
#asa#
#asūy#

This is the output. This time I leave it to you to check and close the issue.