drdhaval2785 / SanskritSorting

Codes written by Dr. Dhaval Patel for Sanskrit Natural Language Programming

Split | क | to || तक ||, || नक ||, || बक || subheaders #1

Open gasyoun opened 9 years ago

gasyoun commented 9 years ago

The input is SLP1. The output will be IAST, not Devanagari. In reverse sorting, can we add an option to be notified (with a subheader) when the second-to-last matra from the end changes as well? If words are sorted only by the last matra, some letters will take 10-100 pages, and finding the needed word in a printed book will be slow and hard. Cases like कोशा-तक, दम-नक should be easy; ligatures should not be broken (?): च-म्पक. Examples: आम्रा-तक, भल्ला-तक, कोशा-तक, दम-नक, च-म्पक, कुर-बक, सैरी-यक.

Instead of ordering just by | क |, the idea is to have || तक ||, || नक ||, || बक || subheaders.
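A minimal sketch of the idea, not code from this repository: split each SLP1 word into aksharas without breaking conjuncts, then use the last akshara as the header and the last two as the subheader. The names `aksharas`, `headers` and `group_reverse` are hypothetical, and the sort uses plain string comparison rather than Sanskrit alphabetical order.

```python
import re
from collections import defaultdict

# SLP1 vowel letters; every other character is treated as a consonant
# or a modifier (M = anusvara, H = visarga, ~ = anunasika).
VOWELS = "aAiIuUfFxXeEoO"

# One akshara = optional consonant cluster + vowel + optional modifier,
# or a vowel-less consonant cluster at the very end of the word.
AKSHARA = re.compile(rf"[^{VOWELS}]*[{VOWELS}][MH~]?|[^{VOWELS}]+$")


def aksharas(word):
    """Split an SLP1 word into aksharas, keeping conjuncts together."""
    return AKSHARA.findall(word)


def headers(word):
    """Return (header, subheader): the last one and the last two aksharas."""
    parts = aksharas(word)
    return parts[-1], "".join(parts[-2:])


def group_reverse(words):
    """Group non-empty words as {header: {subheader: [words]}}.

    Words are ordered by their aksharas read from the end of the word.
    Assumption: plain string comparison, not Sanskrit alphabetical order.
    """
    groups = defaultdict(lambda: defaultdict(list))
    for word in sorted(words, key=lambda w: aksharas(w)[::-1]):
        header, sub = headers(word)
        groups[header][sub].append(word)
    return groups


words = ["AmrAtaka", "BallAtaka", "koSAtaka", "damanaka",
         "campaka", "kurabaka", "sErIyaka"]
groups = group_reverse(words)
for sub, members in groups["ka"].items():
    print(sub, members)
```

With this splitting, चम्पक (`campaka`) falls under the subheader `mpaka`, so the conjunct stays intact. Whether a conjunct-final word should get its own header or share one with the simple consonant is left open here.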

drdhaval2785 commented 9 years ago

Who decides up to which level we want this? There will be many smaller groups with 5-6 members as well.

gasyoun commented 9 years ago

5-6 members does not sound that bad. 2-3 members is worse. I can't predict, I just do not know. The 2nd level for sure. No idea whether that will help and be enough. It needs testing. What else can I say :)

drdhaval2785 commented 9 years ago

At least some reasonable demand should be there.

  1. Should we be sorting only the words that end with 'क', 'ख', etc.?
  2. If yes, should we also split आक, इक, ईक, etc., or keep only to तक, बक, मक, यक, etc.?
  3. When the header is a consonant (let's say त्), should we also sort like अत्, आत्, इत्?
  4. When the header is a vowel (let's say आ), should we sort का, खा, गा? What is the expected output?

gasyoun commented 9 years ago

  1. To have a clear answer, https://github.com/drdhaval2785/SanskritSorting/issues/15 would be needed first. The subgroups other than 'क', 'ख' seem to be small, so there is no point in splitting them at all.
  2. Depends on the same stats, but I would rather split आक, इक, ईक.
  3. Not sure what would need to change; I would love to know your opinion as well. Pure consonant and pure vowel chapters are small by default and do not need any additional splitting.
  4. As per No. 3: should we sort का, खा, गा? I would say no.

| क |
|| तक ||
|| नक ||
|| बक ||
| ख |
|| खक ||
|| खक ||
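Since the output is meant to be IAST rather than Devanagari, the header layout above could be produced roughly as follows. This is only a sketch under assumptions: `slp1_to_iast` and `render` are hypothetical names, the map covers the standard SLP1 letters only (Vedic ळ and anunasika are left out and pass through unchanged), and the `| x |` / `|| xy ||` markers are copied from this thread.

```python
# SLP1 -> IAST letter map; SLP1 uses exactly one character per sound.
SLP1_TO_IAST = {
    "a": "a", "A": "ā", "i": "i", "I": "ī", "u": "u", "U": "ū",
    "f": "ṛ", "F": "ṝ", "x": "ḷ", "X": "ḹ",
    "e": "e", "E": "ai", "o": "o", "O": "au",
    "M": "ṃ", "H": "ḥ",
    "k": "k", "K": "kh", "g": "g", "G": "gh", "N": "ṅ",
    "c": "c", "C": "ch", "j": "j", "J": "jh", "Y": "ñ",
    "w": "ṭ", "W": "ṭh", "q": "ḍ", "Q": "ḍh", "R": "ṇ",
    "t": "t", "T": "th", "d": "d", "D": "dh", "n": "n",
    "p": "p", "P": "ph", "b": "b", "B": "bh", "m": "m",
    "y": "y", "r": "r", "l": "l", "v": "v",
    "S": "ś", "z": "ṣ", "s": "s", "h": "h",
}


def slp1_to_iast(text):
    """Transliterate SLP1 to IAST; unknown characters pass through unchanged."""
    return "".join(SLP1_TO_IAST.get(ch, ch) for ch in text)


def render(groups):
    """Turn {header: [subheaders]} (SLP1) into marker lines in IAST."""
    lines = []
    for header, subs in groups.items():
        lines.append(f"| {slp1_to_iast(header)} |")
        for sub in subs:
            lines.append(f"|| {slp1_to_iast(sub)} ||")
    return lines


layout = render({"ka": ["taka", "naka", "baka"], "Ka": ["Kaka"]})
print("\n".join(layout))
```

Transliterating only at output time keeps the grouping logic on SLP1, where every sound is a single character.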

drdhaval2785 commented 9 years ago

#15 is closed now. So 1, 2, 3 and 4 can be decided.

gasyoun commented 9 years ago

There are 367 headwords right now. Some are big groups:

ya 16448, ra 14075, ka 13253, ta 12797, na 11676, n 9438

Some are smaller, but more than 1-5 pages long in the printed book: va 6371, ṇa 6303, la 4965, ti 4836, t 4453, kā 4350, s 3488, tā 3420, da 3142, ma 3135, ga 2981, m 2904, ha 2410, ṣa 2401, pa 2310, śa 2293, tha 2277, sa 2151, ṭa 2131, dha 2129, ja 1933, yā 1890, nī 1676, d 1597, rī 1557, dhi 1323

So what I'm thinking about is next-level subheadings. I would go the तक, बक, मक, यक way. What do you say?

drdhaval2785 commented 9 years ago

@gasyoun I'd rather avoid double standards. Let's not distinguish between small groups and big groups.

gasyoun commented 9 years ago

I do not. What I mean is that some groups are so big that the reader should be made aware of it before getting there. The तक, बक, मक, यक way sounds reasonable.