Shreeshrii / hindi-hunspell

Hindi wordlists, dictionary and affix files in hunspell format

How to make a hunspell file for Sanskrit? #1

Open vvasuki opened 7 years ago

vvasuki commented 7 years ago

Suppose that I have a Sanskrit pada (word) list in Devanagari. How do I make a hunspell file? Detailed step-by-step instructions would be very helpful. नमः पार्वतीपतये 🙏

Shreeshrii commented 7 years ago

It has been a while since I worked on hunspell. I will take a look and get back.

It will help if you post a short sample of the list that you have.

Ref: old issues filed with hunspell

https://github.com/hunspell/hunspell/issues/10

https://github.com/hunspell/hunspell/issues/221

Shreeshrii commented 7 years ago
  1. Install hunspell

  2. Read the hunspell documentation: http://manpages.ubuntu.com/manpages/trusty/man4/hunspell.4.html

  3. Modify https://github.com/Shreeshrii/hindi-hunspell/blob/master/hi-common.aff for Sanskrit, deleting nuktas etc.

  4. Take a look at https://github.com/Shreeshrii/hindi-hunspell/blob/master/hi-verbs.aff and https://github.com/Shreeshrii/hindi-hunspell/blob/master/hi-verb1.dic

SFX 1 is suffix rule group 1 in hi-verbs.aff; all the verb roots tagged with /1 (as in hi-verb1.dic), when combined with these rules, form valid words.

# verb forms for aa-kaaranta kriyaa-pada eg. बुलवा, गिरवा, खा, गा, तिलमिला (exception कराके)
SFX 1 Y 18
SFX 1 0 ई .
SFX 1 0 कर [^करा]
SFX 1 0 ना .
SFX 1 0 ने .

e.g. उठा/1 combined with all the rules means that the following words will be valid:

उठाई उठाकर उठाना उठाने

and so on
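
If you want to check what a .dic/.aff pair actually expands to, the unmunch tool that ships with hunspell (in the hunspell-tools package on Debian/Ubuntu) prints every generated surface form. A minimal sketch, assuming hunspell-tools is installed and you are in the repository folder (the output file name is just an example):

# expand the roots in hi-verb1.dic with the rules in hi-verbs.aff
unmunch hi-verb1.dic hi-verbs.aff > expanded-verbs.txt

expanded-verbs.txt should then contain उठाई, उठाकर, उठाना, उठाने and so on for each tagged root.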

Shreeshrii commented 7 years ago
# verb forms for ii-kaaranta kriyaa-pada eg  पी, जी, सी
SFX 3 Y 17
SFX 3 0 कर .
SFX 3 0 ना .
SFX 3 0 ने .
SFX 3 0 ता .
SFX 3 0 ती .
SFX 3 0 ते .
SFX 3 ी िए ी
SFX 3 ी िएगा ी
SFX 3 ी िएगी ी
SFX 3 ी िएँगे ी
SFX 3 ी िऊँ ी
SFX 3 ी िऊँगा ी
SFX 3 ी िऊँगी ी
SFX 3 ी िआ ी
SFX 3 ी िए ी
SFX 3 ी िला ी
SFX 3 ी िलवा ी

This shows how the base form gets changed, e.g.

SFX 3 ी िए ी means पी changes to पिए
SFX 3 ी िएगा ी means जी changes to जिएगा
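
For reference, this is the standard hunspell affix syntax (see the man page linked earlier): the header line SFX 3 Y 17 declares the flag (3), whether the suffixes may combine with prefix rules (Y), and how many rule lines follow (17); each rule line then gives the flag, the characters to strip from the end of the stem, the suffix to append, and the condition the stem must end with. Annotating the first rule above (the # lines are just annotation):

SFX 3 ी िए ी
# flag 3 - matches the /3 tag on entries in the .dic file
# strip  - ी is removed from the end of the stem
# append - िए is added in its place
# condition - the rule applies only to stems ending in ी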

Shreeshrii commented 7 years ago

So basically, you will need to create affix rules for the different roots, e.g. from पठ् you need rules to get पठति, पठतः, पठन्ति, पठसि, पठथः, पठथ, पठामि, पठावः, पठामः, etc.

So your .dic file will have

पठ्/1

and your affix file will have

SFX 1 Y NN

SFX 1 ् ति ्
SFX 1 ् तः ्
SFX 1 ् न्ति ्

where NN will be the number of rules for SFX 1.
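
Putting the pieces together, here is a minimal self-contained sketch of the two files for just the nine present-tense forms listed above. The file names san.aff and san.dic and the flag 1 are only illustrative; note that the first line of a .dic file is the (approximate) number of entries, and the .aff file should declare the encoding with SET UTF-8.

san.aff:

SET UTF-8
SFX 1 Y 9
SFX 1 ् ति ्
SFX 1 ् तः ्
SFX 1 ् न्ति ्
SFX 1 ् सि ्
SFX 1 ् थः ्
SFX 1 ् थ ्
SFX 1 ् ामि ्
SFX 1 ् ावः ्
SFX 1 ् ामः ्

san.dic:

1
पठ्/1

With both files in one folder, unmunch san.dic san.aff should print the nine forms, and echo पठामि | hunspell -d ./san should accept the word.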

Shreeshrii commented 7 years ago

affixcompress (dictionary generation from large, millions-of-words vocabularies) and munch (dictionary generation from vocabularies; it needs an affix file, too) can be used to brute-force a dictionary file from a large vocabulary list, though the rules they create are arbitrary rather than grammar-based.

vvasuki commented 7 years ago

Ah I see, thanks for the tips so far. I want to start from a large vocabulary file - every subanta and ti~Nanta word (such as पठति पठतः…), and get a dictionary out of it. If I understand correctly, I should run affixcompress and munch in sequence?

Shreeshrii commented 7 years ago

https://groups.google.com/forum/#!topic/freetamilcomputing/dEQgHESN9us

This thread discusses different methods to create an affix file for an agglutinative language like Tamil.

Shreeshrii commented 7 years ago

You can try the following:

LC_ALL=hi_IN.UTF-8 sort sanskrit-words.txt | LC_ALL=hi_IN.UTF-8 uniq > hin
LC_ALL=hi_IN.UTF-8 affixcompress hin 5000

where

sanskrit-words.txt is the Sanskrit word list. The output files will be san.aff and san.dic.

The number 5000 is the default; you can change it to a larger or smaller value to get an optimized size for the .dic and .aff files.

Shreeshrii commented 7 years ago

Change hin to san in the above.

affixcompress creates the .aff and .dic files.

munch may not work with hunspell files.
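
One way to sanity-check the generated pair is to expand it back into a full wordlist with unmunch and compare it against the input; a rough sketch, assuming the sorted input list was named san as above and hunspell-tools is installed:

# regenerate the full wordlist from the compressed .dic + .aff pair
unmunch san.dic san.aff | LC_ALL=hi_IN.UTF-8 sort -u > roundtrip.txt
# lines reported by diff are words lost or over-generated by the compression
LC_ALL=hi_IN.UTF-8 sort -u san | diff - roundtrip.txt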

Shreeshrii commented 7 years ago

munch is supposed to take a wordlist and apply the affix rules to it to create a dictionary file. However, I get an error when running it - maybe you can troubleshoot the hash table related bug.

Please see https://github.com/hunspell/hunspell/issues/470

Attached is a start of Sanskrit-related rules in the san.aff.txt affix file: san.aff.txt

Shreeshrii commented 7 years ago

I have created a sample file that can be used for a Sanskrit dictionary - it has a very small wordlist but can be used as a proof of concept.

Please see https://github.com/Shreeshrii/hindi-hunspell/blob/master/dict-sa_IN.zip

You can add your larger/complete .aff and .dic files to a folder like https://github.com/Shreeshrii/hindi-hunspell/tree/master/dict-sa_IN and then zip it.
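
For instance, with the .aff and .dic files placed inside a dict-sa_IN folder (the zip file name is just an example):

zip -r dict-sa_IN.zip dict-sa_IN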

gasyoun commented 7 years ago

different methods to create an affix file

Dhaval has already made an affix file in the past, but not in hunspell format. Please add him here so that we can ask him, @Shreeshrii

Shreeshrii commented 7 years ago

@vvasuki A couple of years or more ago, I had contacted Gérard Huet about using the dictionary resources at "Héritage du Sanskrit" to create the Sanskrit hunspell files. However, because of problems running the hunspell programs with Devanagari, I had not posted the results.

I have updated

https://github.com/Shreeshrii/hindi-hunspell/blob/master/dict-sa_IN.zip and the dic and aff files in https://github.com/Shreeshrii/hindi-hunspell/tree/master/dict-sa_IN

with one version of the files.

Please also see https://github.com/hunspell/hunspell/issues/470#issuecomment-282509768 for a suggested algorithm for building the affix and dictionary files from a wordlist.

Perhaps there are Python programmers in the Sanskrit programmers group who would want to tackle that.

Shreeshrii commented 7 years ago

@gasyoun I do not have Dhaval's handle on github to add here.

vvasuki commented 7 years ago

A couple of years or more ago, I had contacted Gérard Huet about using the dictionary resources at "Héritage du Sanskrit" to create the Sanskrit hunspell files.

That's close to what I had in mind for an initial crude dictionary! So you are using programmatically generated declensions (nouns, roughly) and conjugations (tiNanta-s, roughly verbs)?

However, because of problems running the hunspell programs with Devanagari, I had not posted the results.

Are these problems now gone?

Shreeshrii commented 7 years ago

Are these problems now gone?

affixcompress uses gawk for building the aff files. The program crashes when using very large Devanagari wordlists. I don't know whether the limitation is because of memory on my PC.

Shreeshrii commented 7 years ago

Another limitation with using either these lists or word frequency lists is that they don't take into account sandhi and samasa rules, so all those compound words will show up as spelling errors.

Shreeshrii commented 7 years ago

Dhaval has already made an affix file in the past, but not in hunspell format. Please add him here so that we can ask him.

Adding @drdhaval2785

Shreeshrii commented 7 years ago

@vvasuki I used http://linguae.stalikez.info/ to export headwords and CSV from the dictionary files and then used sed to extract just the Devanagari words from them. I am testing with different combinations of lists to see the maximum possible without getting an error.
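
The exact sed commands are not reproduced here; as one hypothetical equivalent, a Perl-regex grep can pull just the Devanagari tokens out of the exported CSV (headwords.csv is an example name, and this assumes GNU grep built with PCRE support and a UTF-8 locale):

grep -oP '\p{Devanagari}+' headwords.csv | LC_ALL=hi_IN.UTF-8 sort -u > devanagari-words.txt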

vvasuki commented 7 years ago

affixcompress uses gawk for building the aff files.

http://www.suares.com/index.php?page_id=25&news_id=233 suggests that the affix file can well be empty! Why not try with just a huge sorted word list as a .dic file?

Another limitation with using either these lists or word frequency lists is that they don't take into account sandhi and samasa rules, so all those compound words will show up as spelling errors.

I thought of that as well - this can be eliminated to some extent for words ending with visarga: just add रामो रामस् for रामः, and रामा for रामाः, and हरिर् हरिस् for हरिः. At a later stage, one can combine them with words starting with vowels, somehow tell hunspell to treat ऽ as a space, etc.
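
One way to express that idea as affix rules, instead of enlarging the wordlist, is a suffix group (the flag V below is just illustrative) that rewrites a final visarga into its common pre-sandhi variants. A rough, non-exhaustive sketch:

SFX V Y 5
# -अः -> -ओ and -अः -> -अस् (रामः -> रामो, रामस्)
SFX V ः ो [^ािीुूृेैोौ]ः
SFX V ः स् [^ािीुूृेैोौ]ः
# -आः -> -आ (रामाः -> रामा)
SFX V ः 0 ाः
# -इः -> -इर् and -इः -> -इस् (हरिः -> हरिर्, हरिस्)
SFX V ः र् िः
SFX V ः स् िः

Entries such as रामः would then carry the flag in the .dic file, e.g. रामः/V.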

Shreeshrii commented 7 years ago

I was able to generate the files by converting to IAST and back to Devanagari. I will upload them tomorrow. The files are pretty large, 45 MB for Devanagari. But these have random affix rules, not based on grammar.

You can see the IAST version uploaded to the repo.

A better option will be to define the affix rules for all roots, which are probably already available with those who have worked on Sanskrit grammar.

Shreeshrii commented 7 years ago

https://github.com/Shreeshrii/hindi-hunspell/tree/master/sa-Latn

See above for iast.aff and iast.dic, the Sanskrit hunspell files in IAST transliteration.

Shreeshrii commented 7 years ago

@vvasuki I have uploaded the Devanagari version of the files. Please see

https://github.com/Shreeshrii/hindi-hunspell/tree/master/sa-Deva

Commands used for building (mainly the affixcompress utility):

# remove intermediate files from a previous run
rm *.tmp
rm word
rm word2
rm san
# combine all the source wordlists into one file
cat sample.txt freq.txt dict-d.txt dict-g.txt dict-mw.txt A1.txt A2.txt A3.txt B.txt > dict.txt
# sort and deduplicate in the Hindi locale, then let affixcompress derive the affix rules
LC_ALL=hi_IN.UTF-8 sort dict.txt | LC_ALL=hi_IN.UTF-8 uniq > san
LC_ALL=hi_IN.UTF-8 affixcompress san 5000
# combine deva.aff.txt with the generated san.aff into the final affix file
cat deva.aff.txt san.aff > sa_IN.aff
mv san.dic sa_IN.dic

Shreeshrii commented 7 years ago

You can give these files a try and follow a similar process with your wordlist.
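
As a quick check that the built pair loads, something along these lines should work (the test words are arbitrary; -l prints only the words the dictionary rejects):

echo "रामः पठति qwerty" | hunspell -d ./sa_IN -l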

Shreeshrii commented 7 years ago

See the Nepali language hunspell dictionary for an example of a grammar-based dictionary:

http://packages.ubuntu.com/zesty/hunspell-ne

gasyoun commented 7 years ago

can be used to brute-force a dictionary file from a large vocabulary list, though the rules they create are arbitrary rather than grammar-based.

You've started with it yourself as well, right?

The program crashes when using very large Devanagari wordlists.

Like 400k words?

Shreeshrii commented 7 years ago

Yes, I experimented a bit in February in response to Vishvas's query. The final files are in this repo. I am not working on it now.

The Hunspell project is supposed to come out with a better tool in v2.0, but I am not sure when that will happen.

Shreeshrii commented 7 years ago

Looks like I deleted many of the files/folders referenced in discussion above in one of the updates, probably in this commit https://github.com/Shreeshrii/hindi-hunspell/commit/fd06f9c84839973b47868c5fa864dae39a27938b

I think the reason was that the spell-check files were becoming very large and were too slow to use, e.g. with Notepad++.

gasyoun commented 3 years ago

I deleted many of the files/folders referenced in discussion above in one of the updates

Bad news for Sanskrit hunspell as well.

Shreeshrii commented 3 years ago

It may be possible to get them from https://github.com/Shreeshrii/hindi-hunspell/commit/fdc8508fe2d08dea5c91e52c36a692db320a2ae8