IshidaMotohiro / RMeCab

Interface to MeCab

How to generate a user-defined dictionary? #12

Closed kang37 closed 3 years ago

kang37 commented 3 years ago

Dear Ishida san, hello. Thank you for your kind reply earlier about the RMeCabFreq function, and sorry to disturb you again.

I am doing some text analysis, and I found that with the docDF function and the default dictionary, some terms are split into several words, which is not what I expected. For example, I want to keep "地球温暖化" as one word in my segmentation result, but it is split into "地球", "温暖" and "化". To solve this problem, a user-defined dictionary is required. Following the guidance on the RMeCab site, I went to p. 58 of Rによるテキストマイニング入門, 石田基広著, 森北出版, 2008. However, I found that the code didn't work on my computer. Here is the information about my system and code.

System: Windows 10 Pro, 64-bit operating system, x64-based processor.

*Prepare .csv File* First, I prepared the motohiro.csv file mentioned in the textbook: open Notepad on Windows, enter the required text "基広,-1,-1,1000,名詞,固有名詞,人名,名,,,基広,モトヒロ,モトヒロ", and save it as "motohiroansi.csv" in C:\data with the Encoding set to "ANSI". Since encoding can cause problems, I also saved a second file with the same text as "motohiroutf.csv" in C:\data, this time with the Encoding set to "UTF-8".
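For the term this question is really about, "地球温暖化", an entry in the same 13-field layout might look like the sketch below. The -1 context IDs and the cost of 1000 are copied from the textbook line above; the part-of-speech values (名詞,一般), the katakana reading, the `*` placeholders, the temporary folder, and the file name my.csv are assumptions for illustration only. It uses a POSIX shell (for example Terminal on a Mac), so the file is written as UTF-8 rather than through Notepad.

```shell
#!/bin/sh
# Sketch: write one user-dictionary entry as UTF-8 CSV and check its shape.
# The folder and file names below are hypothetical.
workdir="${TMPDIR:-/tmp}/userdic-demo"
mkdir -p "$workdir"
csv="$workdir/my.csv"

# Field order (IPADIC layout): surface, left ID, right ID, cost,
# POS, POS sub1, POS sub2, POS sub3, conjugation type, conjugation form,
# base form, reading, pronunciation.
# POS values and reading here are assumed, not taken from the textbook.
printf '%s\n' '地球温暖化,-1,-1,1000,名詞,一般,*,*,*,*,地球温暖化,チキュウオンダンカ,チキュウオンダンカ' > "$csv"

# Count comma-separated fields; a well-formed entry has 13.
fields=$(awk -F',' '{ print NF }' "$csv")
echo "fields: $fields"
```

A field count other than 13 usually means a comma was dropped or added while typing the entry.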

Code: I opened the Windows command prompt. First, I changed to the directory where mecab-dict-index.exe is stored, which went well:

cd C:\Program Files\MeCab\bin

Then I typed the following command:

mecab-dict-index.exe -d "C:\Program Files\MeCab\dic\ipadic" -u ishida.dic -f shift-jis -t shift-jis C:\data\motohiroansi.csv

and it says:

reading C:\data\motohiroansi.csv ... 1
emitting double-array: 100% |###########################################|
dictionary.cpp(500) [bofs] permission denied: ishida.dic

Then I tried the UTF-8 encoded .csv file:

mecab-dict-index.exe -d "C:\Program Files\MeCab\dic\ipadic" -u ishida.dic -f utf-8 -t shift-jis C:\data\motohiroutf.csv

and it gave the same error:

reading C:\data\motohiroutf.csv ... 1
emitting double-array: 100% |###########################################|
dictionary.cpp(500) [bofs] permission denied: ishida.dic

Is this a common problem, and how can I solve it? Thank you.

Besides, since encoding problems sometimes occur on Windows, I recently tried R on a Mac. How can I add a user-defined dictionary on a Mac? The guidance for user-defined dictionaries on the RMeCab site seems to have expired.

kang37 commented 3 years ago

Got it working on Windows! It turns out that the cause was administrator permission. Administrator permission is required to delete a file in "C:\Program Files\MeCab", and I guess it is also required to add a file to that directory. That is possibly why "dictionary.cpp(500) [bofs] permission denied: ishida.dic" appeared when I entered the command to create a new dictionary (see the post above). So this time I copied the whole "MeCab" folder (including all the files inside) from "C:\Program Files" to "C:\data", and ran:

cd C:\data\MeCab

to change the working directory, and then:

mecab-dict-index.exe -d "C:\data\MeCab\dic\ipadic" -u ishida.dic -f shift-jis -t shift-jis C:\data\motohiroansi.csv

then it says:

reading C:\data\motohiroansi.csv ... 1
emitting double-array: 100% |###########################################|
done!

Later I found the file "ishida.dic" in "C:\data\MeCab". I tried it with the docDF function and the RMeCabC function in R, and it works well! Thank you for the package :D

However, I haven't figured out how to do this on a Mac.

IshidaMotohiro commented 3 years ago

Thank you for your message. Since the problem on Windows seems to be solved, I am writing about the case on Mac.

If you installed MeCab from the source file (mecab-0.996.tar.gz), and the CSV source for your user dictionary (my.csv) is saved in /Users/myname/Documents (note: my.csv must be saved in UTF-8 encoding), launch the Terminal app and enter the following commands. They create my.dic in the same folder.

$ cd ~/Documents
$ # confirm that your csv file is here
$ ls 
  # my.csv 
$ # now build your custom dictionary
$ /usr/local/libexec/mecab/mecab-dict-index -d /usr/local/lib/mecab/dic/ipadic -u my.dic -f utf-8 -t utf-8 my.csv 

If you installed MeCab with Homebrew, the last line has to be replaced with

/usr/local/Cellar/mecab/0.996/libexec/mecab/mecab-dict-index -d /usr/local/lib/mecab/dic/ipadic -u my.dic -f utf-8 -t utf-8 my.csv
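Because the location of mecab-dict-index differs between installs, the paths can also be looked up rather than typed by hand. The following is only a sketch: it assumes the `mecab-config` helper that ships with MeCab is on your PATH, that the system dictionary sits in an `ipadic` folder under the directory it reports, and that my.csv is in the current folder. The variable names are made up for illustration.

```shell
#!/bin/sh
# Sketch: build a user dictionary without hard-coding install paths.
# Assumes my.csv (UTF-8) is in the current directory.
csv="my.csv"
dic="my.dic"

if command -v mecab-config >/dev/null 2>&1 && [ -f "$csv" ]; then
    # Ask MeCab where its helper programs and dictionaries live.
    indexer="$(mecab-config --libexecdir)/mecab-dict-index"
    sysdic="$(mecab-config --dicdir)/ipadic"   # assumed dictionary folder name

    if "$indexer" -d "$sysdic" -u "$dic" -f utf-8 -t utf-8 "$csv"; then
        result="built"
    else
        result="failed"
    fi
else
    # Nothing to do if MeCab or the CSV is missing.
    echo "mecab-config or $csv not found; nothing built" >&2
    result="skipped"
fi

echo "user dictionary: $result"
```

If this reports "built", my.dic should appear next to my.csv, just as with the explicit commands above.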

I hope this helps.

kang37 commented 3 years ago

Hi Ishida san, thank you for your kind help. I got it working with your information! I hadn't realized that the instructions for adding a user-defined dictionary on the website are for Mac until I saw your reply here. Sorry, I should have just followed the instructions on the website. Thank you very much.