dnmilne / wikipediaminer

An open source toolkit for mining Wikipedia

Installing Wikipedia Miner with Chinese(language) dump #19

Open wsj14847 opened 10 years ago

wsj14847 commented 10 years ago

I know how to install the Wikipedia Miner, and it works when I use the English dump.

Now I am trying to install Wikipedia Miner with a Chinese dump. After deploying the web services, I can access the service as normal, but it doesn't return any links for Chinese words.

My steps were as follows:
Step 1: I got zhwikisource-20140308-pages-articles-multistream.xml.bz2 from http://dumps.wikimedia.org
Step 2: Built wikipedia-miner.jar (same as for English)
Step 3: Changed language.xml (Q1): Language code="zh" name="Chinese" localName="中文", and RootCategory is changed to 分类
Step 4: Ran the Dump Extractor and got the summaries in /final
Step 5: Ran ant build-database and got the jdb files
Step 6: Created the WAR and deployed it to Tomcat

Maybe I did something wrong in the installation, specifically for Chinese. Could you please tell me the reason? Thank you.
(Q1) Is this change right?
(Q2) I found some files with the suffix ".model" in /models/compare(annotate), and they are for en and de only. Is this the key point? If so, could you tell me where I can get these files for Chinese, or how to create them (which tools and which method)?

Thank you very much!

apohllo commented 10 years ago

Have you provided Chinese sentence splitting rules?

hadoop jar wikipedia-miner-hadoop.jar org.wikipedia.miner.extraction.DumpExtractor input/enwiki-latest-pages-articles.xml input/languages.xml en input/en-sent.bin output

Look at the last input option, the sentence model: for Chinese it should be input/zh-sent.bin instead of input/en-sent.bin. This file is not shipped with Wikipedia Miner, though, so you have to find it somewhere on the Internet.
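As a rough sketch, the command above adapted for Chinese could look like the following. The dump file name is a placeholder that simply mirrors the English example, and zh-sent.bin is assumed to have been obtained separately. The script only prints the command (a dry run), so nothing is executed until the paths have been checked.

```shell
# Dry run: build the DumpExtractor command for Chinese and print it.
# All paths are placeholders (assumptions) mirroring the English example.
LANG_CODE="zh"                                  # should match the language code in languages.xml
DUMP="input/zhwiki-latest-pages-articles.xml"   # hypothetical dump file name
SENT_MODEL="input/zh-sent.bin"                  # Chinese sentence model, not shipped with Wikipedia Miner

CMD="hadoop jar wikipedia-miner-hadoop.jar org.wikipedia.miner.extraction.DumpExtractor $DUMP input/languages.xml $LANG_CODE $SENT_MODEL output"

# Print the command so it can be inspected before being run by hand.
echo "$CMD"
```

Compared with the English invocation, three arguments change: the dump path, the language code, and the sentence model.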

The other problem might be the locale on your machine. Are you sure it is set to the correct (Chinese) values?
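One quick way to check this from the shell is sketched below. It assumes the concern is the character encoding that programs inherit from the locale, so it only tests whether the effective locale declares UTF-8. The function names are made up for illustration, and other Chinese encodings may also work.

```shell
# Print the locale that programs started from this shell will see.
# POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
effective_locale() {
    echo "${LC_ALL:-${LC_CTYPE:-${LANG:-}}}"
}

# Succeed if the given locale name declares a UTF-8 codeset.
is_utf8_locale() {
    case "$1" in
        *.UTF-8*|*.utf8*|*.UTF8*|*.utf-8*) return 0 ;;
        *) return 1 ;;
    esac
}

loc="$(effective_locale)"
if is_utf8_locale "$loc"; then
    echo "locale '$loc' declares UTF-8"
else
    echo "locale '${loc:-unset}' does not declare UTF-8; Chinese text may be mis-decoded"
fi
```

Running `locale` on its own lists all the individual LC_* settings if a closer look is needed.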

wsj14847 commented 10 years ago

Thank you for your comment. I didn't provide Chinese sentence splitting rules; I just used en-sent.bin instead. I will look for zh-sent.bin and try again.

The locale on my machine is Chinese, so that part is correct.

And is Q2 a problem?

apohllo commented 10 years ago

Sorry, I don't know. I only use the Wikipedia Miner extraction framework.