dnmilne / wikipediaminer

An open source toolkit for mining Wikipedia

Sharing Wikipedia dump as well as csv extraction file #25

Open xiaohan2012 opened 10 years ago

xiaohan2012 commented 10 years ago

Hi,

Does anyone have the newest csv extraction file together with the Wikipedia dump?

I am having trouble setting up the summary extraction, and the dump for the csv summary (year 2011) is not there any more (it seems Wikipedia removes older versions regularly).

If anyone has those files (2011 or newer), can you share them somewhere?

rom1504 commented 9 years ago

If anybody is still interested, I did the extraction with the 2015-04-03 dump. It took about 10 days on a single node.

http://download.rom1504.fr/enwiki-20150403-csv.tar.gz https://dumps.wikimedia.org/enwiki/20150403/enwiki-20150403-pages-articles.xml.bz2

rom1504 commented 9 years ago

Btw, if you only need to extract plain text from Wikipedia, use https://github.com/attardi/wikiextractor instead. It's much faster and easier to use (you don't need to load a big index or anything like that).

Neuw84 commented 9 years ago

Did you build the database for that dump? I was getting problems with the pages-articles file (bzip2), which I solved by using a newer version of the library, and now I get this exception:

    15/05/05 16:31:25 INFO db.MarkupDatabase: Loading markup database: 421.08% in 01:32:22, ETA 00:00:-4226
    Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 8192
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:546)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1735)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.arrangeCapacity(XMLEntityScanner.java:1606)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipString(XMLEntityScanner.java:1644)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1748)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2973)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
    at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:558)
    at org.wikipedia.miner.db.MarkupDatabase.loadFromXmlFile(MarkupDatabase.java:111)
    at org.wikipedia.miner.db.WEnvironment.buildEnvironment(WEnvironment.java:738)
    at org.wikipedia.miner.util.EnvironmentBuilder.main(EnvironmentBuilder.java:30)

Seems to be related to: https://bugs.openjdk.java.net/browse/JDK-7156085 http://a-sirenko.blogspot.com.es/2013/08/jdk-7-sax-parser-produces.html
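
If it really is that JDK bug, one possible workaround (untested here, and assuming the Woodstox jar is available on the classpath) would be to force a different StAX provider before running the builder, since the trace goes through the JDK-internal Xerces stream reader:

    // Untested workaround sketch: force a non-JDK StAX implementation (here Woodstox,
    // whose jar must be on the classpath) so the markup import does not go through the
    // com.sun.org.apache.xerces.internal reader hit by JDK-7156085.
    public class BuildWithWoodstox {
        public static void main(String[] args) throws Exception {
            System.setProperty("javax.xml.stream.XMLInputFactory",
                    "com.ctc.wstx.stax.WstxInputFactory");
            // delegate to the normal builder, passing the usual configuration file argument
            org.wikipedia.miner.util.EnvironmentBuilder.main(args);
        }
    }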

rom1504 commented 9 years ago

I built the database (it generated a 49 GB db/ directory, btw), but when I tried to use it, it told me it needed 2 hours to load the database, and a lot of memory. At that point I decided to try something other than Wikipedia Miner. Still, I think my dump works, if you're prepared to wait a pretty long time every time you want to load the database.

I only used Wikipedia Miner 1.2, so I have no idea whether the recent commits are better than 1.2 or not.

Neuw84 commented 9 years ago

My database for the December 2012 dump is about 42 GB, so yours should be OK. The loading time can be reduced by not caching everything: I have Spanish and English Wikipedia instances both running with 7.5 GB of RAM and get enough speed for my needs. For example, for English I only cache the label database, using the 'space' caching option. You can also work with the library without waiting for the caching to finish, either through the API or the REST services (or just modify the line in the web demo that checks whether cache loading has finished).
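
For reference, this is roughly what the set-up looks like in code (only a sketch against the 1.2-style API; verify the addDatabaseToCache method and the DatabaseType/CachePriority enums against the sources in your checkout, as the exact names may differ):

    import java.io.File;

    import org.wikipedia.miner.db.WDatabase.CachePriority;
    import org.wikipedia.miner.db.WDatabase.DatabaseType;
    import org.wikipedia.miner.model.Wikipedia;
    import org.wikipedia.miner.util.WikipediaConfiguration;

    public class OpenWikipedia {
        public static void main(String[] args) throws Exception {
            // args[0] = path to the wikipedia XML configuration file
            WikipediaConfiguration conf = new WikipediaConfiguration(new File(args[0]));

            // assumed API: cache only the label database, trading speed for memory
            conf.addDatabaseToCache(DatabaseType.label, CachePriority.space);

            // second argument enables threaded caching, so the instance is usable
            // immediately while the label cache warms up in the background
            Wikipedia wikipedia = new Wikipedia(conf, true);

            System.out.println(wikipedia.getArticleByTitle("New Zealand"));
        }
    }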

In my fork, I have fixed some bugs that I have encountered, but nothing related to speed or memory.

Btw thanks for reporting that it works.

vshvedov commented 9 years ago

@rom1504 how did you manage to build the db? There is a weird situation: the architecture in this repo differs from the archived 1.2 version I downloaded from http://wikipedia-miner.cms.waikato.ac.nz

Also, would it be possible to upload that db? I can provide FTP access for sharing.

Thank you.

rom1504 commented 9 years ago

@vshvedov I used the 1.2 version and followed https://github.com/dnmilne/wikipediaminer/wiki/Installing-the-java-api#build-the-berkeley-database to build the db. As I said by email, I can send you the db, but you should be able to build it yourself ;)

amirj commented 9 years ago

@rom1504 thank you for sharing. It seems that something is wrong with the 'translations.csv' file. After importing the csv files into the database, calling 'Article.getTranslations()' always returns nothing. Is it possible to check the following code?

    Article art = wikipedia.getArticleByTitle("Iran");
    System.out.println(art.getTranslations().size());

The result is 0 in my installation.

apohllo commented 9 years ago

The problem is that WM is based on the outdated assumption that interlanguage links are still present in the main Wikipedia dump. For some time now they have (mostly) been stored in Wikidata. They are also present in the dumps, but in a different file (one of the SQL dumps, langlinks). WM would have to be updated to produce correct translation data.
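
Until that happens, one stopgap (completely outside WM and its csv files) is to fetch the interlanguage links live from the public MediaWiki API with prop=langlinks; a minimal sketch:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    // Stopgap sketch, unrelated to WM's csv files: query the live MediaWiki API for
    // the interlanguage links of a page and print the raw JSON response.
    public class LanglinksLookup {
        public static void main(String[] args) throws Exception {
            String title = args.length > 0 ? args[0] : "Iran";
            URL url = new URL("https://en.wikipedia.org/w/api.php"
                    + "?action=query&prop=langlinks&lllimit=500&format=json&titles="
                    + URLEncoder.encode(title, "UTF-8"));
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("User-Agent", "langlinks-lookup-example");
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // JSON with one entry per language link
                }
            }
        }
    }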

ali3assi commented 8 years ago

@rom1504 how can I get the following file, please? https://dumps.wikimedia.org/enwiki/20150403/enwiki-20150403-pages-articles.xml.bz2

The link cited above is not working.

Thank you in advance for your help

xiaohan2012 commented 8 years ago

@TamouzeAssi I think old dumps expire and become unavailable after a while. You can use a newer dump instead.