xiaohan2012 opened 10 years ago
If anybody is still interested, I did the extraction with the 2015-04-03 dump. It took about 10 days on a single node.
http://download.rom1504.fr/enwiki-20150403-csv.tar.gz https://dumps.wikimedia.org/enwiki/20150403/enwiki-20150403-pages-articles.xml.bz2
Btw, if you only need to extract plain text from Wikipedia, use https://github.com/attardi/wikiextractor instead; it's much faster and easier to use (you don't need to load a big index or anything like that).
Did you build the database for that dump? I was having problems with the pages-articles file (bzip) that I solved by using a newer version of the library, but now I get this exception:
15/05/05 16:31:25 INFO db.MarkupDatabase: Loading markup database: 421.08% in 01:32:22, ETA 00:00:-4226
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 8192
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:546)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1735)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.arrangeCapacity(XMLEntityScanner.java:1606)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipString(XMLEntityScanner.java:1644)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1748)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2973)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:558)
at org.wikipedia.miner.db.MarkupDatabase.loadFromXmlFile(MarkupDatabase.java:111)
at org.wikipedia.miner.db.WEnvironment.buildEnvironment(WEnvironment.java:738)
at org.wikipedia.miner.util.EnvironmentBuilder.main(EnvironmentBuilder.java:30)
Seems to be related to: https://bugs.openjdk.java.net/browse/JDK-7156085 http://a-sirenko.blogspot.com.es/2013/08/jdk-7-sax-parser-produces.html
I built the database (which produced a 49 GB db/ directory, by the way), but when I tried to use it, it said it needed 2 hours to load the database, and it needs a lot of memory. At that point I decided to try something other than Wikipedia Miner. Still, I think my dump works, if you're prepared to wait a pretty long time every time you want to load the database.
I only used Wikipedia Miner 1.2; no idea whether the recent commits are better than 1.2 or not.
My database dump from December 2012 is about 42 GB, so yours should be fine. The loading time can be reduced by not caching everything: I have Spanish and English Wikipedia instances running with 7.5 GB of RAM each and get enough speed for my needs. For English, for example, I only cache the label database using the space option. You can also work with the library without waiting for the caching to finish, via the Java API or the REST services (or just modify the line in the web app that checks whether the cache loading has finished).
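For illustration, here is roughly what that label-only caching looks like with the 1.2 Java API. I'm writing the class and method names (WikipediaConfiguration.addDatabaseToCache, WDatabase.DatabaseType, WDatabase.CachePriority, the Wikipedia constructor with a threaded-caching flag) from memory, so treat this as a sketch and check them against your checkout:

import java.io.File;

import org.wikipedia.miner.db.WDatabase.CachePriority;
import org.wikipedia.miner.db.WDatabase.DatabaseType;
import org.wikipedia.miner.model.Article;
import org.wikipedia.miner.model.Wikipedia;
import org.wikipedia.miner.util.WikipediaConfiguration;

public class LabelOnlyCacheExample {
    public static void main(String[] args) throws Exception {
        // Load the usual language configuration (db/ directory, language code, etc.).
        WikipediaConfiguration conf = new WikipediaConfiguration(new File("configs/en.xml"));

        // Cache only the label database, trading speed for memory ("space" priority),
        // instead of caching everything the XML config asks for.
        conf.addDatabaseToCache(DatabaseType.label, CachePriority.space);

        // 'true' requests threaded caching, so the instance can be queried while it warms up.
        Wikipedia wikipedia = new Wikipedia(conf, true);

        Article art = wikipedia.getArticleByTitle("Iran");
        System.out.println(art == null ? "not found" : art.getTitle());

        wikipedia.close();
    }
}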
In my fork, I have fixed some bugs that I have encountered, but nothing related to speed or memory.
Btw thanks for reporting that it works.
@rom1504 how did you manage to build the db? There is a weird situation: the architecture in this repo differs from the archived 1.2 version I downloaded from http://wikipedia-miner.cms.waikato.ac.nz
And also, would it be possible to upload that db? I can provide FTP access for sharing.
Thank you.
@vshvedov I used the 1.2 version and followed https://github.com/dnmilne/wikipediaminer/wiki/Installing-the-java-api#build-the-berkeley-database to build the db. As I said by email, I can send you the db, but you should be able to build it yourself ;)
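For what it's worth, the build step on that wiki page boils down to running the EnvironmentBuilder entry point that also shows up in the stack trace above. A minimal wrapper would look something like this; the single argument (the path to the language config XML pointing at the CSV files and the target db/ directory) is my recollection of that page, so double-check it there:

import org.wikipedia.miner.util.EnvironmentBuilder;

public class BuildEnwikiDb {
    public static void main(String[] args) throws Exception {
        // Delegates to the same entry point normally run from the command line.
        EnvironmentBuilder.main(new String[] { "configs/en.xml" });
    }
}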
@rom1504 thank you for sharing. It seems that something is wrong with the 'translations.csv' file. After importing the CSV files into the database, calling 'Article.getTranslations()' ALWAYS returns nothing. Could you check the following code:
Article art = wikipedia.getArticleByTitle("Iran");
System.out.println(art.getTranslations().size());
The result is 0 in my installation.
The problem is that WM is based on the outdated assumption that interlanguage links are still present in the Wikipedia dump. For some time now they have (mostly) been stored in Wikidata. They are also present in the dumps, but in a different file (one of the SQL dumps, the langlinks table). WM would have to be updated to produce the correct translation data.
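Until WM is updated, a workaround (outside Wikipedia Miner entirely) is to pull the interlanguage links from the live MediaWiki API, which serves the same data that now sits in Wikidata and in the langlinks SQL dump. A minimal sketch that just prints the raw JSON for one hard-coded title, with no error handling:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class LangLinksLookup {
    public static void main(String[] args) throws Exception {
        // prop=langlinks returns the interlanguage links that used to live in the XML dump.
        String title = URLEncoder.encode("Iran", StandardCharsets.UTF_8.name());
        URL url = new URL("https://en.wikipedia.org/w/api.php"
                + "?action=query&prop=langlinks&lllimit=500&format=json&titles=" + title);

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("User-Agent", "langlinks-example/0.1");

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Raw JSON: each entry has a "lang" code and the translated title.
                System.out.println(line);
            }
        }
    }
}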
@rom1504 could you please tell me how I can get the following file: https://dumps.wikimedia.org/enwiki/20150403/enwiki-20150403-pages-articles.xml.bz2
The link cited above is not working.
Thank you in advance for your help
@TamouzeAssi, I think old dumps expire and become unavailable after a while. You can use a newer dump instead.
Hi,
Does anyone have the newest CSV extraction file together with the Wikipedia dump?
I am having trouble setting up the summary extraction, and the dump for the CSV summary (from 2011) is not there any more (it seems Wikipedia removes older dump versions regularly).
If anyone has those (at least as new as 2011, 2011 included), could you share them somewhere?