dbpedia / topicmodel-extractor

A repository for the "Combining DBpedia and Topic Modeling" GSoC 2016 idea

Implementation of the Wikipedia Dump Reader #1

Open dav009 opened 8 years ago

dav009 commented 8 years ago

Hi @wojtuch, I'm part of the dbpedia-spotlight community and I'm following your project out of pure curiosity.

There are a bunch of Wikipedia dump parsers. Some of them are quite outdated/painful to use, and a student from a previous GSoC spent a lot of time on this same step.

If you find yourself stuck at this step, consider taking a look at jsonwikipedia or at the wikistats extractor (it is used by dbpedia-spotlight); they might already extract/contain the information you need for your project.

wojtuch commented 8 years ago

Hi @dav009, thanks for your input!

As a matter of fact, I was struggling with Sweble a little bit, so I thought I could maybe use wikiforia to preprocess the articles and drop the wiki syntax. But after talking to my mentor I put this effort on hold -- for the midterm evaluation I will mine the topics on the abstracts only anyway, so for now it made no sense to invest more time in setting up the Wikipedia dump reader.

But for the final deliverable of the project, I will definitely have a closer look at what you proposed!

dav009 commented 8 years ago

So you are requiring the text of the articles without any wiki boilerplate?

dav009 commented 8 years ago

I was struggling with a similar need (getting text without boilerplate) and I just did a gigantic regex cleaning:

In https://github.com/idio/wiki2vec the first step is generating a "cleaned" Wikipedia version: each article is dumped on a single line and links are replaced by DBPEDIA/Wikititle, i.e. [[A|B]] will be replaced by A DBPEDIA/B.
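
For illustration, here is a minimal regex sketch of the link rewriting described above. It is not wiki2vec's actual code (which handles much more wiki markup); the underscore handling for multi-word titles is my own assumption:

```python
import re

# [[A|B]] -> "A DBPEDIA/B", following the mapping described above.
PIPED = re.compile(r"\[\[([^\]|]+)\|([^\]]+)\]\]")
# [[A]] -> "A DBPEDIA/A" (assumed analogous handling for unpiped links).
PLAIN = re.compile(r"\[\[([^\]|]+)\]\]")

def clean_links(text):
    text = PIPED.sub(lambda m: "%s DBPEDIA/%s" % (m.group(1), m.group(2).replace(" ", "_")), text)
    text = PLAIN.sub(lambda m: "%s DBPEDIA/%s" % (m.group(1), m.group(1).replace(" ", "_")), text)
    return text

print(clean_links("[[Topic model|topic models]] in [[DBpedia]]"))
# Topic model DBPEDIA/topic_models in DBpedia DBPEDIA/DBpedia
```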

If that shape is useful, you can generate one of those corpora yourself (it will only take a few minutes), or I can also give you some of the ones I have for older Wikipedias.


jsonwikipedia also gives you a "stream" of paragraphs, which are boilerplate-clean, but out of the box you have no way to aggregate the paragraphs belonging to the same article (if that is important for your implementation).
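
If that aggregation matters, something along these lines could work. This is a hypothetical helper: it assumes the stream yields (article_title, paragraph) pairs and that one article's paragraphs arrive consecutively, which may not match jsonwikipedia's actual output format:

```python
from itertools import groupby

def group_by_article(stream):
    # groupby only merges consecutive records with the same key, hence
    # the assumption that paragraphs of one article arrive back to back.
    for title, group in groupby(stream, key=lambda pair: pair[0]):
        yield title, [paragraph for _, paragraph in group]

stream = [
    ("DBpedia", "DBpedia is a crowd-sourced community effort..."),
    ("DBpedia", "The project extracts structured content..."),
    ("Topic model", "In machine learning, a topic model is..."),
]
for title, paragraphs in group_by_article(stream):
    print(title, len(paragraphs))
```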

jimkont commented 8 years ago

If you want only abstracts, you can re-use the existing abstracts extracted by DBpedia; abstracts do not change often enough to justify building your own pipeline (or use the DBpedia abstract extraction; you cannot get cleaner than that).

Since a couple of releases back, DBpedia uses https://www.mediawiki.org/wiki/Extension:TextExtracts, which is supported by Wikipedia and is also available live, e.g. https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exintro=1&explaintext=&titles=DBpedia
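
For example, a minimal Python sketch of hitting that live endpoint with the same parameters as the URL above (the JSON handling and User-Agent are my own, not DBpedia's extraction code):

```python
import requests

def fetch_abstract(title):
    # Same query as the example URL: plain-text extract of the intro only.
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "prop": "extracts",
            "exintro": 1,      # intro section only
            "explaintext": 1,  # plain text instead of HTML
            "format": "json",
            "titles": title,
        },
        headers={"User-Agent": "topicmodel-extractor-demo/0.1"},
    )
    pages = resp.json()["query"]["pages"]  # keyed by page id
    return next(iter(pages.values())).get("extract", "")

print(fetch_abstract("DBpedia")[:100])
```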

Other than that, for smaller-scale needs the live API above should be enough.

wojtuch commented 8 years ago

@jimkont: For the abstract texts I just parsed the N-Triples file long_abstracts_en.ttl from http://wiki.dbpedia.org/Downloads
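
For reference, a minimal sketch of how such a parse could look. It assumes one triple per line with the dbpedia.org/ontology/abstract predicate, which holds for that dump, but it is not a general N-Triples parser and the unescaping is simplistic:

```python
import re

TRIPLE = re.compile(
    r'<http://dbpedia\.org/resource/([^>]+)> '
    r'<http://dbpedia\.org/ontology/abstract> '
    r'"(.*)"@en \.')

def iter_abstracts(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            m = TRIPLE.match(line)
            if m:
                resource, abstract = m.groups()
                # N-Triples escapes quotes and backslashes inside literals.
                yield resource, abstract.replace('\\"', '"')

for resource, abstract in iter_abstracts("long_abstracts_en.ttl"):
    print(resource, abstract[:80])
    break
```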