dav009 opened this issue 8 years ago
Hi @dav009, thanks for your input!
As a matter of fact, I was struggling with Sweble a little bit, so I thought I could maybe use wikiforia to preprocess the articles and drop the wiki syntax. But after talking to my mentor I put this effort on hold -- for the midterm evaluation I will mine the topics on the abstracts only anyway, so for now it made no sense to invest more time in setting up the Wikipedia dump reader.
But for the final deliverable of the project, I will definitely have a closer look at what you proposed!
So you are requiring the text of the articles without any wiki boilerplate?
I was struggling with a similar need (getting text without boilerplate) and ended up doing a gigantic regex cleanup:
In https://github.com/idio/wiki2vec the first step is generating a "cleaned" Wikipedia version: each article is dumped on a single line, and links are replaced by DBPEDIA/Wikititle.
i.e. [[A|B]]
will be replaced by A DBPEDIA/B
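A rough sketch of that kind of regex replacement (not the actual wiki2vec code, just the `[[A|B]]` → `A DBPEDIA/B` shape described above) could look like:

```python
import re

# Match wiki links of the form [[target|anchor]].
LINK_RE = re.compile(r"\[\[([^\|\]]+)\|([^\]]+)\]\]")

def replace_links(text):
    # Rewrite [[A|B]] as "A DBPEDIA/B", per the shape described above.
    return LINK_RE.sub(lambda m: f"{m.group(1)} DBPEDIA/{m.group(2)}", text)

print(replace_links("See [[A|B]] for details."))
```

The real cleanup in wiki2vec handles many more cases (bare links, templates, tables), so treat this as an illustration of the idea only.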
If that shape is useful, you can generate one of those corpora yourself (it only takes a few minutes), or I can give you some of the ones I have for old Wikipedias.
Jsonwikipedia also gives you a "stream" of paragraphs which are boilerplate-clean, but out of the box you have no way to aggregate the paragraphs belonging to the same article (if that is important for your implementation).
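If the grouping does matter, and assuming you can tag each streamed paragraph with its article id (the actual jsonwikipedia record schema may differ), the aggregation itself is straightforward:

```python
from itertools import groupby

# Hypothetical paragraph stream: (article_id, paragraph_text) pairs.
# This only illustrates the aggregation idea, not jsonwikipedia's format.
stream = [
    ("Article_1", "First paragraph."),
    ("Article_1", "Second paragraph."),
    ("Article_2", "Another article."),
]

def aggregate(stream):
    # groupby assumes paragraphs of the same article arrive contiguously,
    # which holds if the stream is emitted article by article.
    for article_id, items in groupby(stream, key=lambda p: p[0]):
        yield article_id, " ".join(text for _, text in items)

for article_id, text in aggregate(stream):
    print(article_id, text)
```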
If you want only abstracts, you can reuse the existing abstracts extracted by DBpedia -- abstracts do not change often enough to justify building your own pipeline (or use the DBpedia abstract extraction; you cannot get cleaner than that).
Since a couple of releases back, DBpedia uses https://www.mediawiki.org/wiki/Extension:TextExtracts, which is supported by Wikipedia and is also available live, e.g. https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exintro=1&explaintext=&titles=DBpedia
Other than that, the live API is an option for smaller-scale needs.
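For reference, the query parameters from the example URL above can be assembled like this (the `format=json` parameter is my addition; the rest are taken straight from that URL):

```python
from urllib.parse import urlencode

# Parameters from the TextExtracts example URL above.
params = {
    "action": "query",
    "prop": "extracts",
    "exintro": 1,
    "explaintext": "",
    "titles": "DBpedia",
    "format": "json",  # assumption: JSON is the handiest output format
}
url = "https://en.wikipedia.org/w/api.php?" + urlencode(params)
print(url)
# Fetching this URL (e.g. with urllib.request.urlopen) returns the
# plain-text intro extract for the requested title.
```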
@jimkont: For the texts of the abstracts I just parsed the N-Triples file long_abstracts_en.ttl
from http://wiki.dbpedia.org/Downloads
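In case it helps, a minimal sketch of that kind of line-based N-Triples parse (a real RDF parser such as rdflib is safer; this assumes the simple one-triple-per-line shape of the DBpedia dump files):

```python
import re

# Match lines like:
# <subject-iri> <predicate-iri> "literal"@en .
TRIPLE_RE = re.compile(r'^<([^>]+)> <[^>]+> "(.*)"@en \.$')

def parse_abstracts(lines):
    # Yield (resource IRI, abstract text) for each matching line;
    # escaped characters inside the literal are left as-is.
    for line in lines:
        m = TRIPLE_RE.match(line.strip())
        if m:
            yield m.group(1), m.group(2)

sample = [
    '<http://dbpedia.org/resource/DBpedia> '
    '<http://dbpedia.org/ontology/abstract> '
    '"DBpedia is a crowd-sourced community effort."@en .',
]
for resource, abstract in parse_abstracts(sample):
    print(resource, "->", abstract)
```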
Hi @wojtuch, I'm part of the dbpedia-spotlight community and follow your project out of pure curiosity.
There are a bunch of Wikipedia dump parsers; some of them are quite outdated/painful to use. A student from a previous GSoC spent a lot of time on this same step.
If you find yourself stuck on this step, consider taking a look at jsonwikipedia or at the wikistats extractor (it is used by dbpedia-spotlight); they might already extract/contain the information you need for your project.