diegoceccarelli / json-wikipedia

Json Wikipedia contains code to convert the Wikipedia XML dump into a JSON/Avro dump
Apache License 2.0

Java API usage: how to parse a single article? #26

Closed thomasopsomer closed 6 years ago

thomasopsomer commented 6 years ago

Hi,

I initially posted this at idio/json-wikipedia#43, but I should have started here since this is the main repo :)

I'm trying to parse a single Wikipedia XML file, like the mercedes.xml used in this repo's tests. Following the code in the test section, I tried something like:

import it.cnr.isti.hpc.io.IOUtils // assumed package for IOUtils (hpc-utils)
import it.cnr.isti.hpc.wikipedia.article.Article
import it.cnr.isti.hpc.wikipedia.parser.ArticleParser

val parser = new ArticleParser("en")
val testXml = IOUtils.getFileAsUTF8String("./mercedes.xml")
val testArticle = new Article()
parser.parse(testArticle, testXml)

But the result is strange. Many properties are blank, like title, wikiTitle, ... and the paragraphs and clean text are also parsed incorrectly. I guess I'm doing something wrong ^^ If you could show how to use the API to process a single article in XML, that would be great :)

Thanks, Thomas

tgalery commented 6 years ago

Closing because of the discussion in idio/json-wikipedia#43.
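
For anyone landing here with the same question: one thing worth checking is whether the string handed to `parser.parse` is the whole XML export or only the article's wikitext. This thread does not confirm which one the parser expects, so the following is only a sketch under the assumption that it wants wikitext. It uses the JDK's built-in DOM parser (no json-wikipedia classes) to pull the `<title>` and `<text>` out of the first `<page>` element of a MediaWiki export file. The class name `SinglePageExtractor` and the `Page` holder are hypothetical names made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of json-wikipedia.
final class SinglePageExtractor {

    /** Holds the two fields pulled out of one <page> element. */
    static final class Page {
        final String title;
        final String wikitext;

        Page(String title, String wikitext) {
            this.title = title;
            this.wikitext = wikitext;
        }
    }

    /** Parses a MediaWiki export XML string and returns the first page's title and wikitext. */
    static Page extract(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        NodeList pages = doc.getElementsByTagName("page");
        if (pages.getLength() == 0) {
            throw new IllegalArgumentException("no <page> element found");
        }
        Element page = (Element) pages.item(0);

        // If the page carries several revisions, only the first <text> is used.
        return new Page(firstText(page, "title"), firstText(page, "text"));
    }

    /** Returns the text content of the first descendant with the given tag, or "" if absent. */
    private static String firstText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Tiny inline stand-in for a real export file such as mercedes.xml.
        String xml = "<mediawiki><page><title>Mercedes-Benz</title>"
                + "<revision><text>'''Mercedes-Benz''' is a brand.</text></revision>"
                + "</page></mediawiki>";
        Page page = extract(xml);
        System.out.println(page.title);
        System.out.println(page.wikitext);
    }
}
```

The extracted `wikitext` could then be passed as the second argument of the `parser.parse(testArticle, ...)` call from the question, in place of the raw XML string. Whether that fixes the blank fields is not established here; the discussion in idio/json-wikipedia#43 is the place to check.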