diegoceccarelli / json-wikipedia

Json Wikipedia contains code to convert the Wikipedia XML dump into a JSON/Avro dump
Apache License 2.0
252 stars, 42 forks

Multiprocessing ability with Apache Spark #46

Open tgalery opened 4 years ago

tgalery commented 4 years ago

Once you reach the main XML content of the Wikipedia dump, transforming the XML into JSON can be sped up considerably by running on Spark. This has already been done in the idio fork of this repo, so this PR serves as a basis for introducing it here: https://github.com/idio/json-wikipedia/pull/3/files. A few pointers: