Once you get to the main XML content of the wiki dump, transforming the XML into JSON can be sped up considerably by running it on Spark. This has already been done in the idio fork of this repo, so this PR serves as a basis for introducing it here: https://github.com/idio/json-wikipedia/pull/3/files. A few pointers:
Since the forks have diverged considerably, it's easier to start a new PR (from a branch).