attardi / wikiextractor

A tool for extracting plain text from Wikipedia dumps
GNU Affero General Public License v3.0
3.69k stars 959 forks

Add argument to preserve unicode characters in json output. #307

Open wayneworkman opened 1 year ago

wayneworkman commented 1 year ago

Here's a snippet from the Anthropology article before this code change, using the --json argument:

Their New Latin ' derived from the combining forms of the Greek words \"\u00e1nthr\u014dpos\" (, \"human\") and \"l\u00f3gos\" (, \"study\").

Here it is after these code changes, using --json and --preserve-unicode:

Their New Latin ' derived from the combining forms of the Greek words \"ánthrōpos\" (, \"human\") and \"lógos\" (, \"study\").

For text-processing work, it's nice to have the option to preserve these Unicode characters in their true form, rather than as ASCII escape sequences.
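For illustration, the difference between the two outputs above can be reproduced with Python's standard json module. This is only a minimal sketch, assuming the change amounts to passing ensure_ascii=False to json.dumps(); the actual diff isn't shown here, and the record and its field names are hypothetical.

```python
import json

# Hypothetical record; the field names are illustrative only.
record = {
    "title": "Anthropology",
    "text": 'the Greek words "ánthrōpos" and "lógos"',
}

# Default behaviour: ensure_ascii=True escapes every non-ASCII character.
escaped = json.dumps(record)

# Assumed behaviour of --preserve-unicode: non-ASCII characters are written as-is.
preserved = json.dumps(record, ensure_ascii=False)

print(escaped)
print(preserved)

# Both strings decode to the same Python object.
print(json.loads(escaped) == json.loads(preserved))
```

When the preserved form is written to a file, the file should be opened with an explicit encoding such as encoding="utf-8" so the characters survive regardless of the platform default.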

wayneworkman commented 1 year ago

Thinking more about this pull request: when you load the JSON output text using json.loads() in Python, the escaped Unicode is decoded correctly in the loaded data. Given that, I think this pull request is only valid for a few scenarios: