Add argument to preserve unicode characters in json output.

attardi / wikiextractor

A tool for extracting plain text from Wikipedia dumps

GNU Affero General Public License v3.0

3.69k stars 959 forks

Here's a snippet from the Anthropology article before this code change, using the --json argument:

Their New Latin ' derived from the combining forms of the Greek words \"\u00e1nthr\u014dpos\" (, \"human\") and \"l\u00f3gos\" (, \"study\").

Here it is after these code changes, using --json and --preserve-unicode:

Their New Latin ' derived from the combining forms of the Greek words \"ánthrōpos\" (, \"human\") and \"lógos\" (, \"study\").

For text-processing tasks, it's nice to have the option to preserve these Unicode characters in their true form, rather than as ASCII escape sequences.
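The difference between the two outputs can be reproduced with Python's standard `json` module. This is a minimal sketch, not the code from this change: it assumes the option maps onto the `ensure_ascii` parameter of `json.dumps`, and the `page` record below is a hypothetical stand-in for an extracted article.

```python
import json

# Hypothetical record standing in for one extracted article (not real extractor output).
page = {
    "title": "Anthropology",
    "text": 'the Greek words "ánthrōpos" and "lógos"',
}

# Default behaviour: ensure_ascii=True escapes every non-ASCII character as \uXXXX.
escaped = json.dumps(page)

# With ensure_ascii=False the characters are written as-is.
preserved = json.dumps(page, ensure_ascii=False)

print(escaped)
print(preserved)

# Both strings decode to the same data; only the serialized form differs.
assert json.loads(escaped) == json.loads(preserved) == page
```

When writing the preserved form to disk, the output file should be opened with an explicit encoding such as UTF-8 so the non-ASCII characters are not mangled by a platform default.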

Add argument to preserve unicode characters in json output. #307