attardi / wikiextractor

A tool for extracting plain text from Wikipedia dumps
GNU Affero General Public License v3.0

Non-latin characters missing in output #153

Open shyamupa opened 6 years ago

shyamupa commented 6 years ago

For pages like https://en.wikipedia.org/wiki/Eva_Khachatryan, the generated output omits the name in its original non-Latin script:

Eva Khachatryan

Eva Khachatryan (, born on December 13, 1990), is an Armenian actress.
...

The original page's content looks like this:

Eva Khachatryan (Armenian: Էվա Խաչատրյան, born on December 13, 1990), is an Armenian actress.

Is there a way to preserve this?

jhsoby commented 5 years ago

Since the output doesn't include "Armenian: " either, this looks to me less like a problem with non-Latin characters and more like the entire template (in this case, {{lang-hy}}) being dropped. In other words, this might be the same as Issue #151.
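To illustrate the diagnosis, here is a rough Python sketch of the difference between dropping a template and expanding it. It is not wikiextractor's code: the function names, the tiny `LANG_NAMES` table, and the sample wikitext are all assumptions made up for this example, and the regexes only handle simple, non-nested templates.

```python
import re

# Hypothetical, deliberately tiny mapping from language codes to names;
# the real {{lang-xx}} family covers far more languages.
LANG_NAMES = {"hy": "Armenian", "ru": "Russian", "el": "Greek"}

# Matches simple, non-nested templates such as {{lang-hy|Էվա Խաչատրյան}}.
LANG_TEMPLATE = re.compile(r"\{\{lang-([a-z-]+)\|([^{}|]+)\}\}")

# Matches any other simple, non-nested template.
ANY_TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")


def expand_lang_templates(wikitext: str) -> str:
    """Replace {{lang-xx|text}} with 'Language: text' (or just the text)."""
    def repl(match: re.Match) -> str:
        code, text = match.group(1), match.group(2).strip()
        name = LANG_NAMES.get(code)
        return f"{name}: {text}" if name else text
    return LANG_TEMPLATE.sub(repl, wikitext)


def drop_templates(wikitext: str) -> str:
    """Remove every simple template, which is what produces '(, born ...'."""
    return ANY_TEMPLATE.sub("", wikitext)


# Assumed wikitext for the sentence quoted in this issue.
source = ("Eva Khachatryan ({{lang-hy|Էվա Խաչատրյան}}, "
          "born on December 13, 1990), is an Armenian actress.")

dropped = drop_templates(source)
expanded = drop_templates(expand_lang_templates(source))
print(dropped)
print(expanded)
```

Under these assumptions, `dropped` reproduces the reported output with the empty `(, born ...` gap, while `expanded` keeps both the "Armenian: " label and the Armenian-script name. A real fix would need actual template expansion rather than a hard-coded table.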