attardi / wikiextractor

A tool for extracting plain text from Wikipedia dumps
GNU Affero General Public License v3.0

Error: WikiExtractor.py:2703 #92

Open AmenRa opened 7 years ago

AmenRa commented 7 years ago

Hello, I've downloaded the XML version of some Wikipedia articles using the Special:Export utility, but I get an error with this article: 24th_Waffen_Mountain_Division_of_the_SS_Karstjäger.zip

I think the error is caused by some special character, but I'm not sure. I need help quickly because this problem is blocking my thesis work.

Any help would be great. Thanks in advance.

attardi commented 7 years ago

Please specify which dump file you are using. The extractor only works on files from the official Wikipedia dump site (https://dumps.wikimedia.org) or on files in the same format.

AmenRa commented 7 years ago

I downloaded the articles in XML format from here: Special:Export. It's a download method provided by Wikipedia itself, and XML files downloaded this way can also be used to build other wikis, so I think they are very similar to the XML files in Wikipedia's dumps.

I have used your script on more than 700 articles downloaded this way without problems. I attached the XML file that causes the problem to my previous post, if you want to take a look.

Thank you.

AmenRa commented 7 years ago

PS: I've removed that article from my collection. I'll let you know whether the error is raised only by that article or whether other articles cause the same problem.

AmenRa commented 7 years ago

I found out that the extractor has problems with filenames that contain special characters like ū or ć. I don't know whether the problem lies in Python itself or in the extractor's code; I don't have this kind of problem in other languages (JavaScript/Node.js).

Hope this can help you in some way.
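
For anyone hitting the same thing, one possible workaround is to rename the input files to ASCII-only names before running the extractor. The snippet below is only a sketch in Python 3 and has not been tested against WikiExtractor itself; the helper name ascii_safe_name is made up for illustration, and it assumes the failure really is triggered by the non-ASCII filename rather than by the article content.

```python
import unicodedata


def ascii_safe_name(name):
    """Return an ASCII-only version of a filename (hypothetical helper).

    Accented letters are decomposed into a base letter plus combining
    marks (NFKD), the combining marks are dropped, and any remaining
    non-ASCII character is replaced with an underscore.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    out = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            # Skip accents and other combining marks.
            continue
        out.append(ch if ord(ch) < 128 else "_")
    return "".join(out)


if __name__ == "__main__":
    print(ascii_safe_name("24th_Waffen_Mountain_Division_of_the_SS_Karstjäger.xml"))
```

The renamed file can then be passed to the extractor as usual. Note that this changes only the filename, not the article title stored inside the XML.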