WikiTeam / wikiteam

Tools for downloading and preserving wikis. We archive wikis, from Wikipedia to the tiniest wikis. As of 2024, WikiTeam has preserved more than 600,000 wikis.
https://github.com/WikiTeam
GNU General Public License v3.0

UnicodeWarning and UnicodeEncodeError issues #136

Open nemobis opened 10 years ago

nemobis commented 10 years ago

Simple incompatibility between old image list and current master, or something more?

Resuming download, using directory eswikiarquitecturacom-20140628-wikidump
[...]
You didn't provide a path for index.php, we try this one: http://es.wikiarquitectura.com/index.php
Checking api.php... http://es.wikiarquitectura.com/api.php
api.php is OK
Checking index.php... http://es.wikiarquitectura.com/index.php
index.php is OK
Analysing http://es.wikiarquitectura.com/api.php
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
XML dump was completed in the previous session
Image list was completed in the previous session
./dumpgenerator.py:1232: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if filename2 not in listdir:

emijrp commented 10 years ago

Now it reads the image list file as unicode, and it is comparing with os.listdir() which is returning not unicode. I don't think it is serious, but I can check it tomorrow.
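The mismatch described above (text from the image list compared with byte strings from the directory listing) can be avoided by converting both sides to text before comparing. A minimal sketch, not the project's actual fix: `to_text` and `already_downloaded` are hypothetical helpers, UTF-8 is an assumed encoding, and the code is written for Python 3. Under Python 2.7 the same idea would apply with `str`/`unicode` in place of `bytes`/`str`.

```python
def to_text(name, encoding="utf-8"):
    """Return name as text, decoding it only if it is a byte string."""
    if isinstance(name, bytes):
        return name.decode(encoding)
    return name


def already_downloaded(filename, listdir_names, encoding="utf-8"):
    """Check membership after normalizing every directory entry to text."""
    normalized = {to_text(entry, encoding) for entry in listdir_names}
    return to_text(filename, encoding) in normalized


# Directory entries as bytes, image-list entry as text (hypothetical names).
entries = ["Plano.jpg".encode("utf-8"), "Fachada espa\u00f1ola.jpg".encode("utf-8")]
found = already_downloaded("Fachada espa\u00f1ola.jpg", entries)
missing = already_downloaded("Otro.jpg", entries)
```

Normalizing once, at the point where names enter the program, keeps every later comparison between values of the same type.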

nemobis commented 10 years ago

Ok. The dump is proceeding, I'll check at the end if some image is missing. (Update: I forgot to count them, there is a big dump at https://archive.org/details/wiki-eswikiarquitecturacom though.)

nemobis commented 10 years ago

Some more despite https://github.com/WikiTeam/wikiteam/pull/124 , on wikihow.com with latest master:

Downloaded 30 pages
"Hit" Someone on Pandanda, 0 edits
"Hog Flip" in Halo, 0 edits
  File "dumpgenerator.py", line 1503, in <module>
    main()
  File "dumpgenerator.py", line 1495, in main
    createNewDump(config=config, other=other)
  File "dumpgenerator.py", line 1241, in createNewDump
    generateXMLDump(config=config, titles=titles, session=other['session'])
  File "dumpgenerator.py", line 579, in generateXMLDump
    xml = getXMLPage(config=config, title=title, session=session)
  File "dumpgenerator.py", line 512, in getXMLPage
    print ' %s, %d edits' % (title, numberofedits)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 119: ordinal not in range(128)
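The failing `print` in this traceback encodes the page title as ASCII on output. One defensive pattern is to replace characters the output encoding cannot represent before printing. A minimal sketch under assumptions: `printable` is a hypothetical helper, not a dumpgenerator.py function, the title is made up, and this is not necessarily how the bug was fixed upstream. Written for Python 3.

```python
def printable(text, encoding="ascii"):
    """Replace characters that `encoding` cannot represent with '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)


title = "Title with an en dash \u2013 inside"  # hypothetical page title
line = "    %s, %d edits" % (printable(title), 0)
print(line)
```

The progress message loses the exact character, but the dump itself is unaffected because only the console output goes through `printable`.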

PiRSquared17 commented 9 years ago

Can you reproduce this error still? The one you mentioned in the last comment has already been fixed. Not sure about the original one.

nemobis commented 9 years ago

Can't reproduce now either. Though the original comment might have been about an image list produced with one version of dumpgenerator and then used with another, incompatible one.

federico@lakka:~/siilo/wikiteam/wikiteam$ python dumpgenerator.py --api=http://es.wikiarquitectura.com/api.php --xml --namespaces=8 --images  
Checking API... http://es.wikiarquitectura.com/api.php
API is OK
Checking index.php... http://es.wikiarquitectura.com/index.php
index.php is OK
#########################################################################
# Welcome to DumpGenerator 0.3.0-alpha by WikiTeam (GPL v3)                   #
# More info at: https://github.com/WikiTeam/wikiteam                    #
#########################################################################

#########################################################################
# Copyright (C) 2011-2014 WikiTeam                                      #
# This program is free software: you can redistribute it and/or modify  #
# it under the terms of the GNU General Public License as published by  #
# the Free Software Foundation, either version 3 of the License, or     #
# (at your option) any later version.                                   #
#                                                                       #
# This program is distributed in the hope that it will be useful,       #
# but WITHOUT ANY WARRANTY; without even the implied warranty of        #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the         #
# GNU General Public License for more details.                          #
#                                                                       #
# You should have received a copy of the GNU General Public License     #
# along with this program.  If not, see <http://www.gnu.org/licenses/>. #
#########################################################################

Analysing http://es.wikiarquitectura.com/api.php
Trying generating a new dump into a new directory...
Loading page titles from namespaces = 8
Excluding titles from namespaces = None
1 namespaces found
    Retrieving titles in the namespace 8
.    5 titles retrieved in the namespace 8
5 page titles loaded
Titles saved at... eswikiarquitecturacom-20140919-titles.txt
Retrieving the XML for every page from "start"
    MediaWiki:Common.css, 8 edits
    MediaWiki:Mainpage, 1 edit
    MediaWiki:Newarticletext, 1 edit
    MediaWiki:Sidebar, 1 edit
    MediaWiki:Sitenotice, 1 edit
XML dump saved at... eswikiarquitecturacom-20140919-history.xml
Retrieving image filenames
....................................................................    Found 33592 images
33592 image names loaded
Image filenames and URLs saved at... eswikiarquitecturacom-20140919-images.txt
Retrieving images from "start"
Creating "./eswikiarquitecturacom-20140919-wikidump/images" directory
    Downloaded 10 images
^CTraceback (most recent call last):
  File "dumpgenerator.py", line 1602, in <module>
    main()
  File "dumpgenerator.py", line 1594, in main
    createNewDump(config=config, other=other)
  File "dumpgenerator.py", line 1288, in createNewDump
    generateImageDump(config=config, other=other, images=images, session=other['session'])
  File "dumpgenerator.py", line 869, in generateImageDump
    filename), session=session)  # use Image: for backwards compatibility
  File "dumpgenerator.py", line 377, in getXMLFileDesc
    return getXMLPage(config=config, title=title, verbose=False, session=session)
  File "dumpgenerator.py", line 472, in getXMLPage
    xml = getXMLPageCore(params=params, config=config, session=session)
  File "dumpgenerator.py", line 440, in getXMLPageCore
    r = session.post(url=config['index'], data=params, headers=headers)
  File "/home/users/federico/.local/lib/python2.7/site-packages/requests/sessions.py", line 498, in post
    return self.request('POST', url, data=data, **kwargs)
  File "/home/users/federico/.local/lib/python2.7/site-packages/requests/sessions.py", line 456, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/users/federico/.local/lib/python2.7/site-packages/requests/sessions.py", line 559, in send
    r = adapter.send(request, **kwargs)
  File "/home/users/federico/.local/lib/python2.7/site-packages/requests/adapters.py", line 327, in send
    timeout=timeout
  File "/home/users/federico/.local/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 493, in urlopen
    body=body, headers=headers)
  File "/home/users/federico/.local/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 319, in _make_request
    httplib_response = conn.getresponse(buffering=True)
  File "/usr/lib/python2.7/httplib.py", line 1034, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 407, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 365, in _read_status
    line = self.fp.readline()
  File "/usr/lib/python2.7/socket.py", line 447, in readline
    data = self._sock.recv(self._rbufsize)
KeyboardInterrupt
federico@lakka:~/siilo/wikiteam/wikiteam$ python dumpgenerator.py --api=http://es.wikiarquitectura.com/api.php --xml --namespaces=8 --images
Checking API... http://es.wikiarquitectura.com/api.php
API is OK
Checking index.php... http://es.wikiarquitectura.com/index.php
index.php is OK
Analysing http://es.wikiarquitectura.com/api.php

Warning!: "./eswikiarquitecturacom-20140919-wikidump" path exists
There is a dump in "./eswikiarquitecturacom-20140919-wikidump", probably incomplete.
If you choose resume, to avoid conflicts, the parameters you have chosen in the current session will be ignored
and the parameters available in "./eswikiarquitecturacom-20140919-wikidump/config.txt" will be loaded.
Do you want to resume ([yes, y], [no, n])? y
You have selected: YES
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
XML dump was completed in the previous session
Image list was completed in the previous session
17 images were found in the directory from a previous session
Retrieving images from "00 centro kimmel.jpg"
    Downloaded 10 images

nemobis commented 9 years ago

Analysing http://africanspecies.net/api.php
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
Resuming XML dump from "불활성화 백신"
Retrieving the XML for every page from "불활성화 백신"
./dumpgenerator.py:624: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if title == start: # start downloading from start, included
XML dump saved at... africanspeciesnet-20141127-history.xml
Image list is incomplete. Reloading...
Retrieving image filenames
. Found 337 images

nemobis commented 9 years ago

I'm also wondering whether resume works... it would be terrible if the bug makes us "close" incomplete dumps.

Analysing http://wiki.megatec.ru/api.php
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
Resuming XML dump from "Мастер-Web:Установка версии 7.2"
Retrieving the XML for every page from "Мастер-Web:Установка версии 7.2"
./dumpgenerator.py:624: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if title == start: # start downloading from start, included
XML dump saved at... wikimegatecru-20141203-history.xml
Image list is incomplete. Reloading...
Retrieving image filenames
........ Found 3722 images

DrDevice commented 9 years ago

Sorry if this is bad etiquette (I'm new), but I was wondering if there is any update on this? I get a UnicodeEncodeError whenever I run python dumpgenerator.py --api=http://ark.gamepedia.com/api.php --xml --curonly --images --delay 5 --resume --path=arkgamepediacom-20150717-wikidump/. The output is:

Analysing http://ark.gamepedia.com/api.php
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
XML dump was completed in the previous session
Image list was completed in the previous session
195 images were found in the directory from a previous session
Retrieving images from "Campfire.png"
Sleeping... 5 seconds...
Sleeping... 5 seconds...
Sleeping... 5 seconds...
Traceback (most recent call last):
  File "dumpgenerator.py", line 2031, in <module>
    main()
  File "dumpgenerator.py", line 2021, in main
    resumePreviousDump(config=config, other=other)
  File "dumpgenerator.py", line 1745, in resumePreviousDump
    session=other['session'])
  File "dumpgenerator.py", line 1071, in generateImageDump
    imagefile = open(filename3, 'wb')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 53: ordinal not in range(128)

I'm using the most recent dumpgenerator.py as of this writing.

emijrp commented 9 years ago

Hello DrDevice. This bug still needs a fix. A workaround: you can remove the offending image filename from the -images.txt file in the dump directory, and then resume. According to that wiki, it is "Capture d'écran 2015-06-13 11.20.59.png". If you hit more errors, remove those filenames too, but I don't see any other non-ASCII characters in the list.

http://ark.gamepedia.com/index.php?title=Special%3APrefixIndex&prefix=&namespace=6
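The manual workaround above can be automated by splitting the image list into lines that are pure ASCII and lines that are not. A rough sketch under assumptions: the list is treated as UTF-8 text with one entry per line, `split_ascii_lines` is a hypothetical helper, and the sample entries only imitate the real file. Back up the original list first, because dropped entries will not be downloaded.

```python
def split_ascii_lines(lines):
    """Return (ascii_only, non_ascii) lists, preserving line order."""
    ascii_only, non_ascii = [], []
    for line in lines:
        try:
            line.encode("ascii")
        except UnicodeEncodeError:
            # At least one character falls outside the ASCII range.
            non_ascii.append(line)
        else:
            ascii_only.append(line)
    return ascii_only, non_ascii


sample = ["Campfire.png\n", "Capture d'\u00e9cran 2015-06-13 11.20.59.png\n"]
kept, dropped = split_ascii_lines(sample)
```

Writing `kept` back to the list file and saving `dropped` elsewhere keeps a record of which images still need to be fetched once the bug is fixed.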

DrDevice commented 9 years ago

emijrp, thank you very much! That seems to have cleared it up! It's been trucking on for a couple hours now, no errors. Crossing my fingers! :)

burner1024 commented 7 years ago

This is still an issue. I've tried patches from #279, didn't help.

ouaibe commented 6 years ago

I recently ran into the same issue with a similar message but for another part of the script.

The decode statement at https://github.com/WikiTeam/wikiteam/blob/master/dumpgenerator.py#L1999 was raising an exception, which made the script conclude that the image folder did not exist and forced the resumed dump to re-download all the images for no good reason. This line should probably be modified to distinguish a non-existent directory from any other exception.
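The distinction suggested above could look roughly like the following. This is a sketch under assumptions, not the code at the linked line: `list_downloaded_images` is a hypothetical helper, UTF-8 file names are assumed, and the code is written for Python 3, where a missing directory raises `FileNotFoundError`.

```python
import os
import tempfile


def list_downloaded_images(images_dir, encoding="utf-8"):
    """Return decoded file names, or None only if the directory is missing.

    Decoding errors are deliberately not caught, so they cannot be
    mistaken for "no images were downloaded yet".
    """
    try:
        raw_names = os.listdir(os.fsencode(images_dir))  # bytes in, bytes out
    except FileNotFoundError:
        return None
    return [name.decode(encoding) for name in raw_names]


base = tempfile.mkdtemp()
missing = list_downloaded_images(os.path.join(base, "images"))  # -> None
os.mkdir(os.path.join(base, "images"))
empty = list_downloaded_images(os.path.join(base, "images"))    # -> []
```

Catching only the missing-directory case means an encoding problem surfaces as an error instead of silently restarting the image download.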

Anyways, the exception thrown was:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xxxx' in position YY: ordinal not in range(128)

It turned out that the Python 2.7 environment used 'ascii' as its default encoding, as shown by python -c 'import sys; print(sys.getdefaultencoding())'

This was fixed by modifying /usr/lib/python2.7/sitecustomize.py to add the following lines that force utf8 default encoding in the Python 2.7 environment.

import sys
sys.setdefaultencoding('UTF8')

Slider-Whistle commented 5 years ago

@ouaibe Thanks for the tip, I thought it must've been a bug in wikiteam. They should be able to set this somewhere themselves, right?

wlhlm commented 5 years ago

I'd like to pile on and say that I've also stumbled upon this issue or a similar one:

$ python ../wikidump/wikiteam/dumpgenerator.py "https://minecraft-de.gamepedia.com/" --xml --images
[...]
    Downloaded 5600 images
    Downloaded 5610 images
    Downloaded 5620 images
Traceback (most recent call last):
  File "../wikidump/wikiteam/dumpgenerator.py", line 2323, in <module>
    main()
  File "../wikidump/wikiteam/dumpgenerator.py", line 2313, in main
    resumePreviousDump(config=config, other=other)
  File "../wikidump/wikiteam/dumpgenerator.py", line 2030, in resumePreviousDump
    session=other['session'])
  File "../wikidump/wikiteam/dumpgenerator.py", line 1318, in generateImageDump
    text=u'The page "%s" was missing in the wiki (probably deleted)' % (title.decode('utf-8'))
  File "/home/wlhlm/vault/share/mc/wikidump/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 13: ordinal not in range(128)

Trying to resume, I'm hitting #250, meaning that dumpgenerator.py fails to detect previously downloaded images and starts from the beginning:

$ python ../wikidump/wikiteam/dumpgenerator.py "https://minecraft-de.gamepedia.com/" --xml --images --resume --path minecraft_degamepediacom-20190825-wikidump/
Checking API... https://minecraft-de.gamepedia.com/api.php
API is OK: https://minecraft-de.gamepedia.com/api.php
Checking index.php... https://minecraft-de.gamepedia.com/index.php
index.php is OK
#########################################################################
# Welcome to DumpGenerator 0.4.0-alpha by WikiTeam (GPL v3)                   #
# More info at: https://github.com/WikiTeam/wikiteam                    #
#########################################################################

#########################################################################
# Copyright (C) 2011-2019 WikiTeam developers                           #
# This program is free software: you can redistribute it and/or modify  #
# it under the terms of the GNU General Public License as published by  #
# the Free Software Foundation, either version 3 of the License, or     #
# (at your option) any later version.                                   #
#                                                                       #
# This program is distributed in the hope that it will be useful,       #
# but WITHOUT ANY WARRANTY; without even the implied warranty of        #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the         #
# GNU General Public License for more details.                          #
#                                                                       #
# You should have received a copy of the GNU General Public License     #
# along with this program.  If not, see <http://www.gnu.org/licenses/>. #
#########################################################################

Analysing https://minecraft-de.gamepedia.com/api.php
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
XML dump was completed in the previous session
Image list was completed in the previous session
0 images were found in the directory from a previous session
Retrieving images from "start"
    Downloaded 10 images
^C

But, of course, resuming doesn't do a whole lot, since it will hit the same UnicodeEncodeError again.

The workaround described by @ouaibe worked. Editing sitecustomize.py and adding sys.setdefaultencoding('UTF8') was unproblematic because I was working in a virtualenv, but I'm not sure how well it would work when editing the global /usr/lib/python2.7/sitecustomize.py, since that can affect other Python scripts.

Python 2.7.16
dumpgenerator.py 080b723334127e7bfff97497a9aea75c97f310d5
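An alternative to changing the interpreter-wide default encoding is to encode explicitly at the point where text leaves the program, for example when a file name is passed to `open`. A minimal sketch, not a patch for dumpgenerator.py: `write_image` is a hypothetical helper, UTF-8 file names are an assumption about the target filesystem, and the code is written for Python 3.

```python
import os
import tempfile


def write_image(directory, filename, data, encoding="utf-8"):
    """Write data to directory/filename, encoding the path explicitly."""
    path = os.path.join(directory, filename).encode(encoding)
    with open(path, "wb") as handle:
        handle.write(data)
    return path


target_dir = tempfile.mkdtemp()
saved = write_image(target_dir, "Capture d'\u00e9cran.png", b"\x89PNG")
```

Because the encoding is chosen at the call site, other scripts sharing the same Python installation are not affected.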