WikiTeam / wikiteam

Tools for downloading and preserving wikis. We archive wikis, from Wikipedia to the tiniest wikis. As of 2023, WikiTeam has preserved more than 350,000 wikis.
https://github.com/WikiTeam
GNU General Public License v3.0

Use urllib.quote on str, not unicode, for image titles #469

Open Pokechu22 opened 1 year ago

Pokechu22 commented 1 year ago

Previously, dumping images whose file names contain non-ASCII characters would crash when using --xmlrevisions (a regression from #451 by @yzqzss). For example, French Wikiversity has Fichier:Ensemble_access_croisé.jpg (I'm using that wiki as an example here because it's a small site that I know will continue to exist).

$ python2 -b -u dumpgenerator.py --xmlrevisions --images --force --api https://fr.wikiversity.org/w/api.php
Checking API... https://fr.wikiversity.org/w/api.php
API is OK: https://fr.wikiversity.org/w/api.php
Checking index.php... https://fr.wikiversity.org/w/index.php
index.php is OK
PLEASE, DO NOT USE THIS SCRIPT TO DOWNLOAD WIKIMEDIA PROJECTS!
Download the dumps from http://dumps.wikimedia.org
#########################################################################
# Welcome to DumpGenerator 0.4.0-alpha by WikiTeam (GPL v3)                   #
# More info at: https://github.com/WikiTeam/wikiteam                    #
#########################################################################

#########################################################################
# Copyright (C) 2011-2023 WikiTeam developers                           #

# This program is free software: you can redistribute it and/or modify  #
# it under the terms of the GNU General Public License as published by  #
# the Free Software Foundation, either version 3 of the License, or     #
# (at your option) any later version.                                   #
#                                                                       #
# This program is distributed in the hope that it will be useful,       #
# but WITHOUT ANY WARRANTY; without even the implied warranty of        #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the         #
# GNU General Public License for more details.                          #
#                                                                       #
# You should have received a copy of the GNU General Public License     #
# along with this program.  If not, see <http://www.gnu.org/licenses/>. #
#########################################################################

Analysing https://fr.wikiversity.org/w/api.php
Trying generating a new dump into a new directory...
Retrieving image filenames
..    Found 84 images
84 image names loaded
Image filenames and URLs saved at... frwikiversityorg_w-20230624-images.txt
Retrieving images from "start"
Creating "./frwikiversityorg_w-20230624-wikidump/images" directory
    Downloaded 10 images
    Downloaded 20 images
    Downloaded 30 images
    Downloaded 40 images
/usr/lib/python2.7/urllib.py:1306: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  return ''.join(map(quoter, s))
Traceback (most recent call last):
  File "dumpgenerator.py", line 2572, in <module>
    main()
  File "dumpgenerator.py", line 2564, in main
    createNewDump(config=config, other=other)
  File "dumpgenerator.py", line 2147, in createNewDump
    session=other['session'])
  File "dumpgenerator.py", line 1524, in generateImageDump
    r = session.get(config['api'] + u"?action=query&export&exportnowrap&titles=%s" % urllib.quote(title))
  File "/usr/lib/python2.7/urllib.py", line 1306, in quote
    return ''.join(map(quoter, s))
KeyError: u'\xe9'

Python 2's str vs unicode and the implicit conversions are a mess.
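
As a rough illustration of the idea in the title (not the actual patch), the sketch below encodes a title to UTF-8 bytes before percent-encoding it. As far as I understand it, Python 2's urllib.quote looks each character up in a table keyed by byte strings, so a unicode character such as u'\xe9' has no entry and raises the KeyError shown above. The helper name quote_title is hypothetical, and the try/except import is only there so the snippet also runs on Python 3.

```python
try:
    from urllib import quote  # Python 2
except ImportError:
    from urllib.parse import quote  # Python 3


def quote_title(title):
    """Percent-encode a page title (hypothetical helper, not from dumpgenerator.py).

    Text titles are encoded to UTF-8 bytes first, so quote() only ever
    sees bytes (Python 2 `str`), never Python 2 `unicode`.
    """
    if not isinstance(title, bytes):
        title = title.encode("utf-8")
    return quote(title)


# The non-ASCII file name from the traceback, written with an escape
# so the snippet needs no source-encoding declaration.
print(quote_title(u"Fichier:Ensemble_access_crois\xe9.jpg"))
```

With the default safe characters, the colon is escaped as %3A and the é becomes the two UTF-8 bytes %C3%A9.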

nemobis commented 1 year ago

Ah yes, that error has been annoying. Thanks!

Do we really need to pass the URL as unicode? We have the titles as unicode because we're trying to convert at the I/O boundary (so that e.g. we compare titles in unicode only), but I'm not sure requests needs it.

Pokechu22 commented 1 year ago

config['api'] is str, so the URL should probably also be str. Good catch.

Pokechu22 commented 11 months ago

Is anything holding up this being merged?