WikiTeam / wikiteam

Tools for downloading and preserving wikis. We archive wikis, from Wikipedia to the tiniest wikis. As of 2024, WikiTeam has preserved more than 600,000 wikis.
https://github.com/WikiTeam
GNU General Public License v3.0
730 stars, 151 forks

Pass requests session to mwclient #441

Pokechu22 closed 2 years ago

Pokechu22 commented 2 years ago

This means mwclient uses our configured user-agent, as well as any cookies.

This was needed to save https://wiiki.wii-homebrew.com/Hauptseite and https://breath-of-the-wild.wii-homebrew.com/Hauptseite, which are behind Cloudflare. I completed the Cloudflare challenge in my browser and then saved the following into a file named cookies.txt. Here 1698010873 is the expiry time as a Unix timestamp (it could probably have been anything), and longalphanumericstring is the value seen in Firefox's devtools. The values are separated by tabs, and there was no trailing newline:

# Netscape HTTP Cookie File
.wii-homebrew.com   TRUE    /   TRUE    1698010873  cf_clearance    longalphanumericstring
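
As a quick sanity check on a file like the one above, the following minimal sketch writes a one-cookie file with tab-separated fields and reads it back with Python 3's standard `http.cookiejar.MozillaCookieJar`. The helper names `write_cookie_file` and `load_cookies` are hypothetical and not part of dumpgenerator.py, and the expiry is set one hour in the future because the loader skips expired entries by default.

```python
import http.cookiejar
import os
import tempfile
import time


def write_cookie_file(path, domain, name, value, expires):
    """Write a one-cookie Netscape-format file (hypothetical helper)."""
    # Field order: domain, include-subdomains flag, path, secure flag,
    # expiry (Unix timestamp), cookie name, cookie value.
    fields = [domain, "TRUE", "/", "TRUE", str(expires), name, value]
    with open(path, "w") as f:
        f.write("# Netscape HTTP Cookie File\n")
        f.write("\t".join(fields))  # tab-separated, no trailing newline


def load_cookies(path):
    """Load the file and return a {name: value} dict (hypothetical helper)."""
    jar = http.cookiejar.MozillaCookieJar()
    # Expired cookies are silently skipped unless ignore_expires=True.
    jar.load(path)
    return {cookie.name: cookie.value for cookie in jar}


cookie_path = os.path.join(tempfile.mkdtemp(), "cookies.txt")
expires = int(time.time()) + 3600  # assumed: any future timestamp will do
write_cookie_file(
    cookie_path, ".wii-homebrew.com", "cf_clearance",
    "longalphanumericstring", expires,
)
cookies = load_cookies(cookie_path)
print(cookies)
```

If the fields are separated by spaces instead of tabs, or the header line is missing, `load` raises `http.cookiejar.LoadError`, which makes this a cheap way to validate a hand-written file before a long run.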

Then running with --cookies cookies.txt worked properly. Note that I also changed the user-agent to match what Firefox currently uses (and what was used when I completed the challenge); I'm not sure if this was actually needed:

diff --git a/dumpgenerator.py b/dumpgenerator.py
index d99cb68..ef53999 100755
--- a/dumpgenerator.py
+++ b/dumpgenerator.py
@@ -509,7 +509,8 @@ def getUserAgent():
         # firefox
         #'Mozilla/5.0 (X11; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0',
         #'Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0',
-        'Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 Firefox/78.0'
+        #'Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 Firefox/78.0'
+        'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:105.0) Gecko/20100101 Firefox/105.0'
     ]
     return useragents[0]

nemobis commented 2 years ago

Makes sense. Thanks! It's unfortunate that so many wikis are behind Cloudflare, but it can't be ignored. I wonder if we can identify such cases and print an informative error with instructions.
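
One possible shape for such a check is sketched below. It is a heuristic, not existing dumpgenerator.py code: the function name `looks_like_cloudflare_challenge` and the message text are hypothetical, and both signals (a `cf-mitigated: challenge` response header, or a 403/503 status with `Server: cloudflare`) are assumptions about how a challenge response looks. The second signal can also match an ordinary 403 from a site proxied through Cloudflare, so the hint should be worded as a suggestion.

```python
def looks_like_cloudflare_challenge(status_code, headers):
    """Heuristically detect a Cloudflare challenge response (hypothetical)."""
    # Header names are case-insensitive, so normalise before comparing.
    lowered = {key.lower(): value.lower() for key, value in headers.items()}
    # Assumed signal 1: an explicit challenge marker header.
    if lowered.get("cf-mitigated") == "challenge":
        return True
    # Assumed signal 2: a blocking status served by Cloudflare itself.
    return status_code in (403, 503) and lowered.get("server") == "cloudflare"


# Hypothetical message; the steps mirror the workaround described above.
CLOUDFLARE_HINT = (
    "This wiki may be behind a Cloudflare challenge. Complete the challenge "
    "in a browser, save the cf_clearance cookie to a Netscape-format "
    "cookies.txt, and rerun with --cookies cookies.txt."
)

if looks_like_cloudflare_challenge(403, {"Server": "cloudflare"}):
    print(CLOUDFLARE_HINT)
```

With requests, the function could be called as `looks_like_cloudflare_challenge(r.status_code, r.headers)` on a failed response `r`.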