WikiTeam / wikiteam

Tools for downloading and preserving wikis. We archive wikis, from Wikipedia to the tiniest wikis. As of 2023, WikiTeam has preserved more than 350,000 wikis.
https://github.com/WikiTeam
GNU General Public License v3.0

Add --insecure option #437

Open Pokechu22 opened 1 year ago

Pokechu22 commented 1 year ago

This option is needed for https://wiki.education.minecraft.net/api.php, which has an expired certificate. (Routing on that wiki is also broken: https://wiki.education.minecraft.net/Main_Page fails, but https://wiki.education.minecraft.net/index.php?title=Main_Page works.) With this change, it's now possible to run python2 -u dumpgenerator.py --xml --images --api https://wiki.education.minecraft.net/api.php --insecure.
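
For readers unfamiliar with what such a flag does, here is a minimal sketch of how an --insecure option could be wired to TLS verification. It is not the code from this pull request: it uses only the standard library (argparse and ssl) so that it stays self-contained, and build_parser and make_ssl_context are hypothetical names chosen for illustration.

```python
import argparse
import ssl


def build_parser():
    """Parse a small, hypothetical subset of dumpgenerator.py's options."""
    parser = argparse.ArgumentParser(description="Sketch of an --insecure flag")
    parser.add_argument("--api", required=True, help="URL of the wiki's api.php")
    parser.add_argument(
        "--insecure",
        action="store_true",
        help="skip TLS certificate verification (only for known-broken wikis)",
    )
    return parser


def make_ssl_context(insecure=False):
    """Return a TLS context; verification stays on unless insecure is True."""
    context = ssl.create_default_context()
    if insecure:
        # check_hostname must be disabled before verify_mode can be CERT_NONE.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context
```

The resulting context could then be passed to something like urllib.request.urlopen(url, context=context). Keeping verification as the default and making the insecure path opt-in means an expired certificate still fails loudly unless the operator asks otherwise.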

I also fixed a bug where the "XML export on this wiki is broken, quitting." message wasn't written to errors.log. This can be seen via python2 -u dumpgenerator.py --xml --images --api https://ja.rodovid.org/api.php --retries 1.
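
The pattern behind that fix can be sketched roughly as follows: write the fatal message to errors.log before quitting, so the log explains why the dump stopped. This is an illustrative sketch only. log_error and quit_on_broken_export are hypothetical helpers, and the timestamp format is an assumption, not necessarily what dumpgenerator.py writes.

```python
import os
import time


def log_error(dump_dir, text):
    """Append a timestamped message to errors.log inside dump_dir."""
    timestamp = time.strftime("%Y-%m-%d %H:%M:%S")
    path = os.path.join(dump_dir, "errors.log")
    with open(path, "a", encoding="utf-8") as log_file:
        log_file.write("%s: %s\n" % (timestamp, text))
    return path


def quit_on_broken_export(dump_dir):
    """Record the fatal condition first, then report it and signal failure."""
    message = "XML export on this wiki is broken, quitting."
    log_error(dump_dir, message)  # write to the log before exiting
    print(message)
    return 1  # exit status a caller could pass to sys.exit()
```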

nemobis commented 1 year ago

I understand this is sometimes necessary but I'm not a huge fan, especially for dumps made for upload to archive.org. How do you use this? I forgot, isn't there an environment variable allowing people to do this anyway?

Pokechu22 commented 1 year ago

I think the message in errors.log is enough to cover integrity concerns, as it makes it obvious that this option was in use. Also, wikis with a broken certificate are particularly at risk, and thus important to save, IMO.

> How do you use this?

You just add --insecure to the arguments you pass to dumpgenerator.py. I didn't expose it in any of the frontend tools (as I personally don't use them).

> I forgot, isn't there an environment variable allowing people to do this anyway?

I think openssl(?) has some environment variables it uses, but those are more for supported TLS versions than for certificate validation. I looked into that in the past as well, but I believe I never got it to work properly.

nemobis commented 1 year ago

I meant, how do you personally use this? Was it for specific domains you archived?

If it's too difficult to do with environment variables, I guess it's OK to provide the option. In the urllib3 docs I only find this:

> Finally, you can suppress the warnings at the interpreter level by setting the PYTHONWARNINGS environment variable or by using the -W flag.

https://urllib3.readthedocs.io/en/stable/advanced-usage.html#tls-warnings
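
Note that the urllib3 passage is about silencing warnings, not about turning verification off. As a rough illustration using only the standard library warnings module, the sketch below shows that filtering a warning category hides the message while the underlying call behaves the same. InsecureRequestWarningStandIn and fetch_without_verification are hypothetical stand-ins; the real category is urllib3.exceptions.InsecureRequestWarning.

```python
import warnings


class InsecureRequestWarningStandIn(Warning):
    """Hypothetical stand-in for urllib3.exceptions.InsecureRequestWarning."""


def fetch_without_verification():
    """Stand-in for an HTTPS request made with verification already disabled."""
    warnings.warn("Unverified HTTPS request", InsecureRequestWarningStandIn)
    return "response body"


def count_warnings(suppress):
    """Run the fetch and return how many warnings were actually emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        if suppress:
            # Same effect as a matching PYTHONWARNINGS or -W "ignore" filter:
            # the message disappears, the request itself is unchanged.
            warnings.simplefilter("ignore", InsecureRequestWarningStandIn)
        fetch_without_verification()
    return len(caught)
```

In other words, a warning filter and a verification switch are separate controls, which is consistent with needing an explicit option for the latter.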

Pokechu22 commented 1 year ago

Yeah, I only use it for specific domains that I confirmed ahead of time had content but also had an expired certificate; it's not something I have enabled all the time. https://wiki.education.minecraft.net/api.php is the main example, though there have been a few others.