WikiTeam / wikiteam

Tools for downloading and preserving wikis. We archive wikis, from Wikipedia to tiniest wikis. As of 2023, WikiTeam has preserved more than 350,000 wikis.
https://github.com/WikiTeam
GNU General Public License v3.0

Fix infinite loop on page title scraper #439

Closed Pokechu22 closed 1 year ago

Pokechu22 commented 1 year ago

This affected OSDev Wiki (which lacks an api.php). After scraping the 4 normal pages of Special:AllPages, the title scraper would attempt `What_do_I_need_to_know_about_SMM-Zig_Bare_Bones" title="This is a special page, you cannot edit the page itself` and then `What_do_I_need_to_know_about_SMM-Zig_Bare_Bones%22++title%3D%22This+is+a+special+page%2C+you+cannot+edit+the+page+itself" title="This is a special page, you cannot edit the page itself`, repeating in an infinite loop. These URLs came from the "Special page" tab, whose link carries that title text. The fix is to make the regex stop at quotes, so it no longer matches that link. Quotes in actual URLs are URL-encoded as %22, so the change does not interfere with those (see enwiki as an example).
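To illustrate the failure mode, here is a minimal sketch in Python. The two patterns, the sample HTML, and the helper `extract_from` are simplified, hypothetical stand-ins and are not copied from the scraper. The only difference between the patterns is whether the character class for the captured value excludes the double quote.

```python
import re

# Hypothetical, simplified patterns (assumptions, not the scraper's actual regexes).
# LOOSE lets the captured value run across a closing quote into the next attribute.
LOOSE = re.compile(r'&amp;from=(?P<from>[^>]+)">')
# STRICT stops the captured value at the first double quote.
STRICT = re.compile(r'&amp;from=(?P<from>[^>"]+)">')

# Made-up link resembling the "Special page" tab: href is followed by a title attribute.
TAB_LINK = (
    '<a href="/index.php?title=Special:AllPages&amp;from=Foo" '
    'title="This is a special page, you cannot edit the page itself">Special page</a>'
)

# Made-up pagination link whose target contains a quote, URL-encoded as %22.
ENCODED_LINK = '<a href="/index.php?title=Special:AllPages&amp;from=Bar%22Baz">next</a>'


def extract_from(pattern, html):
    """Return the captured 'from' value, or None if the pattern does not match."""
    match = pattern.search(html)
    return match.group("from") if match else None


if __name__ == "__main__":
    for name, pattern in (("loose", LOOSE), ("strict", STRICT)):
        print(name, "tab link:", extract_from(pattern, TAB_LINK))
        print(name, "encoded link:", extract_from(pattern, ENCODED_LINK))
```

In this sketch the loose pattern swallows the title attribute from the tab link, while the strict pattern does not match that link at all. Both patterns still capture the %22-encoded value from the pagination link.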

I also translated a comment from Spanish to English and fixed a typo in another comment. I don't know anything about the issue the Spanish comment is referring to, though.