WikiTeam / wikiteam

Tools for downloading and preserving wikis. We archive wikis, from Wikipedia to the tiniest wikis. As of 2024, WikiTeam has preserved more than 600,000 wikis.
https://github.com/WikiTeam
GNU General Public License v3.0
730 stars 151 forks

No longer strip <sha1> tags #473

Closed yzqzss closed 1 year ago

yzqzss commented 1 year ago

It's 2023, and wikia isn't wikia anymore.
Without sha1, revisions from a wikidump cannot be deduplicated on import.

nemobis commented 1 year ago

So are you saying we never enter this "else" any more?

makoshark commented 1 year ago

The problem was that Wikia was adding <sha1/> tags under pages, not revisions, in the XML.

Whether Wikia is still Wikia, and the fact that the code was added a long time ago, are not really relevant. The question is whether (a) the code's presence is causing problems and/or (b) this ever happens (at Fandom/Wikia or elsewhere). If Fandom/Wikia is no longer exporting these invalid SHA1s, I don't really object to removing it, since I haven't seen it anywhere else.

That said, I don't really see what the harm is in keeping it. If we ever see a SHA1 outside of a <revision>, we really do want to strip it!
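
To make the distinction concrete, here is a minimal sketch (not the actual dumpgenerator code) of stripping only the <sha1> elements that sit directly under <page> while leaving the ones inside <revision> alone. The function name `strip_misplaced_sha1` and the sample XML are hypothetical, and the sample omits the XML namespace that real MediaWiki exports declare, so real input would need namespace-aware lookups.

```python
import xml.etree.ElementTree as ET


def strip_misplaced_sha1(page_xml: str) -> str:
    """Drop <sha1> elements that are direct children of <page>.

    Hypothetical helper for illustration; assumes a single <page>
    element with no XML namespace.
    """
    page = ET.fromstring(page_xml)
    # findall("sha1") only matches direct children of <page>, so
    # <sha1> elements nested inside <revision> are never selected.
    for misplaced in page.findall("sha1"):
        page.remove(misplaced)
    return ET.tostring(page, encoding="unicode")


# Made-up sample: one valid revision-level sha1, one stray page-level tag.
sample = (
    "<page><title>Example</title>"
    "<revision><id>1</id><sha1>abc</sha1></revision>"
    "<sha1/>"
    "</page>"
)
cleaned = strip_misplaced_sha1(sample)
print(cleaned)
```

With a filter scoped like this, revision-level hashes survive, which is the part that matters for deduplication, and only the invalid page-level tags are dropped.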

yzqzss commented 1 year ago

:(