WikiTeam / wikiteam

Tools for downloading and preserving wikis. We archive wikis, from Wikipedia to the tiniest wikis. As of 2024, WikiTeam has preserved more than 600,000 wikis.
https://github.com/WikiTeam
GNU General Public License v3.0

Select namespaces by name as well as number for dumpgenerator.py #393

Open jrbray1 opened 4 years ago

jrbray1 commented 4 years ago

I was hoping to back up a wiki with only the content (main), Category and Template namespaces, to reduce size, and to select the namespaces by keyword, something like:

dumpgenerator.py --api=https://hornblower.fandom.com/api.php --xml --curonly --namespaces 0,Template,Category

But it expects numbers, not names. I could hack something together by parsing the results of https://hornblower.fandom.com/api.php?action=query&meta=siteinfo&siprop=namespaces&formatversion=2, but it would seem easier if dumpgenerator did this for you. Have you considered doing this?
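For illustration, the parsing workaround mentioned above could look roughly like the following. This is only a sketch, not part of dumpgenerator.py: the function names `fetch_siteinfo` and `namespace_ids_by_name` and the `SAMPLE` data are made up for the example, `format=json` is added to the query on the assumption that JSON output is wanted, and the sample reply is hand-written in the shape a `formatversion=2` siteinfo response is expected to have.

```python
import json
import urllib.parse
import urllib.request


def fetch_siteinfo(api_url):
    """Fetch the namespace table from a MediaWiki api.php endpoint."""
    params = urllib.parse.urlencode({
        "action": "query",
        "meta": "siteinfo",
        "siprop": "namespaces",
        "formatversion": "2",
        "format": "json",  # assumption: not in the URL quoted above
    })
    with urllib.request.urlopen(f"{api_url}?{params}") as response:
        return json.load(response)


def namespace_ids_by_name(siteinfo):
    """Map each local and canonical namespace name (lowercased) to its ID."""
    ids = {}
    for ns in siteinfo["query"]["namespaces"].values():
        for key in ("name", "canonical"):
            name = ns.get(key)
            if name:  # the main namespace has an empty name and is skipped
                ids[name.lower()] = ns["id"]
    return ids


# Hand-written, abridged sample in the assumed shape of a reply
# from a French-language wiki (not real output from any site).
SAMPLE = {
    "query": {
        "namespaces": {
            "0": {"id": 0, "name": ""},
            "2": {"id": 2, "name": "Utilisateur", "canonical": "User"},
            "10": {"id": 10, "name": "Modèle", "canonical": "Template"},
            "14": {"id": 14, "name": "Catégorie", "canonical": "Category"},
        }
    }
}

ids = namespace_ids_by_name(SAMPLE)
print(ids["template"], ids["utilisateur"])
```

Against a live wiki, `namespace_ids_by_name(fetch_siteinfo("https://hornblower.fandom.com/api.php"))` would be the equivalent call; the numbers found this way can then be passed to `--namespaces` as usual.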

nemobis commented 4 years ago

jrbray1, 09/08/20 14:00:

it would seem easier if dumpgenerator did this for you. Have you considered doing this?

It might seem easier, but there are infinite possible namespace names, plus a dozen core ones each potentially translated into 400 languages. The name of the namespace makes sense only after we've contacted the API, or (even worse) screen-scraped index.php output. The results will be unavoidably unpredictable, which is going to confuse people even more unless they're already well-versed in MediaWiki internals.

In other words, it's not clear to me what kind of user would be served by such a feature. We'll consider it if someone sends a patch, though!

jrbray1 commented 4 years ago

Not sure why the variety of namespace names is a problem, as https://www.mediawiki.org/wiki/Help:Namespaces talks about canonical namespaces in English and their foreign mappings. You could just support those canonical names and allow requests for 'Template' and 'Category'. This seems more robust than the user providing 10,14 and expecting that mapping to be in place, but it would be just as easy with API parsing to allow a Frenchman to request --namespaces 0,Utilisateur, and not have to burrow into the API output to check what number that was.

MediaWiki documentation is all about names, not numbers.
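To make the "canonical names only" suggestion concrete, here is a rough sketch of how a mixed value such as 0,Template,Category could be turned into numbers without contacting the wiki at all. It is not dumpgenerator.py code: `resolve_namespaces` and `CANONICAL_NAMESPACES` are hypothetical names, and the table covers only the core MediaWiki namespaces 1 to 15. Localized names such as Utilisateur would still need the API lookup.

```python
# Canonical (English) names of the core MediaWiki namespaces, lowercased.
# The main namespace (0) has no name, so it is given as a number.
CANONICAL_NAMESPACES = {
    "talk": 1,
    "user": 2,
    "user talk": 3,
    "project": 4,
    "project talk": 5,
    "file": 6,
    "file talk": 7,
    "mediawiki": 8,
    "mediawiki talk": 9,
    "template": 10,
    "template talk": 11,
    "help": 12,
    "help talk": 13,
    "category": 14,
    "category talk": 15,
}


def resolve_namespaces(spec):
    """Turn a comma-separated mix of numbers and canonical names into IDs."""
    ids = []
    for token in spec.split(","):
        token = token.strip()
        if token.lstrip("-").isdigit():
            # Already a number: pass it through unchanged.
            ids.append(int(token))
            continue
        # Names are matched case-insensitively; "_" is treated as a space.
        key = token.lower().replace("_", " ")
        if key not in CANONICAL_NAMESPACES:
            raise ValueError(f"unknown namespace name: {token!r}")
        ids.append(CANONICAL_NAMESPACES[key])
    return ids


print(resolve_namespaces("0,Template,Category"))  # prints [0, 10, 14]
```

Raising an error on unknown names, rather than guessing, keeps the behaviour predictable: a custom or translated namespace fails loudly instead of silently selecting the wrong pages.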