WikiTeam / wikiteam

Tools for downloading and preserving wikis. We archive wikis, from Wikipedia to the tiniest wikis. As of 2024, WikiTeam has preserved more than 600,000 wikis.
https://github.com/WikiTeam
GNU General Public License v3.0

Dumpgenerator rewrite: Use pywikibot (pywikipedia) #195

Open PiRSquared17 opened 9 years ago

PiRSquared17 commented 9 years ago

Nemo has suggested porting the existing code to the pywikipediabot framework.

emijrp commented 9 years ago

Which functions of pywikipediabot are needed? I prefer to keep the list of dependencies to a minimum.

nemobis commented 9 years ago

Emilio J. Rodríguez-Posada, 29/09/2014 09:41:

Which functions of pywikipediabot are needed? I prefer to keep the list of dependencies to a minimum.

Getting API entry point, page lists and XML could all be delegated to PWB.

This would be the implementation of the rewrite plan: https://meta.wikimedia.org/wiki/WikiTeam/Dumpgenerator_rewrite

emijrp commented 9 years ago

Our API entry point is pretty simple with the requests module and works fine, right? The page lists may fail when scraping HTML, but they work well through the API. As for the XML, we have memory issues with very large histories, but do we know whether pywikibot handles this OK?
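For readers following along, here is a minimal sketch of what "page lists through the API" can look like. It is not code from dumpgenerator or pywikibot. It assumes a MediaWiki recent enough to return the `continue` block for `list=allpages`; older releases return `query-continue` instead, which this sketch does not handle. The names `fetch_json` and `iter_page_titles` are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def fetch_json(api_url, params):
    """Default fetcher: GET api_url with params and decode the JSON body."""
    query = urllib.parse.urlencode(params)
    with urllib.request.urlopen(api_url + "?" + query) as resp:
        return json.loads(resp.read().decode("utf-8"))


def iter_page_titles(api_url, namespace=0, fetch=fetch_json):
    """Yield every page title in one namespace, following API continuation."""
    params = {
        "action": "query",
        "list": "allpages",
        "apnamespace": namespace,
        "aplimit": "max",
        "format": "json",
        "continue": "",  # ask for the newer continuation format
    }
    while True:
        data = fetch(api_url, params)
        for page in data["query"]["allpages"]:
            yield page["title"]
        if "continue" not in data:
            break  # last batch reached
        # Merge the continuation tokens into the next request.
        params.update(data["continue"])
```

Because `fetch` is injectable, the same loop could be driven by `requests` (for example a function returning `requests.get(api_url, params=params).json()`) without changing the pagination logic.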

I mean, we can use or copy some of their modules and functions. But I don't think adding the whole framework (which contains dozens of scripts and directories) as a dependency is needed.

Before making any move, I would like to see examples where WikiTeam fails and Pywikibot rocks.

Pywikibot has a great community of skilled coders. We can ask them for help fixing some of our bugs while we maintain our independence.

nemobis commented 9 years ago

Emilio J. Rodríguez-Posada, 29/09/2014 10:08:

Our API entry point is pretty simple with the requests module and works fine, right?

Dunno. There's also the screenscraping part which is a bunch of regex hacks. Same for entry point extraction, already handled by pwb https://gerrit.wikimedia.org/r/160207
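As an illustration of entry point extraction without ad hoc regexes, here is a minimal sketch. It is not the pywikibot change linked above, nor dumpgenerator's code. It assumes the wiki advertises its API through the `<link rel="EditURI" ...>` tag that MediaWiki typically emits in the page head; `find_api_url` is a hypothetical name.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, urlunsplit


class _EditURIFinder(HTMLParser):
    """Remember the href of the first <link rel="EditURI"> tag seen."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.href is not None:
            return
        attr = dict(attrs)
        if (attr.get("rel") or "").lower() == "edituri" and attr.get("href"):
            self.href = attr["href"]


def find_api_url(page_url, html):
    """Return the api.php URL advertised by the page, or None if absent."""
    finder = _EditURIFinder()
    finder.feed(html)
    if finder.href is None:
        return None
    # urljoin resolves protocol-relative (//host/...) and relative hrefs.
    absolute = urljoin(page_url, finder.href)
    parts = urlsplit(absolute)
    # Drop the "?action=rsd" query so only the entry point remains.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
```

A wiki with the API disabled, or a heavily customised skin, may not emit the tag at all, so a caller would still need a fallback when this returns `None`.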

The page lists may fail when scraping HTML, but they work well through the API. As for the XML, we have memory issues with very large histories, but do we know whether pywikibot handles this OK?

You could test https://gerrit.wikimedia.org/r/#/c/136352/

I mean, we can use/copy some of their modules/functions.

Forking PWB is not an option.

But I don't think adding the whole framework (which contains dozens of scripts and directories) as a dependency is needed.

Before making any move, I would like to see examples where WikiTeam fails and Pywikibot rocks.

That's what the rewrite branch is for. :)

Pywikibot has a great community of skilled coders. We can ask them for help fixing some of our bugs while we maintain our independence.

Of course a partnership needs to have benefit for both sides.

jayvdb commented 9 years ago

The core pwb library (v2.0) doesn't have a lot of dependencies. In fact, it has only one: httplib2. We are in the process of completing and packaging pwb v2.0.

And that is the primary problem, I believe. wikiteam has moved to requests, while pwb uses httplib2. I think we could solve that by either a) improving pwb to support requests, or b) lots of testing of wikiteam/requests and a 'pwb lite with only httplib2 dependency' package.

jayvdb commented 9 years ago

What is the minimum version of Python that wikiteam wants to support? I see dumpgenerator tries to support py2.4 on line 35, "from md5 import new as md5", to provide fixed maximum length filenames (I think).

nemobis commented 9 years ago

John Vandenberg, 29/09/2014 14:35:

What is the minimum version of Python that wikiteam /wants/ to support? I see dumpgenerator tries to support py2.4 on line 35, "from md5 import new as md5", to provide fixed maximum length filenames (I think).

I think 2.6+ is enough now; that backward-compatibility code was from earlier on. Nowadays most compatibility complaints we get are from Python 3 users and similar.
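For context, the standalone `md5` module was superseded by `hashlib` in Python 2.5, so the compatibility import is unnecessary once 2.4 is dropped. Below is a minimal sketch of capping filename length with a hash suffix. It is an illustration, not dumpgenerator's actual function; the name `truncate_filename` and the default limit of 100 are assumptions.

```python
import hashlib
import os


def truncate_filename(filename, limit=100):
    """Return filename unchanged if short enough, else a capped unique name.

    Long names are cut and given an MD5 suffix so that two different long
    names sharing a prefix do not collide. MD5 is used for naming only,
    not for security.
    """
    if len(filename) <= limit:
        return filename
    ext = os.path.splitext(filename)[1]  # e.g. ".png", or "" if none
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()  # 32 hex chars
    suffix = digest + ext
    keep = limit - len(suffix)
    if keep < 0:
        raise ValueError("limit too small for hash suffix and extension")
    return filename[:keep] + suffix
```

The result is always at most `limit` characters and keeps the original extension, which matters for image files saved alongside the XML.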

jayvdb commented 9 years ago

pywikibot works on python 3 ;-) with 500+ tests https://travis-ci.org/wikimedia/pywikibot-core

nemobis commented 9 years ago

Thanks, John, for https://meta.wikimedia.org/w/index.php?title=WikiTeam%2FDumpgenerator_rewrite&diff=10039513&oldid=8892313. emijrp, the last bullet changed is already one feature we'd gain.

PiRSquared17 commented 9 years ago

@jayvdb Does pwb support old versions of MediaWiki (e.g. MW 1.9)?

nemobis commented 9 years ago

As for the XML, we have memory issues with very large histories, but do we know whether pywikibot handles this OK?

I had forgotten it but the rewrite page says "if the api call uses api.CachedRequest, it will write to the disk." So that's another bug it's supposed to fix.
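Independently of pywikibot, one generic way to keep a large XML response out of memory is to copy the HTTP body to disk in fixed-size chunks. The sketch below shows that technique only. It is not how `api.CachedRequest` works, and `download_to_file` and its parameters are hypothetical names.

```python
import shutil
import urllib.parse
import urllib.request


def download_to_file(url, post_fields, out_path, chunk_size=64 * 1024,
                     opener=urllib.request.urlopen):
    """POST post_fields to url and stream the response body to out_path.

    Only chunk_size bytes are held in memory at a time, regardless of how
    large the response is.
    """
    data = urllib.parse.urlencode(post_fields).encode("utf-8")
    with opener(url, data) as resp, open(out_path, "wb") as out:
        shutil.copyfileobj(resp, out, chunk_size)
```

Streaming only helps if the later processing is also incremental; parsing the saved file into a full tree afterwards would bring the memory problem back.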

nemobis commented 4 years ago

Six years have passed, and the newer versions of MediaWiki may now be more common out there than the ancient ones we focused on so much. It's possible that nowadays we can freeze the features of the old index.php scraping and rely on a library like mwclient for the newer versions. Time will tell.
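To make the idea concrete, here is a minimal sketch of choosing a backend from the version string a wiki reports in its `generator` field (for example "MediaWiki 1.35.0"). The cutoff `API_MIN_VERSION` and the backend labels are hypothetical placeholders, not values agreed in this thread.

```python
import re

# Hypothetical cutoff: wikis older than this keep using index.php scraping.
API_MIN_VERSION = (1, 20)


def parse_generator(generator):
    """Turn 'MediaWiki 1.35.0' into (1, 35); return None if unrecognised."""
    match = re.match(r"MediaWiki (\d+)\.(\d+)", generator)
    if not match:
        return None
    return (int(match.group(1)), int(match.group(2)))


def choose_backend(generator):
    """Pick the frozen scraper for old or unknown wikis, the API otherwise."""
    version = parse_generator(generator)
    if version is None or version < API_MIN_VERSION:
        return "index.php-scraper"
    return "api-client"
```

Falling back to the scraper when the version is unknown keeps the conservative path as the default, which suits an archiving tool that must not skip wikis it cannot identify.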