Closed ckot closed 5 years ago
I only added support for Wikinews because I wasn't sure my modifications to file_stub were robust enough to handle other Wikimedia projects. If they aren't, a _gen_file_stub() helper that does project-specific branching to compute file_stub could be added.
Thanks for the updates — this looks good to me!
Great! Thanks!
I added a project named parameter with default value 'wiki' to the Wikipedia constructor, which allows one to specify a Wikimedia project other than Wikipedia. This PR only adds support for 'wikinews'.
Description
The file_stub format is now '{lang}{project}/{version}/{lang}{project}-{version}-pages-articles.xml.bz2'
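As a rough illustration of the format above, here is a minimal sketch of how file_stub could be built from lang, project, and version. The helper name build_file_stub, the SUPPORTED_PROJECTS tuple, and the 'latest' version string are assumptions made for this example, not the code in this PR.

```python
# Hypothetical sketch: only the `project` parameter, its 'wiki' default,
# and the format string come from the PR description.
SUPPORTED_PROJECTS = ("wiki", "wikinews")  # assumed: the two handled values

FILE_STUB_TEMPLATE = (
    "{lang}{project}/{version}/{lang}{project}-{version}-pages-articles.xml.bz2"
)


def build_file_stub(lang: str, version: str, project: str = "wiki") -> str:
    """Return the relative dump path for one language, project, and version."""
    if project not in SUPPORTED_PROJECTS:
        # Reject projects whose dump layout has not been checked.
        raise ValueError(
            f"unsupported project {project!r}; expected one of {SUPPORTED_PROJECTS}"
        )
    return FILE_STUB_TEMPLATE.format(lang=lang, project=project, version=version)


print(build_file_stub("en", "latest"))
print(build_file_stub("en", "latest", project="wikinews"))
```

With the default project the path keeps the existing Wikipedia layout, so callers that never pass project should be unaffected.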
Motivation and Context
I wanted to work with Wikinews data dumps. Other packages can extract the page text, but this project already performs category extraction, which is what I needed. Adding a small feature here seemed a better option than adding a more complicated one to another project.
How Has This Been Tested?
Checked that the file_stub value was still sane, and that my exception handling for unhandled values of the new project named parameter worked.
Considered adding _verify_url(url) and calling it in download() just prior to the actual download of the URL. It would simply do a HEAD request, catch a 404, and then fall back to checking whether BASE_URL/{lang}{project} exists, to determine which error message to give when a MediaWiki project doesn't exist for a particular language. I figured I should talk with someone about this first, though. This would provide a function that could be added to a unit test.