chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

New feature allowing working with Wikinews media dumps #219

Closed ckot closed 5 years ago

ckot commented 5 years ago

I added a named parameter, project, with default value 'wiki', to the Wikipedia constructor. It lets callers specify a Wikimedia project other than Wikipedia. This PR adds support only for 'wikinews'.

Motivation and Context

I wanted to work with Wikinews data dumps. Other packages can extract the page text, but this project already performs category extraction, which is what I needed. Adding a small feature here seemed a better option than adding a more complicated one to another project.

ckot commented 5 years ago

I only added support for Wikinews because I wasn't sure my modifications to file_stub were robust enough to handle other Wikimedia projects. If they aren't, a _gen_file_stub() helper that does project-specific branching to compute file_stub could be added.

bdewilde commented 5 years ago

Thanks for the updates — this looks good to me!

ckot commented 5 years ago

Great! Thanks!