chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io
Other
2.22k stars 250 forks source link

new feature allowing accessing records/text from Wikipedia other than namespace 0 #220

Closed ckot closed 5 years ago

ckot commented 5 years ago

Description

Motivation and Context

I wanted to extract the categories associated with category pages (namespace 14 - in Wikinews at least), since a category's categories are it's parent categories, and I'm interested in the category structure.

How Has This Been Tested?

I've merely been using it since I've made my changes and know that I didn't break anything regarding it's use of the default namespace '0' and that it works for namespace '14' as well. Without parsing the head of the raw document, I don't know how to verify whether the namespace value is valid, or what will happen if an invalid value is passed in, although I believe one will simply get a whole lot of nothing returned

Screenshots (if appropriate):

Types of changes

Checklist:

bdewilde commented 5 years ago

Hi @ckot , thanks for another clean and clear PR. This is a bit more niche than the last, and I'm wondering if there's a practical use case for being able to switch namespaces for each method call. Given that most people will always want namespace="0", adding it as a kwarg feels a bit iffy. Would it be inconvenient for your usage if it was added as a kwarg to __init__() instead?

ckot commented 5 years ago

moving the namespace param to __init__() isn't an issue to me. It simply means I'd need to create different Wikpedia objects to extract records from different namespaces from the same dump file. Should I push out a commit which makes this change?

bdewilde commented 5 years ago

Yeah, that would be great. Should be straightforward:

def __init__(self, data_dir=DATA_DIR, lang='en', version='latest', namespace="0"):
    ...
    self.namespace = namespace

and

def __iter__(self):
    ...
    if ns != self.namespace:
    ...

Then just need to update the class docs (Args and Attributes) and remove the bits added in texts() and records(). Thank you!

ckot commented 5 years ago

ok. great! this feature, while useful is really simple

bdewilde commented 5 years ago

Looks good! I might add a bit more detail to the relevant docs after merging, but just tweaks. Thanks for contributing. :+1:

ckot commented 5 years ago

Great! Hey I've been thinking about this, and I'm wondering what you think about this. I know this feature is 'out there' already, so this would potentially be a breaking change, but then again, I don't know if anyone has actually used it yet. What about either changing the namespace param to namespaces (iterable) or keeping it as namespace but allow either a string or iterable. I suppose the default value would need to be 'None' but that would get set to '0' if None. Then self._namespaces could be a set and in __iter__ the test would be if ns not in self._namespaces and records() could add a namespace field to the page dict. That way you could extract multiple namespaces in a single iteration. Not a big issue with wikinews but when testing out my code I discovered that it took quite a while to get to the first record in namespace 14. Just an idea.