Closed ckot closed 5 years ago
Hi @ckot , thanks for another clean and clear PR. This is a bit more niche than the last, and I'm wondering if there's a practical use case for being able to switch namespaces for each method call. Given that most people will always want namespace="0"
, adding it as a kwarg feels a bit iffy. Would it be inconvenient for your usage if it was added as a kwarg to __init__()
instead?
moving the namespace param to __init__()
isn't an issue to me. It simply means I'd need to create different Wikpedia objects to extract records from different namespaces from the same dump file.
Should I push out a commit which makes this change?
Yeah, that would be great. Should be straightforward:
def __init__(self, data_dir=DATA_DIR, lang='en', version='latest', namespace="0"):
...
self.namespace = namespace
and
def __iter__(self):
...
if ns != self.namespace:
...
Then just need to update the class docs (Args and Attributes) and remove the bits added in texts()
and records()
. Thank you!
ok. great! this feature, while useful is really simple
Looks good! I might add a bit more detail to the relevant docs after merging, but just tweaks. Thanks for contributing. :+1:
Great! Hey I've been thinking about this, and I'm wondering what you think about this. I know this feature is 'out there' already, so this would potentially be a breaking change, but then again, I don't know if anyone has actually used it yet. What about either changing the namespace
param to namespaces
(iterable) or keeping it as namespace
but allow either a string or iterable. I suppose the default value would need to be 'None' but that would get set to '0' if None. Then self._namespaces
could be a set and in __iter__
the test would be if ns not in self._namespaces
and records()
could add a namespace
field to the page dict. That way you could extract multiple namespaces in a single iteration. Not a big issue with wikinews but when testing out my code I discovered that it took quite a while to get to the first record in namespace 14. Just an idea.
Description
namespace
param (default value '0') toWikipedia.records()
andWikipedia.text()
self._extract_namespace
which gets set by each call torecords()
andtext()
__iter__()
fromif ns != '0':
toif ns != self._extract_namespace:
records()
andtext()
, both describing the new parameter, and in the 'Notes' section of the docstrings, described how to obtain namespace values from a dump fileMotivation and Context
I wanted to extract the categories associated with category pages (namespace 14 - in Wikinews at least), since a category's categories are it's parent categories, and I'm interested in the category structure.
How Has This Been Tested?
I've merely been using it since I've made my changes and know that I didn't break anything regarding it's use of the default namespace '0' and that it works for namespace '14' as well. Without parsing the head of the raw document, I don't know how to verify whether the namespace value is valid, or what will happen if an invalid value is passed in, although I believe one will simply get a whole lot of nothing returned
Screenshots (if appropriate):
Types of changes
Checklist: