Closed rodricios closed 9 years ago
So @pruner
is something like
def pruner(fn):
def quantifier(nodes):
for node in nodes:
yield fn(node)
return quanitifer
Maybe we should rename @pruner
to something else like @metric
?
Sorry I'm on mobile so I can't edit my comments. But IMHO I think we shouldn't have something like this from a performance viewpoint for example some metrics may require some state to be held throughout the loop.
Edit: changed my mind, something like this is good but only for certain use cases where all nodes are returned. Else, we would have to add a "filter" function after or before the pruner.
Sorry the late response! Went for dinner. Will read thru and respond.
So pruner
is this:
from inspect import getargspec
def pruner(func):
def decorator(etree):
_, _, _, defaults = getargspec(func)
...
selector = defaults[0]
for node in etree.xpath(selector):
yield func(node)
return decorator
where func
returns a node, numeric or Counter
That's impressive magic but once again I prefer not to have any magical functions in the library as it only serves to confuse people, especially with inspect
where you are having AngularJS-like magic. I prefer
def pruner(selector):
def decorator(func):
...
return decorator
So users can declare the selector explicitly and at "compile-time":
@pruner(selector)
def func(node):
return node, len(node)
I prefer not to have any magical functions in the library as it only serves to confuse people
I agree with you on the 'confuse people' bit, but does this approach hold any merit in terms of adding the ability to configure custom pipelines "metrics"?
but does this approach hold any merit in terms of adding the ability to configure custom pipelines "metrics"?
I'm confused, which approach? The magical one or the compile-time one? Also, what do you mean by adding the ability to configure custom pipelines "metrics"?
Actually, to be clear, I do think the inspect
module is a problem, but I wasn't able to find a "better" way to implement what I was trying to do. So I opted for the above approach to demo a prototype.
Any thoughts on how we could do something like
@pruner(selection="//text()")
def my_metric(node):
...
return node, numeric
Right, what I mean by "configuring custom pipeline metrics" is this:
while we iterate through nodes that are selected by our xpath, we run a 'metric' that measures some property of a node (ie. text length of subnodes, frequency of subnodes, etc.). Do we want/expect users to configure this portion of the extraction process?
For your first question:
def pruner(selector):
def decorator(fn):
def quantifier(etree):
for node in etree.xpath(selector):
yield fn(node)
return quantifier
return decorator
And yes, they should be able to configure every portion in theory. Our job is to provide the sensible defaults for the 80% use case, and a solid foundation and abstraction for the 20% to easily build their algorithms upon.
Beautiful! At least compared to what's currently implement :P
Now, there must be better approaches to this though no?
Not really, the other (arguably better) approach is to not have overly abstract functions but let the user handle the mapping of the quantifiers. Not only is this more performant, this also forces the user to understand what they're doing better.
I'm not sure I've understood what the actual problem is. From my current understanding it's that users should be able to insert their own steps into the pipeline. That would require them to traverse the tree and run a function on every matching node. Is this correct?
Concerning the current implementation, I'd really try to avoid using inspection of the decorated function if possible. @eugene-eeo's solution is the cleanest I can think of at the moment.
Yes, this issue suggests that a @prune
decorator should be used, to comply with an unenforced protocol (it may sound sassy but its not meant to be).
So sassy @eugene-eeo ;) Jk.
@msiemens to reiterate my point, what this branch proposes is a simplified workflow for a user.
Consider the current work required to add this particular, custom step and replace the get_node_length_pairs
(after the parse_html
, before the histogram
step):
He or she must implement two things - a pruner
1 and a quantifier
(which I'll combine into a single function):
def my_pruner(etree)
for node in etree.xpath("//text()"): # <- pruner
text = node.text
textlen = len(' '.join(text.split())) if text else 0 # <- quantifier
yield node, textlen
Now consider the following:
@pruner("//text()") # with eugene-eeo's suggestion
def my_pruner(node):
text = node.text
textlen = len(' '.join(text.split())) if text else 0 # <- quantifier
yield node, textlen
Another reason to consider this change is that the for node in etree.xpath(selector)
is repeated in get_node_counter_pairs
and theoretically, it's going to keep on repeating itself if we add more node, numerical or Counter
producers into libextract.quantifiers
.
1* Just in case there's confusion, I'm using the term "pruning" from this definition
Okay, in that light having a decorator totally makes sense. Personally I'd prefer @eugene-eeo's suggestion as it's less magic.
Regarding all the different technical terms used throughout the discussions, could we make something like a wiki page that explains them and maybe links to a more detailled definition?
I added @eugene-eeo's suggestion in 33471f261f29c92564553c27d1d6dacc646cb425. And yeah, having a wiki is a good idea :P
As an aside, I'm personally in favor of not using technical terms as they generally add barriers; with that said, is there a term rather than pruner
that fits this submodule more naturally?
I'd say rename the module to protocol.py
and rename the function to @selects
, and move the pruners (textlen, counters) to their previous modules (html.article, html.tabular)
I'm working on refactoring argmax methods from html.article
and html.tabular
.
I think, philosophically speaking, those methods should be outside of those submodules, and arguably html.article
and html.tabular
should only wrap the intermediate functions, not contain the definitions themselves.
Argmax in tabular could just use coretools.argmax
, yes I agree that they should live outside the module.
f79e0e925fd87eeea0390acdef4a2bb1a32288bf
Closing due to departure of architecture present during the activity of this issue; #33
Branch 268eeda562ed96d73931e3a4b8dea41a3a7a3847 demonstrates a possible implementation of a modular, user-defined functions that can be added to the pipeline; you can read about the larger discussion here #1.
The pruners.py submodule provides "pruning" functions.
BACKGROUND:
The extraction algorithms require HT/XML tree traversals.
Before we analyze the frequency distributions of some particular element, which I'll refer to as the "prediction phase", we must first prune for nodes (
lxml.html.HtmlElement
) and quantify some measurement (numerical
orcollections.Counter
).What this submodule provides is a decorator called
pruner
and some predefined pruners.The usecase is the following:
The user want's to measure "something"; he or she can either import our builtin's (
libextract.quantifiers
), or they can create their "quantifier" within a custom function, which they would then decorate with@pruner
.For example, if we were to not know that the
text_length
quantifier, nor that there existed thesubnode_textlen_pruner
pruner that does what they wanted already, would simply create our own, under the following protocols:cc @libextract/owners @libextract/contrib