datalib / libextract

Extract data from websites using basic statistical magic
MIT License
503 stars 45 forks source link

Modular approach to "pruning"; refactoring get_pairs's traversing and quantifying logic #16

Closed rodricios closed 9 years ago

rodricios commented 9 years ago

Branch 268eeda562ed96d73931e3a4b8dea41a3a7a3847 demonstrates a possible implementation of a modular, user-defined functions that can be added to the pipeline; you can read about the larger discussion here #1.

The pruners.py submodule provides "pruning" functions.

BACKGROUND:

The extraction algorithms require HT/XML tree traversals.

Before we analyze the frequency distributions of some particular element, which I'll refer to as the "prediction phase", we must first prune for nodes (lxml.html.HtmlElement) and quantify some measurement (numerical or collections.Counter).

What this submodule provides is a decorator called pruner and some predefined pruners.

The usecase is the following:

The user want's to measure "something"; he or she can either import our builtin's (libextract.quantifiers), or they can create their "quantifier" within a custom function, which they would then decorate with @pruner.

For example, if we were to not know that the text_length quantifier, nor that there existed the subnode_textlen_pruner pruner that does what they wanted already, would simply create our own, under the following protocols:

from requests import get
from libextract import extract, pruners
from libextract.html import parse_html
from libextract.html.article import get_node, get_text
from libextract.coretools import histogram, argmax

r = get('http://en.wikipedia.org/wiki/Classifier_(linguistics)')

#custom xpath
xpath  = '//*[not(self::script or self::style)]/\
                     text()[normalize-space()]/..'

# INPUTS
# "node" must declared, selector must be given as keyword argument
# user must assume it is an lxml.html.HtmlElement object
@pruner
def my_pruner(node, selector=xpath):
    text = node.text
    textlen = len(' '.join(text.split())) if text else 0
    #print(node,textlen)
    return node.getparent(), textlen
# OUTPUTS
# lxml.html.HtmlElement, numerical or collections.Counter

#add my_pruner to the pipeline
strat = (parse_html, my_pruner, histogram, argmax, get_node, get_text)
text = extract(r.content, strategy=strat)
print(text)

cc @libextract/owners @libextract/contrib

eugene-eeo commented 9 years ago

So @pruner is something like

def pruner(fn):
    def quantifier(nodes):
        for node in nodes:
            yield fn(node)
    return quanitifer

Maybe we should rename @pruner to something else like @metric?

eugene-eeo commented 9 years ago

Sorry I'm on mobile so I can't edit my comments. But IMHO I think we shouldn't have something like this from a performance viewpoint for example some metrics may require some state to be held throughout the loop.

Edit: changed my mind, something like this is good but only for certain use cases where all nodes are returned. Else, we would have to add a "filter" function after or before the pruner.

rodricios commented 9 years ago

Sorry the late response! Went for dinner. Will read thru and respond.

rodricios commented 9 years ago

So pruner is this:

from inspect import getargspec
def pruner(func):
    def decorator(etree):
        _, _, _, defaults = getargspec(func)
        ...
        selector = defaults[0]
        for node in etree.xpath(selector):
            yield func(node)
    return decorator

where func returns a node, numeric or Counter

eugene-eeo commented 9 years ago

That's impressive magic but once again I prefer not to have any magical functions in the library as it only serves to confuse people, especially with inspect where you are having AngularJS-like magic. I prefer

def pruner(selector):
    def decorator(func):
        ...
    return decorator

So users can declare the selector explicitly and at "compile-time":

@pruner(selector)
def func(node):
    return node, len(node)
rodricios commented 9 years ago

I prefer not to have any magical functions in the library as it only serves to confuse people

I agree with you on the 'confuse people' bit, but does this approach hold any merit in terms of adding the ability to configure custom pipelines "metrics"?

eugene-eeo commented 9 years ago

but does this approach hold any merit in terms of adding the ability to configure custom pipelines "metrics"?

I'm confused, which approach? The magical one or the compile-time one? Also, what do you mean by adding the ability to configure custom pipelines "metrics"?

rodricios commented 9 years ago

Actually, to be clear, I do think the inspect module is a problem, but I wasn't able to find a "better" way to implement what I was trying to do. So I opted for the above approach to demo a prototype.

Any thoughts on how we could do something like

@pruner(selection="//text()")
def my_metric(node):
    ...
    return node, numeric
rodricios commented 9 years ago

Right, what I mean by "configuring custom pipeline metrics" is this:

while we iterate through nodes that are selected by our xpath, we run a 'metric' that measures some property of a node (ie. text length of subnodes, frequency of subnodes, etc.). Do we want/expect users to configure this portion of the extraction process?

eugene-eeo commented 9 years ago

For your first question:

def pruner(selector):
    def decorator(fn):
        def quantifier(etree):
            for node in etree.xpath(selector):
                yield fn(node)
        return quantifier
    return decorator

And yes, they should be able to configure every portion in theory. Our job is to provide the sensible defaults for the 80% use case, and a solid foundation and abstraction for the 20% to easily build their algorithms upon.

rodricios commented 9 years ago

Beautiful! At least compared to what's currently implement :P

Now, there must be better approaches to this though no?

eugene-eeo commented 9 years ago

Not really, the other (arguably better) approach is to not have overly abstract functions but let the user handle the mapping of the quantifiers. Not only is this more performant, this also forces the user to understand what they're doing better.

msiemens commented 9 years ago

I'm not sure I've understood what the actual problem is. From my current understanding it's that users should be able to insert their own steps into the pipeline. That would require them to traverse the tree and run a function on every matching node. Is this correct?

Concerning the current implementation, I'd really try to avoid using inspection of the decorated function if possible. @eugene-eeo's solution is the cleanest I can think of at the moment.

eugene-eeo commented 9 years ago

Yes, this issue suggests that a @prune decorator should be used, to comply with an unenforced protocol (it may sound sassy but its not meant to be).

rodricios commented 9 years ago

So sassy @eugene-eeo ;) Jk.

@msiemens to reiterate my point, what this branch proposes is a simplified workflow for a user.

Consider the current work required to add this particular, custom step and replace the get_node_length_pairs (after the parse_html, before the histogram step):

He or she must implement two things - a pruner1 and a quantifier (which I'll combine into a single function):

def my_pruner(etree)
    for node in etree.xpath("//text()"):  # <- pruner
        text = node.text
        textlen = len(' '.join(text.split())) if text else 0 # <- quantifier
        yield node, textlen

Now consider the following:

@pruner("//text()") # with eugene-eeo's suggestion
def my_pruner(node):
    text = node.text
    textlen = len(' '.join(text.split())) if text else 0 # <- quantifier
    yield node, textlen

Another reason to consider this change is that the for node in etree.xpath(selector) is repeated in get_node_counter_pairs and theoretically, it's going to keep on repeating itself if we add more node, numerical or Counter producers into libextract.quantifiers.

1* Just in case there's confusion, I'm using the term "pruning" from this definition

msiemens commented 9 years ago

Okay, in that light having a decorator totally makes sense. Personally I'd prefer @eugene-eeo's suggestion as it's less magic.

Regarding all the different technical terms used throughout the discussions, could we make something like a wiki page that explains them and maybe links to a more detailled definition?

rodricios commented 9 years ago

I added @eugene-eeo's suggestion in 33471f261f29c92564553c27d1d6dacc646cb425. And yeah, having a wiki is a good idea :P

As an aside, I'm personally in favor of not using technical terms as they generally add barriers; with that said, is there a term rather than pruner that fits this submodule more naturally?

eugene-eeo commented 9 years ago

I'd say rename the module to protocol.py and rename the function to @selects, and move the pruners (textlen, counters) to their previous modules (html.article, html.tabular)

rodricios commented 9 years ago

I'm working on refactoring argmax methods from html.article and html.tabular.

I think, philosophically speaking, those methods should be outside of those submodules, and arguably html.article and html.tabular should only wrap the intermediate functions, not contain the definitions themselves.

eugene-eeo commented 9 years ago

Argmax in tabular could just use coretools.argmax, yes I agree that they should live outside the module.

rodricios commented 9 years ago

f79e0e925fd87eeea0390acdef4a2bb1a32288bf

rodricios commented 9 years ago

Closing due to departure of architecture present during the activity of this issue; #33