datalib / libextract

Extract data from websites using basic statistical magic
MIT License

New architecture, removed a lot of boilerplate #30

Closed rodricios closed 9 years ago

rodricios commented 9 years ago

In our current implementation, you'll find node generators appearing in many different modules:

The boilerplate code can be summed up in two functions (the names and definitions below are illustrative; they do not actually exist within libextract):

def iters(etree, *tags):
    for node in etree.iter(*tags): # <- generator
        # do something with node, then ...
        yield node  # ... or return

def processes(tpls, func, predicate):
    for tpl in tpls: # <- iterator
        if predicate(tpl):
            yield func(tpl)
        else:
            yield tpl
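To make the second pattern concrete, here is a runnable sketch with made-up data (processes is just the illustrative name from above) showing how it conditionally transforms a stream:

```python
def processes(tpls, func, predicate):
    # Conditionally transform a stream: apply *func* only where
    # *predicate* holds; pass everything else through untouched.
    for tpl in tpls:
        if predicate(tpl):
            yield func(tpl)
        else:
            yield tpl

pairs = [("div", 3), ("p", 0), ("table", 7)]
# Double the score of any node whose count is nonzero:
doubled = list(processes(pairs,
                         lambda t: (t[0], t[1] * 2),
                         lambda t: t[1] > 0))
# -> [("div", 6), ("p", 0), ("table", 14)]
```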

In this issue, I will directly address the first method by providing the decorator iters (and the xpath equivalent, selects) as replacements.

The second method is a little harder to concisely address with a single replacement decorator. Instead, I will demonstrate a second decorator that touches on the processes method, but is specific to the predictive aspect of libextract.

iters, selects

lxml's Element.iter and Element.xpath methods were turned into decorators:

# tags will designate which nodes to generate
def iters(*tags):
    # *fn* is the user's function (allowing him to do per-node logic)
    def decorator(fn): 
        def iterator(node, *args):
            for elem in node.iter(*tags):
                yield fn(elem, *args)
        return iterator
    return decorator

def selects(xpath):
    # magic words for choosing 
    # intricate xpath expressions 
    if xpath == "text":
        xpath = NODES_WITH_TEXT
    elif xpath == "tabular":
        xpath = NODES_WITH_CHILDREN
    def decorator(fn):
        def selector(node, *args):
            for n in node.xpath(xpath):
                yield fn(n, *args)
        return selector
    return decorator

That allows users to simply do this:

@iters('tr')
def get_rows(node):
    return node

rows = list(pipeline(r.content, (parse_html, get_rows)))

... yielding:

[<Element tr at 0x65ad778>,
 <Element tr at 0x65ad7c8>,
 <Element tr at 0x65ad818>,
 <Element tr at 0x65ad868>,
 <Element tr at 0x65ad8b8>,
 <Element tr at 0x65ad908>,
 <Element tr at 0x65ad958>,
 <Element tr at 0x65ad9a8>,
...]
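For context, pipeline above presumably just threads the parsed content through each function in order. A minimal sketch of such a helper (an assumption on my part, not necessarily libextract's actual implementation):

```python
def pipeline(data, fns):
    # Thread *data* through each function in *fns*, in order;
    # each step receives the previous step's result.
    for fn in fns:
        data = fn(data)
    return data

result = pipeline(2, (lambda x: x + 1, lambda x: x * 10))
# -> 30
```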

maximize

The second construct is the maximize decorator.

Before I demonstrate how to use this decorator, let me show you what it can (hopefully easily) replace in the current implementation of libextract:

# libextract/tabular.py
def node_counter_argmax(pairs):
    for node, counter in pairs:
        if counter:
            yield node, argmax(counter)

# libextract/coretools.py
def histogram(iterable):
    hist = Counter()
    for key, score in iterable:
        hist[key] += score
    return hist

def argmax(counter):
    return counter.most_common(1)[0]
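For example, fed a stream of (tag, score) pairs (the scores below are made up), these two compose like so:

```python
from collections import Counter

def histogram(iterable):
    # Accumulate per-key scores into a single Counter.
    hist = Counter()
    for key, score in iterable:
        hist[key] += score
    return hist

def argmax(counter):
    # The (key, total) pair with the highest total.
    return counter.most_common(1)[0]

scores = [("table", 3), ("div", 1), ("table", 2)]
best = argmax(histogram(scores))
# -> ("table", 5)
```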

As a quick side note, in #1, @Beluki voices this opinion:

For libextract, I think the best way to go about it is to write the functions as if combinations of them weren't available.

I take that to mean that he and others, including myself, would prefer to build web scraping/extraction algorithms from composable modules; in other words, with more transparency.

Why do I bring up @Beluki's comment? I believe the next new decorator, maximize, is in tune with his comment. Here's how you can recreate the TABULAR and ARTICLE black boxes:

from libextract.core import parse_html, pipeline
from libextract.generators import selects, maximize, iters
from libextract.metrics import StatsCounter

@maximize(5, lambda x: x[1].max())
@selects("tabular") # uses table-extracting xpath
def group_parents_children(node):
    return node, StatsCounter([child.tag for child in node])

@maximize(5, lambda x: x[1])
@selects("text") # uses text-extracting xpath
def group_nodes_texts(node):
    return node.getparent(), len(" ".join(node.text_content().split()))

tables = pipeline(r.content, (parse_html, group_parents_children,))
text = pipeline(r.content, (parse_html, group_nodes_texts,))

Here's the implementation:

from heapq import nlargest

# *max_fn* plays the same role as the "key"
# argument in "sort" and "sorted";
# *top* controls the number of elements to
# return (post-sorting)
def maximize(top=5, max_fn=select_score):
    # *fn* is a generator function that gets decorated
    # (like an iters-decorated custom method)
    def decorator(fn):
        def iterator(*args):
            return nlargest(top, fn(*args), key=max_fn)
        return iterator
    return decorator
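Stripped of lxml, the decorator behaves like this (a self-contained demo; the generator and its scores are made up, and a trivial default max_fn stands in for select_score):

```python
from heapq import nlargest

def maximize(top=5, max_fn=lambda pair: pair[1]):
    # Decorate a generator function so that calling it returns
    # only the *top* items, ranked by *max_fn*.
    def decorator(fn):
        def iterator(*args):
            return nlargest(top, fn(*args), key=max_fn)
        return iterator
    return decorator

@maximize(top=2)
def scored_nodes():
    # Stand-in for an iters/selects-decorated node generator.
    yield ("a", 1)
    yield ("b", 5)
    yield ("c", 3)

top_two = scored_nodes()
# -> [("b", 5), ("c", 3)]
```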

Hopefully this is enough to get the ball rolling towards the immediate goal of cleaning up libextract, as it somehow became cluttered in the short time this project's been alive.

CC @datalib/contrib

eugene-eeo commented 9 years ago

@rodricios it would be better not to have the "magic words" concept and to instead expose the XPaths themselves in the @selects decorator.

rodricios commented 9 years ago

@eugene-eeo I'm not sure I know how to do that. Do you mean to expose the xpaths via keyword arguments?

Example would be nice :)

eugene-eeo commented 9 years ago
from libextract.xpaths import TEXT

@selects(TEXT)
def process(node):
    return node

rodricios commented 9 years ago

oh, lol :+1:

eugene-eeo commented 9 years ago

Although it removes a lot of boilerplate, I have mixed feelings: it trades away a lot of simple composability for the syntactic sugar of decorators. Perhaps I need to get used to a more declarative style? Although good-looking, it is now harder to test. We could also end up with something called "decorator hell", like the click API:

@click.command()
@click.option('--count', default=1, help='Number of greetings.')
@click.option('--name', prompt='Your name',
              help='The person to greet.')
def hello(count, name):
    """Simple program that greets NAME for a total of COUNT times."""
    for x in range(count):
        click.echo('Hello %s!' % name)

Which admittedly looks nice, but once the options stack up everything just breaks down. Like Java annotations ;) I would recommend we take a step back to the original codebase, before the new architecture, and then try to refactor from there. Where things are going currently, there's too much magic, and damn, I hate magic.

eugene-eeo commented 9 years ago

Perhaps we should have a functional core and an OO layer over the algorithm that allows us to declare nicely in a composable manner:

from libextract.template import Algorithm
my_alg = Algorithm()

@my_alg.maximiser
def maximise(results):
    # code here

rodricios commented 9 years ago

I totally agree with taking a step back into the original codebase; possibly even use the code that I posted in the blog as a starting point. The decorators idea has fizzled out.

rodricios commented 9 years ago

I think before we continue with the "maximizer" approach, we should talk about whether we still want to continue down that path.

My mind has lately been around data cleaning/formatting, but I'm not sure if that's too big a change in direction?

eugene-eeo commented 9 years ago

IMHO, cleaning should probably belong in a separate repo, as well as formatting. I envision that libextract would simply be a package that depends on other simple, well built packages. Look at flask for example ;)

rodricios commented 9 years ago

Ok, I'm ok with refactoring formatters out of libextract. What do you think about removing the whole maximize feature? It's a move towards dropping the decorations :)

eugene-eeo commented 9 years ago

I like the concept but I just think it needs a more "solid" abstraction (see objects). I'll make a new branch and push the appropriate changes.

rodricios commented 9 years ago

Sounds good, looking forward to seeing what you come up with.

eugene-eeo commented 9 years ago

So far so good: I've implemented a declarative layer that allows one to do:

strategy = (
    select(xpath),
    process,
    rank_with(some_func),
    get_largest(5),
)
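One way such a strategy tuple could be executed is as a left fold over the steps; the runner and the toy steps below are my own sketch, not the code in the actual branch:

```python
from functools import reduce

def run_strategy(value, strategy):
    # Fold *value* through each step of the strategy, in order.
    return reduce(lambda acc, step: step(acc), strategy, value)

strategy = (
    lambda nodes: (n for n in nodes if n % 2 == 0),          # select
    lambda nodes: [(n, n * n) for n in nodes],               # process/rank
    lambda pairs: sorted(pairs, key=lambda p: p[1],
                         reverse=True)[:2],                  # get_largest(2)
)
top_two = run_strategy(range(10), strategy)
# -> [(8, 64), (6, 36)]
```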

Due to our statscounter dependency, I'm thinking about adding a process_scores_with function so that users can easily convert the StatsCounter objects into numbers:

process_scores_with(StatsCounter.max)
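If each strategy step is a callable over (node, score) pairs, process_scores_with might be as small as this (purely a sketch: the name is from the proposal above, the body is my assumption):

```python
def process_scores_with(func):
    # Build a strategy step that maps *func* over the score half
    # of each (node, score) pair, leaving the nodes untouched.
    def step(pairs):
        return [(node, func(score)) for node, score in pairs]
    return step

step = process_scores_with(max)
converted = step([("a", [1, 5, 2]), ("b", [4, 0])])
# -> [("a", 5), ("b", 4)]
```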

eugene-eeo commented 9 years ago

Also, there is now a way to make it easier for users to configure the strategies, namely the number of results returned, since that was historically a big configurability issue. Now you can do:

from libextract.tabular import build_strategy
from libextract.api import extract

r = extract(content, strategy=build_strategy(count=10))

And in the future we could also change the build_strategy function to accommodate more changes, e.g. the XPath query used, the scoring function, etc.
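A hypothetical shape for that function (the steps and the get_score parameter here are placeholders of my own; only count is wired through, as in the example above):

```python
def build_strategy(count=5, get_score=len):
    # Hypothetical sketch: bake configuration into the returned
    # tuple of steps; here just *count* and a scoring function.
    return (
        lambda nodes: [(n, get_score(n)) for n in nodes],     # rank
        lambda pairs: sorted(pairs, key=lambda p: p[1],
                             reverse=True)[:count],           # truncate
    )

steps = build_strategy(count=2)
ranked = steps[1](steps[0](["table", "p", "article"]))
# -> [("article", 7), ("table", 5)]
```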

rodricios commented 9 years ago

Closing