Closed rodricios closed 9 years ago
@rodricios it would be better to drop the "magic words" concept and expose the XPaths themselves in the @selects decorator.
@eugene-eeo I'm not sure I know how to do that. Do you mean to expose the xpaths via keyword arguments?
Example would be nice :)
from libextract.xpaths import TEXT

@selects(TEXT)
def process(node):
    return node
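For readers wondering how a decorator that takes the XPath directly might work, here is a minimal sketch. It is not libextract's implementation: it uses the standard library's xml.etree.ElementTree (whose findall supports only a subset of XPath) in place of lxml, and the value of TEXT below is a made-up stand-in for whatever libextract.xpaths.TEXT actually holds.

```python
import functools
import xml.etree.ElementTree as ET

# Hypothetical stand-in for libextract.xpaths.TEXT; the real constant may differ.
TEXT = ".//p"


def selects(xpath):
    """Decorator factory: apply the wrapped function to every node matching xpath."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(root):
            # ElementTree.findall only understands a limited XPath subset,
            # unlike lxml's xpath(); good enough for a sketch.
            for node in root.findall(xpath):
                yield func(node)
        return wrapper
    return decorator


@selects(TEXT)
def process(node):
    return node


root = ET.fromstring("<html><body><p>one</p><div><p>two</p></div></body></html>")
texts = [node.text for node in process(root)]
print(texts)
```

The XPath is now an ordinary argument, so there is no lookup table of "magic words" to maintain, and any other query string can be passed the same way.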
oh, lol :+1:
Although it removes a lot of boilerplate, I have mixed feelings: it trades a lot of simple composability for the syntactic sugar of decorators. Perhaps I need to get used to a more declarative style? It looks good, but it is now harder to test. We could also end up in "decorator hell", like the click API:
@click.command()
@click.option('--count', default=1, help='Number of greetings.')
@click.option('--name', prompt='Your name',
              help='The person to greet.')
def hello(count, name):
    """Simple program that greets NAME for a total of COUNT times."""
    for x in range(count):
        click.echo('Hello %s!' % name)
Which admittedly looks nice, but once the options stack up everything just breaks down. Like Java annotations ;) I would recommend that we take a step back to the original codebase, before the new architecture, and then try to refactor from there. The way things are going currently, there's too much magic and damn, I hate magic.
Perhaps we should have a functional core and an OO layer over the algorithm that allows us to declare nicely in a composable manner:
from libextract.template import Algorithm

my_alg = Algorithm()

@my_alg.maximiser
def maximise(results):
    # code here
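To make the idea concrete, here is one way such an object could be sketched. The Algorithm class, its run method, and the (node, score) pair format are all assumptions made for illustration; nothing here is taken from libextract itself.

```python
class Algorithm:
    """Hypothetical container that collects the steps of an extraction algorithm."""

    def __init__(self):
        self._maximiser = None

    def maximiser(self, func):
        # Register func and hand it back unchanged, so the plain function
        # can still be imported and tested on its own.
        self._maximiser = func
        return func

    def run(self, results):
        if self._maximiser is None:
            raise RuntimeError("no maximiser registered")
        return self._maximiser(results)


my_alg = Algorithm()


@my_alg.maximiser
def maximise(results):
    # Illustrative only: pick the (name, score) pair with the highest score.
    return max(results, key=lambda pair: pair[1])


best = my_alg.run([("div", 3), ("article", 7), ("span", 1)])
print(best)
```

Because the decorator only registers the function and returns it untouched, the functional core stays directly callable, which addresses the testability concern raised above.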
I totally agree with taking a step back into the original codebase; possibly even use the code that I posted in the blog as a starting point. The decorators idea has fizzled out.
I think before we continue with the "maximizer" approach, we should talk about if we still want to continue down that path.
My mind has lately been around data cleaning/formatting, but I'm not sure if that's too big a change in direction?
IMHO, cleaning should probably belong in a separate repo, as well as formatting. I envision that libextract would simply be a package that depends on other simple, well built packages. Look at flask for example ;)
Ok, I'm ok with refactoring formatters out of libextract. What do you think about removing the whole maximize feature? It's a move towards dropping the decorations :)
I like the concept but I just think it needs a more "solid" abstraction (see objects). I'll make a new branch and push the appropriate changes.
Sounds good, looking forward to seeing what you come up with.
So far so good, I've implemented a declarative layer that allows one to do:
strategy = (
    select(xpath),
    process,
    rank_with(some_func),
    get_largest(5),
)
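A rough sketch of how a tuple of steps like this could be executed: each step is a plain callable, and a runner folds the input through them in order. The bodies of select, process, rank_with, and get_largest below are guesses made for illustration (using the standard library's ElementTree rather than lxml), and run is a hypothetical helper, not the code on the branch.

```python
from functools import reduce
from heapq import nlargest
import xml.etree.ElementTree as ET


def select(xpath):
    """Step factory: find candidate nodes under the parsed root."""
    def step(root):
        return root.findall(xpath)
    return step


def process(nodes):
    """Step: keep only nodes that have at least one child."""
    return [node for node in nodes if len(node)]


def rank_with(score):
    """Step factory: pair every node with its score."""
    def step(nodes):
        return [(score(node), node) for node in nodes]
    return step


def get_largest(n):
    """Step factory: keep the n highest-scoring (score, node) pairs."""
    def step(pairs):
        return nlargest(n, pairs, key=lambda pair: pair[0])
    return step


def run(strategy, value):
    # Hypothetical runner: feed the output of each step into the next one.
    return reduce(lambda acc, step: step(acc), strategy, value)


strategy = (
    select(".//div"),
    process,
    rank_with(len),  # len(element) is the number of direct children
    get_largest(1),
)

root = ET.fromstring("<body><div><p/><p/><p/></div><div><p/></div><div/></body>")
top = run(strategy, root)
print([score for score, _ in top])
```

Since every step is an ordinary function, each one can be unit-tested in isolation and strategies can be rearranged or extended by editing the tuple.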
Due to our statscounter dependency I'm thinking about adding a process_scores_with function so that users can easily convert the statscounter objects into numbers:
process_scores_with(StatsCounter.max)
There is also now an easier way for users to configure the strategies, namely the number of results returned, which has historically been a big configurability issue. Now you can do:
from libextract.tabular import build_strategy
from libextract.api import extract
r = extract(content, strategy=build_strategy(count=10))
And in the future we could also change the build_strategy function to accommodate more options, e.g. the XPath query used, the scoring function, etc.
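As a sketch of what that could look like, a factory with keyword arguments and defaults gives each knob a single place to live. The xpath and score parameters and the extract stand-in below are assumptions made for illustration (they run on the standard library's ElementTree, not on libextract or lxml); only count appears in the snippet above.

```python
from heapq import nlargest
import xml.etree.ElementTree as ET


def build_strategy(count=5, xpath=".//table", score=len):
    """Return a tuple of steps; each keyword is a configurable knob.

    xpath and score are hypothetical parameters, not part of the current API.
    """
    return (
        lambda root: root.findall(xpath),
        lambda nodes: [(score(node), node) for node in nodes],
        lambda pairs: nlargest(count, pairs, key=lambda pair: pair[0]),
    )


def extract(content, strategy):
    # Simplified stand-in for libextract.api.extract: parse, then run each step.
    value = ET.fromstring(content)
    for step in strategy:
        value = step(value)
    return value


content = "<body><table><tr/><tr/></table><table><tr/></table><table/></body>"
results = extract(content, strategy=build_strategy(count=2))
scores = [score for score, _ in results]
print(scores)
```

Callers who are happy with the defaults pass nothing, and new options can be added later as extra keywords without breaking existing calls.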
Closing
In our current implementation, you'll find node generators appearing in many different modules:
The boilerplate code can be summed up in two functions (the names and definitions are trivial and do not actually exist within libextract):
In this issue, I will directly address the first method by providing the decorator iters (and the xpath equivalent, selects) as replacements.
The second method is a little harder to concisely address with a single replacement decorator. Instead, I will demonstrate a second decorator that touches on the processes method, but is specific to the predictive aspect of libextract.
iters, selects
The lxml.ElementTree.iters and lxml.ElementTree.xpath methods were turned into decorators:
That allows users to simply do this:
... yielding:
maximize
The second construct is the maximize decorator.
Before I demonstrate how to use this decorator, let me show you what it can easily(?) replace from the current implementation of libextract:
As a quick side note, in #1, @Beluki voices this opinion:
I take that to mean that he and others, including myself, would prefer to build web scraping/extraction algorithms from composable modules, or in other words, with more transparency.
Why do I bring up @Beluki's comment? I believe the next new decorator, maximizer, is in tune with his comment. Here's how you can recreate the TABULAR and ARTICLE blackboxes:
Here's the implementation:
Hopefully this is enough to get the ball rolling towards the immediate goal of cleaning up libextract, as it somehow became cluttered in the short time this project's been alive.
CC @datalib/contrib