datalib / libextract

Extract data from websites using basic statistical magic
MIT License

API: get ElementTree #34

Open bofm opened 9 years ago

bofm commented 9 years ago

The api.extract function returns a generator of HtmlElement objects. To analyze those results in relation to the HTML page they came from, it would be great to have a way to get the ElementTree object. This is required, for example, to get the XPath of an HtmlElement using etree.getpath(element), as described at http://lxml.de/xpathxslt.html#generating-xpath-expressions.

Currently I use the following lazy workaround:

from functools import partial

import requests
from libextract._compat import BytesIO
from libextract.core import parse_html, pipeline, select, measure, rank, finalise

def extract(document, encoding='utf-8', count=None):
    if isinstance(document, bytes):
        document = BytesIO(document)

    crank = partial(rank, count=count) if count else rank

    etree = parse_html(document, encoding=encoding)
    yield etree
    yield pipeline(
        select(etree),
        (measure, crank, finalise)
        )

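r = requests.get(url)  # url: address of the page to analyse
gen_extract = extract(r.content)
tree = next(gen_extract)       # first yield: the parsed ElementTree
textnodes = next(gen_extract)  # second yield: the extracted nodes
data_element = next(textnodes)  # <Element table at 0x36f1f60>
rows = data_element.iterfind('tr')
for row in rows:
    row_xpath = tree.getpath(row)
    print(row_xpath)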

# /html/body/div[2]/div[1]/div[2]/table/tr[1]
# /html/body/div[2]/div[1]/div[2]/table/tr[2]
# /html/body/div[2]/div[1]/div[2]/table/tr[3]
# ...

eugene-eeo commented 9 years ago

What do you think of returning a Result object? So you can do the following:

>>> r = extract(doc)
>>> r.tree
<ElementTree instance at 0x...>
>>> r.nodes
[node1, node2, node3]

Do you want to write a PR implementing the functionality? :)

rodricios commented 9 years ago

Hi @eugene-eeo, give me until Monday to look into what you're proposing. The internship's been keeping me busy, but I'll squeeze this in somehow :sweat_smile:

bofm commented 9 years ago

Straightforward: Extracted = namedtuple('Extracted', 'nodes, tree'). https://github.com/bofm/libextract/blob/nodes-and-tree/libextract/api.py The tests should be modified for a PR.
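
For readers following along, here is a minimal sketch of what that one-liner gives callers. The string values are hypothetical stand-ins: in libextract, nodes would presumably hold the extracted HtmlElement objects and tree the parsed ElementTree.

```python
from collections import namedtuple

# Hypothetical stand-ins for the real values: in libextract, `nodes` would be
# the extracted HtmlElement objects and `tree` the parsed lxml ElementTree.
Extracted = namedtuple('Extracted', 'nodes, tree')

result = Extracted(nodes=['node1', 'node2'], tree='tree')

# Fields are available by name...
nodes_by_name = result.nodes
tree_by_name = result.tree

# ...and by position, so ordinary tuple unpacking works too.
nodes, tree = result
```

Either access style returns the same objects, so callers who only want the nodes pay one attribute lookup.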

rodricios commented 9 years ago

@bofm: :+1: I'm in favor of using the namedtuple approach, and returning the tree alongside the HtmlElements

eugene-eeo commented 9 years ago

I don't see a reason why not, but I feel that a Result object is more intuitive as one can override some methods to allow the user to iterate over it:

>>> r = extract(doc)
>>> r.tree
<lxml.ElementTree>
>>> list(r)
[<Node>]

But once again, I think this suggestion boils down to how minimalist the library should be. I am personally in favour of the Result object approach since it helps the user a little more. A nice compromise would probably be to inherit from the namedtuple and add our own __iter__ method.

rodricios commented 9 years ago

@eugene-eeo: :+1: I am ok with this. While "minimalism" is cliché, it fits well with libextract.

I don't think we require anything more than a namedtuple inheritance at most, given that we aren't really providing anything more than an algorithm, at least at the moment.

bofm commented 9 years ago

I don't think it's a good idea to override the __iter__ method. Given an object of a namedtuple subclass, it is not obvious that iterating over it produces nodes. It is not a big overhead to add one line of code: nodes = r.nodes.

eugene-eeo commented 9 years ago

While it is not a big overhead, imagine if the whole Python language were designed so that whenever you needed to iterate over some object you had to do:

for item in obj.iter:
    pass

I think that kind of illustrates my point :) Another advantage is that it is more intuitive (whatever we name it; I'm going with Result, but Extracted is fine if we agree on it), and it lets users write quite expressive code:

extracted = extract(doc)
for item in extracted:
    print '#{0}'.format(item.get('id'))

Inheriting also allows us to add some docstrings in a nicer way:

class Result(namedtuple('Result', ['nodes', 'tree'])):
    """
    Describe the klass.
    """
    def __iter__(self):
        return iter(self.nodes)

bofm commented 9 years ago

I'm afraid somebody might fall into this trap:

class Result(namedtuple('Result', ['nodes', 'tree'])):
    """
    Describe the klass.
    """
    def __iter__(self):
        return iter(self.nodes)

r = Result(('node1', 'node2'), 'tree')
print r
nodes, tree = r
print 'nodes:', nodes
print 'tree:', tree

# Result(nodes=('node1', 'node2'), tree='tree')
# nodes: node1
# tree: node2

after which they would need to read the source or the docs to realize that __iter__ was overridden.

eugene-eeo commented 9 years ago

Fair enough :+1:

I'd advocate for inheritance just to add the docstring as there seems to be no nice way of adding it currently... correct me if I'm wrong.

bofm commented 9 years ago

The docstring is not a problem.

bofm commented 9 years ago

Oh, that was for the attributes, not for the class. Btw, __doc__ is writable, but only in Python 3. So yes, the subclass is the only easy way.
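
A small sketch of the Python 3 route mentioned above, for comparison with subclassing. The docstring texts are made up for illustration. On Python 2 the class-level assignment fails with an AttributeError, which is why the subclass remains the portable option.

```python
from collections import namedtuple

Extracted = namedtuple('Extracted', ['nodes', 'tree'])

# Python 3 only: the generated class accepts a new docstring directly.
# On Python 2 this assignment raises AttributeError.
Extracted.__doc__ = "Extraction result: the nodes plus the parsed tree."

# Since Python 3.5 the per-field docstrings are writable as well.
Extracted.nodes.__doc__ = "The extracted nodes."
Extracted.tree.__doc__ = "The ElementTree the nodes belong to."
```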

rodricios commented 9 years ago

What's the consensus on this? namedtuple subclass but no __iter__ override?

eugene-eeo commented 9 years ago

Yup.
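
To summarize the agreed shape, here is a rough sketch under the assumptions made in this thread: a namedtuple subclass that exists only to carry a docstring, with the default tuple iteration left alone. The class name, docstring wording, and sample values are placeholders, not the merged implementation.

```python
from collections import namedtuple


class Result(namedtuple('Result', ['nodes', 'tree'])):
    """Outcome of an extraction (hypothetical docstring).

    nodes -- the extracted elements
    tree  -- the parsed document the elements belong to
    """
    # Keep instances as light as plain tuples (no per-instance __dict__).
    __slots__ = ()


r = Result(nodes=('node1', 'node2'), tree='tree')

# No __iter__ override, so unpacking yields the fields themselves.
nodes, tree = r

# Iterating over the nodes takes one explicit attribute access.
collected = [node for node in r.nodes]
```

Because iteration is untouched, nodes, tree = r behaves the way any tuple user would expect, which avoids the unpacking surprise shown earlier.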