Open bofm opened 9 years ago
What do you think of returning a Result object? So you can do the following:
>>> r = extract(doc)
>>> r.tree
<ElementTree instance at 0x...>
>>> r.nodes
[node1, node2, node3]
Do you want to write a PR implementing the functionality? :)
Hi @eugene-eeo, give me until Monday to look into what you're proposing. The internship's been keeping me busy, but I'll squeeze this in somehow :sweat_smile:
Straightforward: Extracted = namedtuple('Extracted', 'nodes, tree').
https://github.com/bofm/libextract/blob/nodes-and-tree/libextract/api.py
The tests should be modified for a PR.
@bofm: :+1: I'm in favor of using the namedtuple approach, and returning the tree alongside the HtmlElements
I don't see a reason why not, but I feel that a Result object is more intuitive, as one can override some methods to allow the user to iterate over it:
>>> r = extract(doc)
>>> r.tree
<lxml.ElementTree>
>>> list(r)
[<Node>]
But once again, I think this suggestion boils down to how minimalist the library should be. I am personally in favour of the Result object approach since it helps the user a little more. A nice compromise would probably be to inherit from the namedtuple and add our own __iter__ method.
@eugene-eeo: :+1: I am ok with this. While "minimalism" is cliché, it fits well with libextract.
I don't think we require anything more than a namedtuple inheritance at most, given that we aren't really providing anything more than an algorithm, at least at the moment.
I don't think it's a good idea to override the __iter__ method. Given an object of a namedtuple subclass, it is not obvious that iterating over it produces nodes. It is not a big overhead to add one line of code: nodes = r.nodes.
While it is not a big overhead, imagine if the whole Python language were designed so that whenever you needed to iterate over some object you had to do:
for item in obj.iter:
    pass
I think that kind of illustrates my point :) Also the advantage is that it is more intuitive (depending on what you name it, I'm going with Result but if we agree on Extracted that's fine), and allows users to write quite expressive code:
extracted = extract(doc)
for item in extracted:
    print '#{0}'.format(item['id'])
Inheriting also allows us to add some docstrings in a nicer way:
class Result(namedtuple('Result', ['nodes', 'tree'])):
    """
    Describe the klass.
    """
    def __iter__(self):
        return iter(self.nodes)
I'm afraid somebody might fall into this trap:
class Result(namedtuple('Result', ['nodes', 'tree'])):
    """
    Describe the klass.
    """
    def __iter__(self):
        return iter(self.nodes)

r = Result(('node1', 'node2'), 'tree')
print r
nodes, tree = r
print 'nodes:', nodes
print 'tree:', tree
# Result(nodes=('node1', 'node2'), tree='tree')
# nodes: node1
# tree: node2
after which he would need to go to the sources or the docs to realize that __iter__ was overridden.
Fair enough :+1:
I'd advocate for inheritance just to add the docstring as there seems to be no nice way of adding it currently... correct me if I'm wrong.
The docstring is not a problem.
#python tip: How to customize a named tuple docstring:
Grid = namedtuple('Grid', ['x', 'y'])
Grid.x = property(Grid.x.fget, doc='abscissa')
— raymondh (@raymondh) April 26, 2015
Oh, that was for the attributes, not for the class. Btw, __doc__ is writable, but only in Python 3. So yes, the subclass is the only easy way.
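For reference, a minimal sketch of the subclass-for-docstring approach discussed above. The docstring wording and the empty `__slots__` are assumptions for illustration, not something agreed on in this thread, and the sketch uses the `print()` form so it also runs on Python 3.

```python
from collections import namedtuple

# A plain namedtuple only gets an auto-generated docstring.
Extracted = namedtuple('Extracted', 'nodes, tree')


class Result(namedtuple('Result', ['nodes', 'tree'])):
    """Hypothetical wording: the extracted nodes and the tree they belong to."""

    # Assumption: an empty __slots__ keeps instances free of a
    # per-instance __dict__, so they stay as lightweight as a tuple.
    __slots__ = ()


print(Extracted.__doc__)  # the auto-generated signature-style docstring
print(Result.__doc__)     # the docstring written on the subclass
```

The subclass adds no behavior of its own; it exists only to carry the class-level docstring.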
What's the consensus on this? namedtuple subclass but no __iter__ override?
Yup.
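A rough sketch of how that consensus might behave, assuming the class is called Result (the name was still open above) and using placeholder strings in place of real HtmlElement and ElementTree objects:

```python
from collections import namedtuple


class Result(namedtuple('Result', ['nodes', 'tree'])):
    """Placeholder docstring for the agreed-upon return type."""

    __slots__ = ()


# Placeholder values stand in for real HtmlElement and ElementTree objects.
r = Result(nodes=['node1', 'node2'], tree='tree')

# Without an __iter__ override, tuple unpacking follows the field order.
nodes, tree = r

# Iterating over the nodes takes one explicit attribute access.
collected = [node for node in r.nodes]
```

Because iteration is left untouched, `nodes, tree = r` yields the node list and the tree rather than the first two nodes, which avoids the surprise shown earlier in the thread.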
The api.extract function returns a generator of HtmlElement objects. If you need to analyze the results of api.extract in relation to the HTML page, then it would be great to have a way to get the ElementTree object. This is required (for example) to get the XPath of an HtmlElement using etree.getpath(element) as described on http://lxml.de/xpathxslt.html#generating-xpath-expressions. Currently I use the following lazy workaround: