Closed moreymat closed 4 years ago
@sunu If a PR with a narrower scope is preferable, what I need most is the replacement of calls to `find`, `findall` and `findtext` with their (functional) equivalents using `xpath`, in the `parse` and `clean_html` operations.
Hey @moreymat, I was waiting to check whether the changes could break any existing crawlers. But a quick read through the docs suggests that the syntax supported by `find` and `findall` is a subset of the full XPath spec, so it should be fine.
I'll merge and make a release shortly. Thanks a lot for the PR.
@sunu it's slightly more complicated: the ElementTree API supports a subset of the XPath specification plus idiosyncratic notation for namespaces:
Another important difference is namespace handling, which uses the {namespace}tagname notation. This is not supported by XPath.
(https://lxml.de/3.0/FAQ.html#xpath-and-document-traversal)
Curly braces are used, for example, for DAV in `fetch`, so you might have missed a few breaking points :-/
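To make the namespace difference concrete, here is a minimal sketch using lxml. The XML snippet and the namespace URI `http://example.com/ns` are made up for illustration (standing in for a namespace such as DAV); they are not taken from the crawler code.

```python
from lxml import etree

# Hypothetical namespaced document, for illustration only.
NS = "http://example.com/ns"
xml = (
    b'<d:multistatus xmlns:d="http://example.com/ns">'
    b"<d:response><d:href>/files/a.txt</d:href></d:response>"
    b"</d:multistatus>"
)
root = etree.fromstring(xml)

# ElementTree-style lookup: the {namespace}tagname notation is accepted.
href_find = root.find("{%s}response/{%s}href" % (NS, NS))

# XPath lookup: namespaces are passed as a prefix mapping instead.
href_xpath = root.xpath("d:response/d:href", namespaces={"d": NS})

# The curly-brace notation is not valid XPath, so xpath() rejects it.
try:
    root.xpath("{%s}response" % NS)
    curly_braces_accepted = True
except etree.XPathError:
    curly_braces_accepted = False
```

So any call that was converted from `find` to `xpath` while still using a `{namespace}tagname` path would need to be rewritten with a prefix and a `namespaces` mapping.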
This PR makes it possible to define metadata whose values are parsed from attribute values.
I needed to replace `find()` with `xpath()` because attribute selection requires the full XPath specification and not the subset supported by the ElementTree API (https://lxml.de/3.0/FAQ.html#what-are-the-findall-and-xpath-methods-on-element-tree). To be consistent, I also replaced occurrences of `findall()` and `findtext()` in `parse` and `clean`.
One can now define, in the YAML config file, a metadata field named `source` whose value is that of the `src` attribute of an `img`.
The alternative would be to define custom `parse` and `parse_for_metadata` functions in each crawler where the wanted information is stored in attribute values. That would be cumbersome and verbose, hence this PR.
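As a rough illustration of why `xpath()` is needed for attribute values, the sketch below compares the two approaches with lxml. The HTML snippet and the `first_value` helper are hypothetical and only show the idea; they are not the code from this PR.

```python
from lxml import html

# Hypothetical page, for illustration only.
doc = html.fromstring(
    '<html><body><div class="article">'
    '<img src="/static/cover.png" alt="Cover"><p>Body</p>'
    "</div></body></html>"
)

# With find(), the path can only select the element;
# the attribute has to be read in a separate step.
img = doc.find(".//img")
src_via_find = img.get("src") if img is not None else None

# With xpath(), the expression itself can select the attribute value.
src_via_xpath = doc.xpath(".//img/@src")


def first_value(tree, expression):
    """Hypothetical helper: first match of an XPath expression as a string."""
    results = tree.xpath(expression)
    if not isinstance(results, list):
        # Expressions such as string(...) return a scalar, not a list.
        return str(results)
    if not results:
        return None
    first = results[0]
    if isinstance(first, str):
        # Attribute and text() selections yield strings.
        return str(first).strip()
    # Element selections yield elements; join their text content.
    return "".join(first.itertext()).strip()
```

With a helper along these lines, a single expression per metadata field covers both element text and attribute values, which is what makes a per-crawler custom parser unnecessary.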