alephdata / memorious

Lightweight web scraping toolkit for documents and structured data.
https://docs.alephdata.org/developers/memorious
MIT License

ENH metadata can be defined on attribute values #99

Closed moreymat closed 4 years ago

moreymat commented 4 years ago

This PR makes it possible to define metadata whose value is parsed from attribute values.

I needed to replace find() with xpath() because attribute selection requires the full XPath specification and not the subset supported by the ElementTree API (https://lxml.de/3.0/FAQ.html#what-are-the-findall-and-xpath-methods-on-element-tree). To be consistent, I also replaced occurrences of findall() and findtext() in parse and clean.
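To illustrate the difference, here is a minimal sketch using lxml directly. The HTML snippet and variable names are made up for the example and are not taken from memorious. find() can locate the img element, but the attribute value then has to be read in Python, whereas xpath() can select the attribute in the path expression itself.

```python
from lxml import html

# Made-up document for illustration only.
doc = html.fromstring(
    '<html><body><div class="entry">'
    '<img src="/media/photo.png"/>'
    '</div></body></html>'
)

# ElementTree-style path: selects the element, not the attribute.
img = doc.find('.//div[@class="entry"]/img')
src_via_find = img.get("src")

# Full XPath: the trailing /@src step selects the attribute value itself.
# xpath() returns a list of matches.
src_via_xpath = doc.xpath('.//div[@class="entry"]/img/@src')
```

Because the YAML config can only carry a path string, only the second form lets a crawler author reach an attribute value without writing custom Python.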

One can now define, in the YAML config file, a metadata field named source whose value is that of the src attribute of an img element:

name: test
description: test
# schedule: weekly
pipeline:
  init:
    [...]
  parse:
    method: parse
    params:
      [...]
      meta:
        source: './/div[@class="entry"]/img/@src'

The alternative would be to define custom parse and parse_for_metadata functions in every crawler where the wanted information is stored in attribute values. That would be cumbersome and verbose, hence this PR.
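As a rough sketch of how such a meta mapping could be applied to a parsed page: the helper extract_meta below is hypothetical and is not the actual parse_for_metadata code in memorious. It assumes each configured value is an XPath expression evaluated with lxml, and it keeps only the first match.

```python
from lxml import html


def extract_meta(doc, meta_paths):
    """Hypothetical helper: map field names to the first XPath match."""
    meta = {}
    for field, path in meta_paths.items():
        results = doc.xpath(path)
        if not isinstance(results, list):
            # Expressions such as count(...) return a plain number or boolean.
            meta[field] = results
            continue
        if not results:
            continue
        first = results[0]
        # Attribute and text() selections yield strings;
        # element selections yield elements, so take their text.
        meta[field] = first if isinstance(first, str) else first.text_content()
    return meta


# Made-up page for illustration only.
page = html.fromstring(
    '<html><body><h1>Hello</h1>'
    '<div class="entry"><img src="/media/photo.png"/></div>'
    '</body></html>'
)
result = extract_meta(
    page,
    {"title": ".//h1", "source": './/div[@class="entry"]/img/@src'},
)
```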

moreymat commented 4 years ago

@sunu If a PR with a narrower scope is preferable, what I need most is the replacement of calls to find, findall and findtext with their (functional) equivalents using xpath, in the parse and clean_html operations.

sunu commented 4 years ago

Hey @moreymat, I was waiting to check whether the changes could break any existing crawlers. But a quick read through the docs suggests that the syntax supported by find and findall is a subset of the full XPath spec, so it should be fine. I'll merge and make a release shortly. Thanks a lot for the PR.

moreymat commented 4 years ago

@sunu it's slightly more complicated: the ElementTree API supports a subset of the XPath specification plus an idiosyncratic notation for namespaces:

Another important difference is namespace handling, which uses the {namespace}tagname notation. This is not supported by XPath.

(https://lxml.de/3.0/FAQ.html#xpath-and-document-traversal)

Curly braces are used, for example, for DAV in fetch, so you might have missed a few breaking points :-/
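A small sketch of the namespace difference, using a made-up WebDAV-style response rather than the actual code in fetch: the {namespace}tagname form works with findtext(), the same expression is rejected by xpath(), and the XPath equivalent needs a prefix map instead.

```python
from lxml import etree

# Made-up WebDAV-style payload for illustration only.
xml = (
    b'<d:multistatus xmlns:d="DAV:">'
    b'<d:response><d:href>/files/report.pdf</d:href></d:response>'
    b'</d:multistatus>'
)
root = etree.fromstring(xml)

# ElementTree API: the {namespace}tagname notation is accepted.
clark = root.findtext('.//{DAV:}href')

# XPath: the same expression is not valid XPath syntax and raises an error.
try:
    root.xpath('.//{DAV:}href')
    xpath_accepts_clark = True
except etree.XPathError:
    xpath_accepts_clark = False

# XPath equivalent: declare a prefix and pass it via the namespaces argument.
prefixed = root.xpath('.//d:href/text()', namespaces={'d': 'DAV:'})
```

So any find/findall/findtext call that relies on curly-brace paths would need to be rewritten with a prefix map, not just renamed to xpath.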