alephdata / memorious

Lightweight web scraping toolkit for documents and structured data.
https://docs.alephdata.org/developers/memorious
MIT License

Parse and store metadata before emitting URLs to fetch in parse operation #121

sunu closed 4 years ago

sunu commented 4 years ago

... to make sure metadata gets passed to the next stage along with the fetched content.

Here's a crawler config to test-run the example in the docs (https://memorious.readthedocs.io/en/latest/buildingcrawler.html#parse):

name: glitch_parse
description: Parse metadata test
pipeline:
  init:
    method: seed
    params:
      urls: 
        - https://uncovered-calico-random.glitch.me/
    handle:
      pass: fetch
  fetch:
    method: fetch
    handle:
      pass: parse 
  parse:
    method: parse
    params:
      store:
        mime_group: documents
      include_paths:
        - './/article'
      meta:
        creator: './/article/p[@class="author"]'
        title: './/h1'
      meta_date:
        published_at: './/article/time'
        updated_at: './/article//span[@id="updated"]'
    handle:
      fetch: fetch
      store: store
  store:
    method: inspect
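
To illustrate the ordering this PR is about, here is a minimal sketch in plain Python. It is not memorious's implementation: `PAGE`, `extract_meta` and `parse_page` are hypothetical names, the markup is an invented stand-in for the test page, and the standard library's `xml.etree.ElementTree` is swapped in for the real HTML parser. Only the XPath expressions are taken from the config above. The point is that the metadata is collected first, so every item emitted afterwards carries it along.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the test page; the real page's markup is assumed.
PAGE = """
<html><body>
  <h1>Example title</h1>
  <article>
    <p class="author">Jane Doe</p>
    <time>2019-01-01</time>
    <span id="updated">2019-02-01</span>
    <a href="/next">Next</a>
  </article>
</body></html>
"""

# XPath expressions copied from the `meta` and `meta_date` params above.
META_PATHS = {
    "creator": './/article/p[@class="author"]',
    "title": ".//h1",
    "published_at": ".//article/time",
    "updated_at": './/article//span[@id="updated"]',
}


def extract_meta(root, paths):
    """Return the stripped text of the first match for each XPath."""
    meta = {}
    for key, path in paths.items():
        node = root.find(path)
        if node is not None and node.text:
            meta[key] = node.text.strip()
    return meta


def parse_page(root, paths):
    """Collect metadata first, then yield one item per link carrying it."""
    meta = extract_meta(root, paths)
    for link in root.iterfind(".//article//a"):
        yield {"url": link.get("href"), **meta}


items = list(parse_page(ET.fromstring(PAGE), META_PATHS))
print(items)
```

If the links were emitted before the metadata was extracted, the items handed to the next stage would only contain `url`.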

sunu commented 4 years ago

Side note: the test for parse doesn't really cover the case we're dealing with in this PR. I wonder if it'd be better to spin up a local server to test against.
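
One possible shape for such a local server, sketched with only the Python standard library. `FIXTURE`, `FixtureHandler` and `fetch_from_local_server` are hypothetical names, not part of memorious or its test suite, and the fixture markup is invented. The server binds to port 0 so the OS picks a free port, serves a fixed page from a background thread, and is shut down once the request completes.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical fixture page served for every GET request.
FIXTURE = b"<html><body><article><h1>Fixture title</h1></article></body></html>"


class FixtureHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(FIXTURE)))
        self.end_headers()
        self.wfile.write(FIXTURE)

    def log_message(self, format, *args):
        pass  # keep test output quiet


def fetch_from_local_server():
    # Port 0 asks the OS for any free port.
    server = HTTPServer(("127.0.0.1", 0), FixtureHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = "http://127.0.0.1:%d/" % server.server_address[1]
        # Bypass any proxy settings from the environment for the local request.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(url, timeout=5) as response:
            return response.read()
    finally:
        server.shutdown()
        server.server_close()
        thread.join()


body = fetch_from_local_server()
print(body == FIXTURE)
```

In a real test, the crawler's seed URL would point at the local address instead of fetching it with `urllib` directly.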