alephdata / memorious

Lightweight web scraping toolkit for documents and structured data.
https://docs.alephdata.org/developers/memorious
MIT License

Support for media monitoring #153

Closed (pudo closed this issue 3 years ago)

pudo commented 3 years ago

Problem: we want to make media reporting, especially the articles and investigations published by OCCRP itself and its member centres, more accessible in Aleph. At the moment, the best way to do this is to crawl a news web site's HTML pages and index all of them. This has the following issues:

In order to improve this, I've introduced an Article schema in followthemoney 2.2, which describes a piece of news reporting. It's a pretty plain form of document. We should add a module to memorious that:

Sketch:

pipeline:
  parse:
    method: article
    params:
      match:
        or:
          - pattern: .*stories.*
          - xpath: .//div[@class="published-date"]
      parse:
        title:
          - .//h1[@class="title"]
        updatedAt:
          - .//div[@class="published-date"]
    handle:
      pass: store
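
To make the sketch above more concrete, here is a rough illustration of what an `article` operation might do with those params. It is not memorious code: the function names (`matches`, `extract`, `parse_article`), the choice to apply `pattern` as a regex against the page URL, and the shape of the returned dict are all assumptions made for illustration. It uses the standard library's `xml.etree.ElementTree`, which only handles well-formed markup and a subset of XPath; a real implementation would more likely use an HTML-tolerant parser.

```python
import re
import xml.etree.ElementTree as ET


def matches(rule, url, doc):
    """Evaluate a match rule from the config (hypothetical semantics)."""
    if "or" in rule:
        # Any one of the nested rules is enough.
        return any(matches(sub, url, doc) for sub in rule["or"])
    if "pattern" in rule:
        # Assumption: the pattern is a regex applied to the whole URL.
        return re.fullmatch(rule["pattern"], url) is not None
    if "xpath" in rule:
        # The rule holds if the XPath selects at least one element.
        return doc.find(rule["xpath"]) is not None
    return False


def extract(parse_spec, doc):
    """Map each property to the text of the first XPath that yields text."""
    properties = {}
    for prop, paths in parse_spec.items():
        for path in paths:
            element = doc.find(path)
            if element is None:
                continue
            text = "".join(element.itertext()).strip()
            if text:
                properties[prop] = text
                break
    return properties


def parse_article(params, url, html):
    """Return an Article-like dict, or None if the page is not an article."""
    doc = ET.fromstring(html)
    if not matches(params["match"], url, doc):
        return None
    return {"schema": "Article", "properties": extract(params["parse"], doc)}
```

Pages that fail the `match` rule return `None`, so a crawler could keep following their links without emitting an entity for them.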

Since this is necessarily dependent on followthemoney, we need to decide whether a) this lives in its own Python module, or b) it's time to make memorious depend on ftm. I could see the latter enabling us to do quite a few good things, and maybe also resolving some weird inverted dependencies (like ftm-store knowing about memorious).

Rosencrantz commented 3 years ago

WIP pull request now open here: https://github.com/alephdata/memorious/pull/167