disinfoRG / ZeroScraper

Web scraper made by 0archive.
https://0archive.tw
MIT License

Message queue to coordinate data processing #134

Open pm5 opened 3 years ago

pm5 commented 3 years ago

As we accumulate more and more scraped articles and snapshots, selecting the data to work on at each processing step will become slower. For example, selecting all new snapshots to parse with a SQL query would be slow and hard to run in parallel.

A message queue is needed to coordinate processing work and to make it easy to generate processing logs and stats. Kafka is solid and popular but adds to our maintenance costs. An alternative is to use the existing MySQL database as the transport and storage for the queue, since we need persistent messages to ensure correct data output.

We can use Kombu from the Celery project, which supports an SQLAlchemy transport for persisting messages in MySQL. Kombu offers both simple standard-library-style queues and more complex producer/consumer worker models.
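As a rough illustration, a MySQL-backed queue via Kombu's SimpleQueue interface could look like the sketch below. The connection URL, queue name, and payload fields are placeholders, not a finalized design.

```python
# Minimal sketch of a MySQL-backed queue using Kombu's SQLAlchemy transport.
# The connection URL, queue name, and payload fields are placeholders.
from kombu import Connection

# "sqla+mysql://..." tells Kombu to persist messages in MySQL through SQLAlchemy.
with Connection("sqla+mysql://user:password@localhost/zeroscraper") as conn:
    queue = conn.SimpleQueue("new-snapshots")

    # Producer side (ZeroScraper): enqueue the keys of a freshly scraped snapshot.
    queue.put({"article_id": 123, "snapshot_at": 1600000000})

    # Consumer side (ArticleParser): fetch one message, process it, then ack
    # so the persistent message is only removed after successful processing.
    message = queue.get(block=True, timeout=5)
    print(message.payload)  # {'article_id': 123, 'snapshot_at': 1600000000}
    message.ack()

    queue.close()
```

Going through SimpleQueue instead of an ad-hoc status column means producers and consumers only have to agree on the payload format, not on each other's table schemas.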

We can work on the message formats and flows first, until ArticleParser is decoupled from ZeroScraper's database schema at the Python level. That would open up several paths for future work: running multiple ArticleParsers in parallel, tracking processing tasks in a dashboard, reducing the database maintenance workload (since we could easily tell whether a set of snapshots has all been processed successfully), separating publishers from ArticleParsers, and so on.

pm5 commented 3 years ago

Initially, the whole data flow could look like the following:

                +----------+
                | Web page |
                +----+-----+
                     |
                     |
                     |
                     | scrape
                     v
              +-------------+                +---------------+               +---------------+
              |             |                |               |               |               |
              | ZeroScraper |                | ArticleParser |               | DataPublisher |
              |             |                |               |               |               |
              +--+----+-----+                +------------+--+               +-----------+---+
                 |    |                           ^   ^   |                      ^   ^   |
        Article  |  +-|---------------------------+   |   |                      |   |   |
 ArticleSnapshot |  | +-------+     +-----------------+   |    +-----------------|---+   |
                 |  |         |     |    +---------------------------------------+       |
                 v  |         v     v    |                v    |                         v
    +---------------+--+  +--------------+-----+    +----------+----+       +---------------+
    |    ScraperDb     |  |    KombuQueue      |    |   ParserDb    |       |  DataStorage  |
    |------------------|  |--------------------|    |---------------|       |---------------|
    |     Article      |  |   NewSnapshot      |    |   Publisher   |       |   datasets    |
    |  ArticleSnapshot |  |   NewPublication   |    |  Publication  |       |               |
    +------------------+  +--------------------+    +---------------+       +---------------+
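For reference, the two queues in the diagram would carry small, key-only messages. A hedged sketch with placeholder field names:

```python
# Placeholder payloads for the two KombuQueue message types.

# NewSnapshot: produced by ZeroScraper after a page is scraped,
# consumed by ArticleParser.
new_snapshot = {
    "article_id": 123,
    "snapshot_at": 1600000000,
}

# NewPublication: produced by ArticleParser after a snapshot is parsed,
# consumed by DataPublisher.
new_publication = {
    "publication_id": 456,
}
```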

One difference from the current setup is the ScraperDb ---> ArticleParser flow. Since KombuQueue provides the article ID and snapshot ID for every incoming snapshot, ScraperDb no longer needs to offer an interface for ArticleParser to query a range of data (currently done with a SELECT query). Instead, an API to fetch snapshots by their keys is enough, and that lookup is fast.
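A hedged sketch of that key-based lookup, assuming SQLAlchemy and table/column names (ArticleSnapshot, article_id, snapshot_at) that may differ from the actual ScraperDb schema:

```python
# Sketch of fetching one snapshot by primary key instead of scanning a range.
# Table and column names are assumptions, not the actual schema.
from typing import Optional

from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:password@localhost/zeroscraper")

def get_snapshot(article_id: int, snapshot_at: int) -> Optional[dict]:
    """Fetch a single snapshot by its (article_id, snapshot_at) key."""
    with engine.connect() as conn:
        row = conn.execute(
            text(
                "SELECT article_id, snapshot_at, raw_data "
                "FROM ArticleSnapshot "
                "WHERE article_id = :article_id AND snapshot_at = :snapshot_at"
            ),
            {"article_id": article_id, "snapshot_at": snapshot_at},
        ).mappings().first()
    return dict(row) if row is not None else None
```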