disinfoRG / ZeroScraper

Web scraper made by 0archive.
https://0archive.tw
MIT License

Requirements for streamlined publisher #108

Closed (pm5 closed this issue 4 years ago)

pm5 commented 4 years ago

Some thoughts:

  1. Push changes to https://github.com/disinfoRG/datasets repo incrementally.
  2. Upload packaged dataset files to Google Drive daily.
  3. Upload newly published data to Elasticsearch incrementally.

(1) and (3) select data by parsing time. (2) groups data by producer. Considering the data size, we may have to publish every few hours.
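
To make the two selection modes concrete, here is a minimal sketch, not the project's actual code. The `Publication` record, its field names, and both helper functions are hypothetical, and it assumes parsing time is stored as a Unix timestamp. It shows picking up only records parsed since the last publish (for the incremental pushes in (1) and (3)) and grouping records by producer (for the packaged files in (2)).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Publication:
    # Hypothetical record shape; these field names are assumptions,
    # not the schema used by ZeroScraper or ArticleParser.
    publication_id: str
    producer_id: str
    parsed_at: int  # assumed: Unix timestamp of when the article was parsed


def select_incremental(
    records: Iterable[Publication], last_published_at: int
) -> Tuple[List[Publication], int]:
    """Return records parsed strictly after the checkpoint, oldest first,
    together with the checkpoint to store for the next run."""
    fresh = sorted(
        (r for r in records if r.parsed_at > last_published_at),
        key=lambda r: r.parsed_at,
    )
    # If nothing is new, keep the old checkpoint unchanged.
    new_checkpoint = fresh[-1].parsed_at if fresh else last_published_at
    return fresh, new_checkpoint


def group_by_producer(
    records: Iterable[Publication],
) -> Dict[str, List[Publication]]:
    """Bucket records by producer, e.g. to write one packaged file per producer."""
    groups: Dict[str, List[Publication]] = defaultdict(list)
    for record in records:
        groups[record.producer_id].append(record)
    return dict(groups)


if __name__ == "__main__":
    sample = [
        Publication("a", "producer-1", 100),
        Publication("b", "producer-2", 200),
        Publication("c", "producer-1", 300),
    ]
    fresh, checkpoint = select_incremental(sample, last_published_at=100)
    print([r.publication_id for r in fresh], checkpoint)
    print({k: len(v) for k, v in group_by_producer(sample).items()})
```

Running the publisher every few hours would then mean calling `select_incremental` with the stored checkpoint on each run. One caveat of this sketch: the strict `>` comparison skips a record that arrives later with the same timestamp as the checkpoint, so a real publisher would likely need a tie-breaker such as the record ID.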

pm5 commented 4 years ago

Oops, this belongs in ArticleParser.