disinfoRG / ZeroScraper

Web scraper made by 0archive.
https://0archive.tw
MIT License

ZeroScraper

Scraper for news websites, content farms, and the PTT and Dcard forums.

0archive scrapes the websites provided in this target list.

You can set up your own website list by following the instructions in AIRTABLE.md.

Setup

We use MySQL. To set up the database connection, copy .env.default to .env and change DB_URL to your connection string. The connection string should start with mysql+pymysql:// so that SQLAlchemy uses the correct driver.
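
For example, a local MySQL setup might use a line like this in .env (user, password, and database name are placeholders for your own values):

DB_URL=mysql+pymysql://{user}:{password}@localhost:3306/{database}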

We use Python 3.7 and Pipenv to manage Python packages. Install Python packages with:

$ pip install pipenv
$ pipenv install

The database table of all websites can be updated from Airtable, if you followed the instructions above to set one up. You need an API key from Airtable (generate yours here) and the id of your base (find yours here). Then add the following variables to .env:

AIRTABLE_BASE_ID={id_of_your_airtable_base}
AIRTABLE_API_KEY={your_api_key}
SITE_TYPES=["{site_type_1}", "{site_type_2}",...]
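
For example (all values below are illustrative placeholders; use your own base id, API key, and the site types defined in your Airtable):

AIRTABLE_BASE_ID=appXXXXXXXXXXXXXX
AIRTABLE_API_KEY=keyXXXXXXXXXXXXXX
SITE_TYPES=["news", "forum"]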

Then, run the following commands to finish the setup:

$ pipenv shell          # start a shell in the virtual env
$ invoke migrate        # run database migrations
$ invoke update-sites   # update your site table

Run

The following commands assume that you're in the virtual env. Run pipenv shell before you start.

  1. Make sure you have a list of websites to crawl:

    $ python zs-site.py list
  2. Find new articles for a single site listed in the Site table, storing general info in the Article table and raw HTML in the ArticleSnapshot table (see the example invocation after this list):

    $ python zs-site.py discover {site-id}
    Optional Arguments:
        # crawler config
        --depth: maximum search depth. default = 5.
        --delay: delay time between each request. default = 1.5 (sec)
        --ua: user agent string. default is the Chrome v78 user-agent string.

        # site config
        --url: url to start the crawl from
        --article: regex of the article url pattern, e.g. '/story/(\d+).html'
        --following: regex of the url pattern to follow, e.g. 'index/(\d\d+).html'
  3. Find new articles for all ACTIVE sites listed in the Site table. Whether a site is active is determined by the is_active column in Airtable.

    $ python zs.py discover
    Optional Arguments:
            --limit-sec: time limit to run in seconds
    
    Site-specific arguments (depth, delay, and ua) should be specified in the config column of the Site table (see the config sketch after this list); otherwise the default values will be used.
  4. Revisit articles in the database based on the next_snapshot_at column of the Article table. This saves the new HTML to the ArticleSnapshot table and updates the snapshot parameters in the Article table.

    # update all articles
    $ python zs.py update
    Optional Arguments:
            --limit-sec: time limit to run in seconds
  5. Revisit articles in a specified site.

    $ python zs-site.py update {site-id}
    Optional Arguments:
            --delay: delay time between each request. default = 1.5 (sec)
            --ua: user agent string.
  6. Revisit one article regardless of next_snapshot_at or snapshot_count.

    $ python zs-article.py update {article-id}
    Optional Arguments:
            --selenium: use selenium to load the article.
  7. Discover a new article that does not yet exist in the DB, based on a provided url.

    $ python zs-article.py discover {url}
    Optional Arguments:
            --site-id: id of the site the url belongs to. default = 0
            --selenium: use selenium to load the article.
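
As an example of steps 2 and 3 above, a crawl of a hypothetical site might combine the optional arguments like this (the site id, url, and regexes are made up for illustration):

$ python zs-site.py discover 1 --depth 3 --delay 2 \
    --url 'https://news.example.com/' \
    --article '/story/(\d+).html' \
    --following 'index/(\d\d+).html'

For step 3, the config column of the Site table presumably holds the same crawler settings as JSON. A sketch, assuming that shape:

{"depth": 3, "delay": 2.0, "ua": "{your-user-agent-string}"}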

Hack

We use Python 3.7 and Pipenv to manage Python packages. Install Python packages, including dev dependencies, with:

$ pip install pipenv
$ pipenv install --dev
# only the first time
$ pre-commit install

ZeroScraper is generally structured the way Scrapy projects are.

Operate

Dump snapshot table

$ zs-dump.py --table ArticleSnapshotYYYYMM --output YYYYMM.jsonl
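
To spot-check the dump, assuming each line of the output file is one JSON object per snapshot, pretty-print the first record:

$ head -n 1 YYYYMM.jsonl | python -m json.tool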