disinfoRG / ZeroScraper

Web scraper made by 0archive.
https://0archive.tw
MIT License

ns-dump and ns-load commands #103

Closed pm5 closed 4 years ago

pm5 commented 4 years ago

Per discussions here, we need CLI tools to dump and load data to and from a textual format. For snapshots we should have JSONLines dumps, via the following tools:

usage: ns-dump.py [-h] -t TABLE [-o OUTPUT] [-r DATE_RANGE]

dump article snapshot table data into JSONLines format

optional arguments:
  -h, --help            show this help message and exit
  -t TABLE, --table TABLE
                        name of the snapshot table to dump
  -o OUTPUT, --output OUTPUT
                        output filename; to STDOUT if not provided
  -r DATE_RANGE, --date-range DATE_RANGE
                        select only snapshots taken in given date range
                        specified in '<start_date>:<end_date>' or
                        '<duration>:<end_date>'; date format must be 'YYYY-MM-
                        DD'; duration may be '<n>d', '<n>w'.

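The issue does not show the implementation, but the `-r` spec above could be parsed along these lines (the function name, and the assumption that the range is inclusive of both endpoints, are mine):

```python
import re
from datetime import date, timedelta

def parse_date_range(spec):
    """Parse '<start_date>:<end_date>' or '<duration>:<end_date>'.

    Dates are 'YYYY-MM-DD'; a duration is '<n>d' (days) or '<n>w' (weeks).
    Returns a (start, end) pair of datetime.date objects.
    """
    first, _, end_s = spec.partition(":")
    end = date.fromisoformat(end_s)
    m = re.fullmatch(r"(\d+)([dw])", first)
    if m:
        # Duration form: count back n days or weeks from the end date.
        n, unit = int(m.group(1)), m.group(2)
        start = end - (timedelta(days=n) if unit == "d" else timedelta(weeks=n))
    else:
        # Explicit start date form.
        start = date.fromisoformat(first)
    return start, end
```

For example, `parse_date_range("2w:2020-01-15")` yields the same range as `parse_date_range("2020-01-01:2020-01-15")`.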
and, for loading:

usage: ns-load.py [-h] -t TABLE [-i INPUT]

load article snapshot table data from JSONLines format

optional arguments:
  -h, --help            show this help message and exit
  -t TABLE, --table TABLE
                        name of the snapshot table to load
  -i INPUT, --input INPUT
                        input filename; from STDIN if not provided

For articles and sites we can use mysqldump. A full dump currently takes only 82 MB when bzipped.

pm5 commented 4 years ago

Done.