ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Add config file using configparser #91

Closed — 12As closed this issue 7 years ago

12As commented 8 years ago

Python ships with configparser, a parser for INI-style config files. I think it would allow more complexity without having to clutter everything up with environment variables and command-line options. I've run into a couple of situations where that might be useful: for example, configuring SSL parameters and certificates, specifying deque memory size, and setting default options for jobs. Per-site sections could be matched by stripping parts of the domain name until a section matches, falling back to the defaults if none does.

One thing I'm not sure of is where the file should live and which copy should win if several are found. I was thinking of the user's home directory, /etc, XDG_CONFIG_HOME, and the root of virtual environments.

An example file is below:

[grab-site]
port=29000
interface=127.0.0.1
concurrency=6
global_ignore_sets=global
youtube-dl_path=/home/grab-site/youtube-dl-latest/bin/youtube-dl
finished-warc-dir=/WARCsGoHere

[reddit.com]
concurrency=2
additional_ignore_sets=reddit

[yahoo.com]
concurrency=11
delay=0
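The "stripping parts of the domain name" idea above can be sketched with the standard-library configparser. This is only a hypothetical illustration of the proposal, not grab-site code; the section names, option names, and the load_config/options_for helpers are invented here:

```python
import configparser

# A trimmed-down version of the example config proposed above.
SAMPLE = """\
[grab-site]
port=29000
concurrency=6

[reddit.com]
concurrency=2
"""

def load_config(text):
    # In the real proposal this would read from ~/.config, /etc, etc.;
    # here we parse from a string for a self-contained example.
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser

def options_for(parser, host):
    """Merge [grab-site] defaults with the best-matching site section.

    Strips leading labels from the hostname until a section matches:
    www.reddit.com -> reddit.com -> com, falling back to the defaults.
    """
    merged = dict(parser["grab-site"])
    labels = host.split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if parser.has_section(candidate):
            merged.update(parser[candidate])
            break
    return merged

opts = options_for(load_config(SAMPLE), "www.reddit.com")
# concurrency comes from [reddit.com], port from the [grab-site] defaults
```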

ivan commented 7 years ago

I don't think I'm ever going to do this. grab-site is really supposed to work without you having to think up per-site arguments, and where this isn't the case (e.g. --no-dupespotter or --igsets=reddit), those are bugs/limitations. I would recommend writing shell functions that spawn grab-site with the arguments you need; you can have multiple functions for spawning grab-site with different argument sets.
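The suggested workaround might look like the following in a ~/.bashrc. The function names and argument combinations are made up for illustration; --igsets, --no-dupespotter, and --concurrency are real grab-site flags mentioned in this thread or its README:

```shell
# One wrapper function per "profile" of arguments, instead of a config file.
grab-reddit() {
    grab-site --igsets=reddit --concurrency=2 "$@"
}

grab-nodupes() {
    grab-site --no-dupespotter "$@"
}
```

Then `grab-reddit https://reddit.com/r/example` runs grab-site with the reddit ignore set and reduced concurrency, while any extra arguments pass through via `"$@"`.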