ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Multiple --wpull-args options don't seem to be respected #80

Open ethus3h opened 8 years ago

ethus3h commented 8 years ago

When using this:

a() { cd /home/grabbot/grabs/ && grab-site --no-dupespotter --concurrency=5 --wpull-args=--warc-move=/home/grabbot/warcdealer/\ --phantomjs-scroll=50000\ --phantomjs-exe=/phantomjs-1.9.8-linux-x86_64/bin/phantomjs\ --content-on-error "$@"; }

Doing this:

a http://fanzub.com/ --concurrency=1 --delay=3000-10000 --wpull-args="--retry-connrefused --retry-dns-error --tries=1000"

doesn't seem to respect the --content-on-error argument.

Is this intended behavior? Thanks!

ivan commented 8 years ago

Indeed, it takes only the last --wpull-args. I'll leave this open until I figure out whether they can/should be combined if used multiple times.

ivan commented 8 years ago

Are --retry-connrefused --retry-dns-error something that grab-site should have on by default?

rwoodpecker commented 8 years ago

Yes please!

ethus3h commented 8 years ago

Regarding --retry-connrefused --retry-dns-error: Not sure; if a user wants them, the user can just add them. How hard is it to remove arguments that are there by default?

I'd like to have something like:

grab-site --wpull-args="--foo=1 --bar --baz=qux" http://example.org --remove-wpull-args="--baz" --append-wpull-args="--foo=2 --blah"

and have it run like:

grab-site --wpull-args="--foo=2 --bar --blah" http://example.org

To preserve backward compatibility, the current behavior of respecting only the final --wpull-args option should probably be retained.
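The merge semantics proposed above could be sketched roughly as follows. This is only an illustration: `merge_wpull_args` and `option_name` are hypothetical helpers, not part of grab-site, and --remove-wpull-args / --append-wpull-args are the proposed options, not existing ones. It assumes every option value is written in the --name=value form, so space-separated values such as "--tries 1000" are not handled.

```python
import shlex


def option_name(token):
    """Return the option name of a token, e.g. '--foo=1' -> '--foo'."""
    return token.split("=", 1)[0]


def merge_wpull_args(base, remove="", append=""):
    """Merge wpull argument strings (hypothetical helper, not grab-site code).

    base:   the value of --wpull-args
    remove: option names to drop (proposed --remove-wpull-args)
    append: options to add or override (proposed --append-wpull-args)
    Assumes values are always given as --name=value.
    """
    tokens = shlex.split(base)

    # Drop every base token whose option name was asked to be removed.
    removed = {option_name(t) for t in shlex.split(remove)}
    tokens = [t for t in tokens if option_name(t) not in removed]

    # An appended option replaces an existing one with the same name in
    # place; otherwise it goes at the end.
    for new in shlex.split(append):
        for i, old in enumerate(tokens):
            if option_name(old) == option_name(new):
                tokens[i] = new
                break
        else:
            tokens.append(new)
    return tokens


merged = merge_wpull_args(
    "--foo=1 --bar --baz=qux", remove="--baz", append="--foo=2 --blah"
)
print(" ".join(shlex.quote(t) for t in merged))
```

With the inputs from the example above, this yields the arguments --foo=2 --bar --blah, in that order.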

12As commented 8 years ago

FYI, according to the click docs: "Sometimes, you have options that take more than one argument. For options, only a fixed number of arguments is supported."

However, combining is possible using click's multiple options (http://click.pocoo.org/6/options/#multiple-options), which would allow --wpull-args to be specified multiple times.
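A minimal sketch of that approach, assuming a simplified stand-in command rather than grab-site's real entry point (`main` and `combine_wpull_args` are hypothetical names). With `multiple=True`, click passes every occurrence of the option to the function as a tuple, so the values can be concatenated instead of only the last one being kept.

```python
import shlex

import click


def combine_wpull_args(values):
    """Split each --wpull-args value shell-style and concatenate in order."""
    combined = []
    for value in values:
        combined.extend(shlex.split(value))
    return combined


@click.command()
@click.option(
    "--wpull-args",
    multiple=True,  # collect every occurrence into a tuple
    help="Extra arguments for wpull; may be given more than once.",
)
@click.argument("url")
def main(wpull_args, url):
    # Hypothetical stand-in: just print the combined argument list.
    click.echo(" ".join(combine_wpull_args(wpull_args)))
```

Invoked with two --wpull-args options, such a command would see both values rather than only the final one. Whether later occurrences should be appended or should override earlier ones is a design choice this sketch does not settle.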

As for the other question, --retry-dns-error is a "yes" for me because it is a broad category that covers many things, including transient errors. --retry-connrefused is a "no" as it is much narrower and could get the unwary in trouble for repeatedly connecting to a server after being banned.