ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Support custom Python3 script #75

Closed Arkiver2 closed 8 years ago

Arkiver2 commented 8 years ago

Please support a --python-script "script" option in grab-site, where the script is located in the directory from which grab-site is started.

ivan commented 8 years ago

I guess this would be another approach to #29

ivan commented 8 years ago

For now I think you can make a copy of ~/.local/lib/python3.4/site-packages/libgrabsite/wpull_hooks.py and use

grab-site --wpull-args=--python-script=modified_wpull_hooks.py

Arkiver2 commented 8 years ago

So ~/.local/lib/python3.4/site-packages/libgrabsite/wpull_hooks.py is not used at all when the grab is started with grab-site --wpull-args=--python-script=modified_wpull_hooks.py?

The option for custom scripts will be used for videobot.

ivan commented 8 years ago

Right, that --wpull-args=--python-script= should replace it entirely. Make sure to pass the absolute path to the script.
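A minimal sketch of the workaround described above: copy the stock hooks file, then build the command with an absolute path. The helper name prepare_custom_hooks is hypothetical, and the stock path is the python3.4 one quoted in this thread, which may differ on other installs. The URL to crawl is not shown in the thread, so it is left out here as well.

```python
import shutil
from pathlib import Path


def prepare_custom_hooks(stock_hooks: Path, work_dir: Path) -> list:
    """Copy the stock hooks file and build a grab-site command that uses the copy.

    Hypothetical helper for illustration; it is not part of grab-site.
    """
    # Copy first, so edits never touch the installed libgrabsite package.
    custom = work_dir / "modified_wpull_hooks.py"
    shutil.copyfile(str(stock_hooks), str(custom))
    # resolve() yields an absolute path, so the result does not depend on
    # the directory grab-site is later started from.
    absolute = custom.resolve()
    # The URL to crawl would be appended to this list by the caller.
    return ["grab-site", "--wpull-args=--python-script={}".format(absolute)]


if __name__ == "__main__":
    # Path as quoted in the thread; adjust for your Python version.
    stock = Path.home() / ".local/lib/python3.4/site-packages/libgrabsite/wpull_hooks.py"
    if stock.exists():
        print(" ".join(prepare_custom_hooks(stock, Path.cwd())))
    else:
        print("stock hooks not found at", stock)
```

Edit the copied modified_wpull_hooks.py before starting the grab. Since the copy replaces the stock hooks entirely, anything deleted from it is simply gone for that crawl.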

ivan commented 8 years ago

I really should have a better way to do this, though, because wpull_hooks.py is a lot of code.

Arkiver2 commented 8 years ago

Ok, I'll try it out. Do you think we'll have some automatic merging of a custom script and the script from grab-site in the future?

ivan commented 8 years ago

I hope someone can contribute that! It would also be helpful to know which things you end up overriding.

I think there are at least two ways to do it:

1) Do some refactoring; put the hooks in a class in libgrabsite/wpull_hooks.py; allow users to subclass it and add something like --behavior-script to grab-site.

2) Document a way to exec libgrabsite/wpull_hooks.py from your own wpull_hooks.py and then modify whatever needs changing.
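To make options 1 and 2 concrete, here is a rough sketch of each. Every name in it (StockHooks, MyHooks, accept_url, STOCK_SOURCE, load_and_patch) is made up for illustration and does not reflect the actual contents of libgrabsite/wpull_hooks.py or of wpull's hook API.

```python
# Option 1: hooks grouped in a class that users can subclass.
class StockHooks:
    """Hypothetical stand-in for the hooks in libgrabsite/wpull_hooks.py."""

    def accept_url(self, url: str) -> bool:
        # Stock behavior in this sketch: accept every URL.
        return True


class MyHooks(StockHooks):
    """What a user's --behavior-script might contain under option 1."""

    def accept_url(self, url: str) -> bool:
        # Reuse the stock decision, then add one extra rule.
        return super().accept_url(url) and not url.endswith(".iso")


# Option 2: exec the stock hooks source, then modify whatever needs changing.
# This string stands in for the contents of libgrabsite/wpull_hooks.py.
STOCK_SOURCE = """
def accept_url(url):
    return True
"""


def load_and_patch(stock_source: str) -> dict:
    namespace = {}
    # Run the stock hooks so that all of their functions land in `namespace`.
    exec(compile(stock_source, "wpull_hooks.py", "exec"), namespace)
    stock_accept = namespace["accept_url"]

    def accept_url(url):
        # Wrap the stock function instead of copying its code.
        return stock_accept(url) and not url.endswith(".iso")

    # Stock functions that look up accept_url by name now see the wrapper.
    namespace["accept_url"] = accept_url
    return namespace
```

In both variants the custom script only contains the parts that differ, so it does not need to be re-synced by hand when the stock hooks change.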

ivan commented 8 years ago

Or 3) allow specifying multiple wpull_hooks.py files. Not sure if this needs changes in wpull.
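One way option 3 could behave, sketched independently of wpull: exec each hooks file in order into a shared namespace, so that later files can wrap or replace what earlier ones defined. The function name load_hook_files is hypothetical, and whether wpull itself would need changes to support this is left open, as noted above.

```python
from pathlib import Path


def load_hook_files(paths) -> dict:
    """Exec each hook file in order into one shared namespace.

    Hypothetical loader for illustration; not how wpull loads scripts.
    """
    namespace = {}
    for path in paths:
        source = Path(str(path)).read_text()
        # Later files see, and may replace, names defined by earlier ones.
        exec(compile(source, str(path), "exec"), namespace)
    return namespace
```

With this ordering, the stock wpull_hooks.py would be listed first and the user's file second, and the user's file can refer to the stock functions by name.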

ivan commented 8 years ago

I am working on something that I hope will work.

ivan commented 8 years ago

This is implemented in c37b32bd1c95a39b7af92917c20d423c26b183af.

If you find that it's missing something you need (e.g. calling some function from libgrabsite's wpull_hooks.py), please file a new issue.