ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Call Script To Rewrite URL Matching Specific Regex? #105

Closed: brandongalbraith closed this issue 6 years ago

brandongalbraith commented 6 years ago

Is it possible to call an external script to rewrite URLs that match a certain regex pattern? Retrieving images from Photobucket URLs requires PB_Shovel to obtain the direct URL of each image, which sidesteps Photobucket's image ransom.

Example logs:

2017-10-24 02:32:35,092 - wpull.processor.web - INFO - Fetching ‘http://s139.photobucket.com/user/clintcummins/media/08614mcnfapron/IMG_3083.jpg.html’.
2017-10-24 02:32:35,439 - wpull.processor.web - INFO - Fetched ‘http://s139.photobucket.com/user/clintcummins/media/08614mcnfapron/IMG_3061.jpg.html’: 200 OK. Length: unspecified [text/html; charset=utf-8].
2017-10-24 02:32:35,488 - wpull.processor.web - INFO - Fetching ‘http://s139.photobucket.com/albums/q317/clintcummins/08614mcnfapron/IMG_3095.jpg’.
2017-10-24 02:32:35,609 - wpull.processor.web - INFO - Fetched ‘http://s139.photobucket.com/albums/q317/clintcummins/08614mcnfapron/IMG_3095.jpg’: 302 Found. Length: 268 [text/html; charset=iso-8859-1].
2017-10-24 02:32:35,624 - wpull.processor.web - INFO - Fetching ‘http://s139.photobucket.com/user/clintcummins/media/08614mcnfapron/IMG_3095.jpg.html’.
2017-10-24 02:32:35,706 - wpull.processor.web - INFO - Fetched ‘http://s139.photobucket.com/user/clintcummins/media/08614mcnfapron/IMG_3083.jpg.html’: 200 OK. Length: unspecified [text/html; charset=utf-8].
2017-10-24 02:32:35,753 - wpull.processor.web - INFO - Fetching ‘http://s139.photobucket.com/albums/q317/clintcummins/08614mcnfapron/IMG_3093.jpg’.
2017-10-24 02:32:35,843 - wpull.processor.web - INFO - Fetched ‘http://s139.photobucket.com/albums/q317/clintcummins/08614mcnfapron/IMG_3093.jpg’: 302 Found. Length: 268 [text/html; charset=iso-8859-1].
2017-10-24 02:32:35,850 - wpull.processor.web - INFO - Fetching ‘http://s139.photobucket.com/user/clintcummins/media/08614mcnfapron/IMG_3093.jpg.html’.

ivan commented 6 years ago

It is possible: I have added an example that does this to custom_hooks_sample.py. Hopefully you will find that useful for writing your own hook. Thanks for the good question.
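
For readers without the sample file at hand, here is a rough sketch of the core idea: match a URL against a regex and, on a match, ask an external command for the replacement. This is not the code from custom_hooks_sample.py. The pattern, the `rewrite_url` helper, and the stand-in command are all assumptions made for illustration. The real PB_Shovel invocation is not shown in this thread, so substitute your own command. How the rewritten URL is fed back into the crawl depends on the hook API of your grab-site/wpull version.

```python
import re
import subprocess
import sys

# Assumed pattern for Photobucket ".jpg.html" viewer pages, modeled on the
# URLs in the logs above. Adjust it to the URLs you actually need to rewrite.
PHOTOBUCKET_PAGE = re.compile(
    r"^https?://s\d+\.photobucket\.com/user/[^/]+/media/.+\.html$"
)


def rewrite_url(url, command):
    """Return the URL printed by `command` if `url` matches the pattern.

    `command` is an argument list; the URL is appended as the last argument.
    The original URL is returned if it does not match, or if the command
    fails, times out, or prints nothing.
    """
    if not PHOTOBUCKET_PAGE.match(url):
        return url
    try:
        result = subprocess.run(
            command + [url], capture_output=True, text=True, timeout=60
        )
    except (OSError, subprocess.TimeoutExpired):
        return url
    rewritten = result.stdout.strip()
    if result.returncode != 0 or not rewritten:
        return url
    return rewritten


# Stand-in for the external script (hypothetical): it only strips the
# trailing ".html" and does NOT resolve the real image URL.
STAND_IN = [sys.executable, "-c", "import sys; print(sys.argv[1][:-5])"]

if __name__ == "__main__":
    page = ("http://s139.photobucket.com/user/clintcummins/media/"
            "08614mcnfapron/IMG_3083.jpg.html")
    print(rewrite_url(page, STAND_IN))
```

Falling back to the original URL on any failure keeps a broken or slow external script from silently dropping URLs from the crawl.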

brandongalbraith commented 6 years ago

Thank you @ivan!