USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0
412 stars, 143 forks

URL Injector/Config override and bug fixes #207

Closed buggtb closed 3 years ago

buggtb commented 3 years ago

This incoming PR will address the following requirements we had with Sparkler:

Inject Plugins: Plugins on the inject side allow users to massage the URLs being fed into Sparkler. For example, we wanted to be able to submit 1 url with a bunch of parameters and have Sparkler inject 1 url per parameter into Solr for crawling.

PUT/POST Requests: Sparkler will also be able to deal with PUT and POST requests as well as GET. If a user prepends the method to a URL, for example POST|http://us.cnn.com, Sparkler will POST the request. This can make life easier when doing searches etc. on various sites that expect a POST.
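As a rough illustration of the prefix convention described above, here is a minimal Python sketch of how a seed such as POST|http://us.cnn.com could be split into a method and a URL. This is not Sparkler's actual code: the function name `split_method`, the `KNOWN_METHODS` set, and the fallback to GET when no prefix is present are all assumptions made for the example.

```python
from typing import Tuple

# Assumption: only the verbs named in this PR are treated as prefixes.
KNOWN_METHODS = {"GET", "PUT", "POST"}


def split_method(seed: str, default: str = "GET") -> Tuple[str, str]:
    """Split 'METHOD|url' into (method, url).

    Hypothetical helper: falls back to `default` when the seed has no
    recognised method prefix.
    """
    prefix, sep, rest = seed.partition("|")
    if sep and prefix.upper() in KNOWN_METHODS:
        return prefix.upper(), rest
    # No recognised prefix: treat the whole string as the URL, so a URL
    # that merely contains '|' in its query string is left untouched.
    return default, seed


if __name__ == "__main__":
    print(split_method("POST|http://us.cnn.com"))
    print(split_method("http://us.cnn.com"))
```

Checking the prefix against a fixed set of verbs, rather than splitting on the first `|` unconditionally, keeps plain URLs that happen to contain a pipe character from being misread.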

Config Overload: You can now overload the config on the command line by passing in a JSON string with any additional or replacement keys. This lets you deploy a sensible default config and then apply per-site overrides or similar in a server environment, without having to rewrite the config file each time for plugin changes etc.
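To make the "additional or replacement keys" behaviour concrete, here is a small Python sketch of one way a JSON override string could be layered over a default config. It is only an illustration under stated assumptions: the PR does not say whether nested sections are merged key by key or replaced wholesale (this sketch merges them recursively), and the config keys used in the usage example are made up.

```python
import json
from typing import Any, Dict


def _deep_merge(base: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where `overrides` add to or replace keys in `base`."""
    merged = dict(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key (assumption).
            merged[key] = _deep_merge(merged[key], value)
        else:
            # Otherwise the override simply wins.
            merged[key] = value
    return merged


def merge_config(base: Dict[str, Any], override_json: str) -> Dict[str, Any]:
    """Apply a JSON string of overrides on top of a default config."""
    return _deep_merge(base, json.loads(override_json))


if __name__ == "__main__":
    # Hypothetical keys, for illustration only.
    defaults = {"fetcher": {"delay": 1000, "headers": {"User-Agent": "sparkler"}}}
    print(merge_config(defaults, '{"fetcher": {"delay": 500}, "extra": true}'))
```

The defaults are never mutated, so the same base config can be reused for several per-site overrides.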

Additional fields: There is also a new metadata field that plugins can make use of. In our case, our injector plugin uses it to take a selenium script and pass it to our fetcher plugin. Rather than make the field specific to selenium, we thought a metadata blob would be better so that other processes could make use of it, for example in our next feature.

URL Injector: The URL injector plugin now supports 4 modes:

1 Replace:

Pass in a single url and replace ${token} with each item from a list. For example:

URL: http://${token}.bbc.co.uk Token list: "sport", "news"

The plugin will return 2 urls, http://sport.bbc.co.uk and http://news.bbc.co.uk, for crawling.

2 Selenium:

Pass in a single url and a tokenised selenium script, and have the plugin replace the ${token} in the script with each item from the token list you pass in. Useful for selenium-driven search interfaces in the fetcher-chrome plugin and others.

3 JSON:

Pass in a single url and a tokenised JSON string, and have the plugin replace the ${token} in the string with a value and pass the result via the metadata field to the fetcher, so that it can POST the JSON to the URL.

4 FORM:

Submit a form with all of its fields filled in.
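The Replace and JSON modes above can be sketched in a few lines of Python. This is only an approximation of the idea, not the plugin's implementation: the function names, the `{"url": ..., "metadata": {"json": ...}}` seed shape, and the choice to escape tokens before inserting them into the JSON template are all assumptions made for illustration. Python's `string.Template` happens to use the same `${token}` placeholder syntax.

```python
import json
from string import Template
from typing import Any, Dict, List


def expand_url(url_template: str, tokens: List[str]) -> List[str]:
    """Replace mode: return one URL per token."""
    template = Template(url_template)
    return [template.safe_substitute(token=t) for t in tokens]


def expand_json(url: str, json_template: str, tokens: List[str]) -> List[Dict[str, Any]]:
    """JSON mode: same URL each time, one filled-in JSON body per token.

    The body is carried in a metadata blob (hypothetical key name "json")
    for a fetcher to POST later.
    """
    template = Template(json_template)
    seeds = []
    for t in tokens:
        # Escape the token as a JSON string and strip the surrounding
        # quotes, so quotes or backslashes in a token cannot break the body.
        escaped = json.dumps(t)[1:-1]
        body = template.safe_substitute(token=escaped)
        json.loads(body)  # fail early if the filled-in template is invalid
        seeds.append({"url": url, "metadata": {"json": body}})
    return seeds


if __name__ == "__main__":
    print(expand_url("http://${token}.bbc.co.uk", ["sport", "news"]))
    print(expand_json("http://example.com/search", '{"query": "${token}"}', ["sport"]))
```

The Selenium mode would follow the same pattern as `expand_json`, with the tokenised script taking the place of the JSON template.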

Chrome Fetcher Fixes: I've changed the chrome driver fetcher so that it now uses a transparent proxy. Before, because of the way selenium works, we couldn't get the response code from the browser, so we'd make another plain HTTP request on the Java side to get it. Now I've plugged in a proxy that lets us sniff the response, which saves grabbing the page twice and is more efficient.

Fetcher Default Fixes: When testing a site I found my headers weren't getting applied. It appears Fetcher Default is some weird half plugin, half not plugin, so its init() setup isn't called. I'll fix this.