apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0

Configuration of resources #395

Closed MyraBaba closed 7 years ago

MyraBaba commented 7 years ago

Hi,

We have started testing StormCrawler. We are coming from WWI (Nutch :)).

1 - Is there any way to pass parameters such as maxdepth, checkValidURI, etc. on the command line? These normally live in the resources folder. If we change them in the file and recompile (the big jar) it works, but is there a solution that doesn't require recompiling?

2 - Where can we find the full list of parameters that can be used to configure all aspects of the crawler?

thx

jnioche commented 7 years ago

Hi

  1. I am afraid not. These values are taken from the file only and need recompiling and restarting the topology

  2. See [https://github.com/DigitalPebble/storm-crawler/wiki/Configuration] and the default values [https://github.com/DigitalPebble/storm-crawler/blob/master/core/src/main/resources/crawler-default.yaml]
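As an illustration of how the defaults get overridden (key names taken from crawler-default.yaml; check them against your version, as this is a sketch and not an exhaustive list), your own YAML file only needs the keys you want to change:

```yaml
# crawler.yaml — overrides the corresponding entries in crawler-default.yaml,
# no recompilation of the jar needed for these keys
http.agent.name: "my-crawler"
fetcher.server.delay: 1.0
```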

jnioche commented 7 years ago

I replied too quickly (blame Christmas) - if you look at [https://github.com/DigitalPebble/storm-crawler/blob/master/core/src/main/java/com/digitalpebble/stormcrawler/filtering/URLFilter.java#L40] you'll see that the URL filters get the configuration - which will be overridden by any key values passed on the command line. This means that in theory the conf set in the JSON file could be overridden by the config. The trouble is that the filters do not necessarily implement it, e.g. [https://github.com/DigitalPebble/storm-crawler/blob/master/core/src/main/java/com/digitalpebble/stormcrawler/filtering/depth/MaxDepthFilter.java].

This could of course be implemented, or you could write a variant of the provided filters to handle it.
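As a minimal, self-contained sketch of that idea (this is not the library's actual `MaxDepthFilter`, and the config key `maxdepth.filter.max.depth` is hypothetical): the filter lets a value passed in the Storm config override the default that would come from the JSON filter file.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a depth filter whose limit can be overridden via the topology
// configuration, falling back to the value parsed from the JSON filter file.
public class ConfigurableMaxDepthFilter {
    private int maxDepth;

    // jsonDefault stands in for the value read from urlfilters.json
    public void configure(Map<String, Object> stormConf, int jsonDefault) {
        // Hypothetical key name; a real filter would pick its own
        Object override = stormConf.get("maxdepth.filter.max.depth");
        maxDepth = (override != null)
                ? Integer.parseInt(override.toString())
                : jsonDefault;
    }

    // Returns null to reject the URL, mirroring URLFilter's contract
    public String filter(String url, int depth) {
        return (depth > maxDepth) ? null : url;
    }

    public static void main(String[] args) {
        ConfigurableMaxDepthFilter f = new ConfigurableMaxDepthFilter();
        Map<String, Object> conf = new HashMap<>();
        // e.g. a value supplied on the command line rather than compiled in
        conf.put("maxdepth.filter.max.depth", "2");
        f.configure(conf, 5);
        System.out.println(f.filter("http://example.com/a", 1)); // kept
        System.out.println(f.filter("http://example.com/b", 3)); // rejected -> null
    }
}
```

The point is only the lookup order in `configure()`: topology config first, JSON file value as the fallback, so the limit can change without rebuilding the jar.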

MyraBaba commented 7 years ago

Thanks, will look into it and let you know.