apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0

Configuration of resources #395

Closed MyraBaba closed 7 years ago

MyraBaba commented 7 years ago

Hi,

We have started testing StormCrawler. We are coming from WWI (Nutch :)).

1 - Is there any way to pass parameters such as maxdepth, checkValidURI, etc. on the command line? These normally live in the resources folder. If we change them in the file and recompile (the big jar) it works, but is there a solution that doesn't require recompiling?

2 - Where can we find the full list of parameters that can be used to configure all aspects of the crawler?

thx

jnioche commented 7 years ago

Hi

  1. I am afraid not. These values are taken from the file only and need recompiling and restarting the topology

  2. See [https://github.com/DigitalPebble/storm-crawler/wiki/Configuration] and the default values [https://github.com/DigitalPebble/storm-crawler/blob/master/core/src/main/resources/crawler-default.yaml]
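As an illustration of how the defaults get overridden (key names taken from crawler-default.yaml; check them against your version, as this is a sketch and not an exhaustive list), your own YAML file only needs the keys you want to change:

```yaml
# crawler.yaml — overrides the corresponding entries in crawler-default.yaml,
# no recompilation of the jar needed for these keys
http.agent.name: "my-crawler"
fetcher.server.delay: 1.0
```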

jnioche commented 7 years ago

I replied too quickly (blame Christmas) - if you look at [https://github.com/DigitalPebble/storm-crawler/blob/master/core/src/main/java/com/digitalpebble/stormcrawler/filtering/URLFilter.java#L40] you'll see that the URL filters get the configuration - which will be overridden by any key values passed on the command line. This means that in theory the conf set in the JSON file could be overridden by the config. The trouble is that the filters do not necessarily implement it, e.g. [https://github.com/DigitalPebble/storm-crawler/blob/master/core/src/main/java/com/digitalpebble/stormcrawler/filtering/depth/MaxDepthFilter.java].

This could of course be implemented, or you could write a variant of the provided filters to handle it.
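As a minimal, self-contained sketch of that idea (this is not the library's actual `MaxDepthFilter`, and the config key `maxdepth.filter.max.depth` is hypothetical): the filter lets a value passed in the Storm config override the default that would come from the JSON filter file.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a depth filter whose limit can be overridden via the topology
// configuration, falling back to the value parsed from the JSON filter file.
public class ConfigurableMaxDepthFilter {
    private int maxDepth;

    // jsonDefault stands in for the value read from urlfilters.json
    public void configure(Map<String, Object> stormConf, int jsonDefault) {
        // Hypothetical key name; a real filter would pick its own
        Object override = stormConf.get("maxdepth.filter.max.depth");
        maxDepth = (override != null)
                ? Integer.parseInt(override.toString())
                : jsonDefault;
    }

    // Returns null to reject the URL, mirroring URLFilter's contract
    public String filter(String url, int depth) {
        return (depth > maxDepth) ? null : url;
    }

    public static void main(String[] args) {
        ConfigurableMaxDepthFilter f = new ConfigurableMaxDepthFilter();
        Map<String, Object> conf = new HashMap<>();
        // e.g. a value supplied on the command line rather than compiled in
        conf.put("maxdepth.filter.max.depth", "2");
        f.configure(conf, 5);
        System.out.println(f.filter("http://example.com/a", 1)); // kept
        System.out.println(f.filter("http://example.com/b", 3)); // rejected -> null
    }
}
```

The point is only the lookup order in `configure()`: topology config first, JSON file value as the fallback, so the limit can change without rebuilding the jar.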

MyraBaba commented 7 years ago

Thanks, will look into it and let you know.