Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Add an action checkconfig? #254

Closed liar666 closed 8 years ago

liar666 commented 8 years ago

It would be great to be able to run an action like: ./collector-http.sh -a checkcfg -c test.xml

or -a test, -a dry-run, -a check, or -a configtest

That would just (syntactically) validate the configuration file before running an erroneous crawler, in the same way that you run apache2ctl configtest before crashing your production website :)

essiembre commented 8 years ago

What do you mean by "crashing your production website"? Your configuration may be "valid" but still cause issues for your site if, for instance, you configure it to crawl too aggressively.

Often you will see errors generated upon startup if you have major configuration issues. Do you have a problematic example that a separate validation step would have prevented? Maybe such an issue should be fixed in the code.

Because people can create their own class implementations and have their own configs for it, it is not possible to validate if all options are what they should be. Also, sometimes there might be items that are lazy loaded and/or not always possible to test until it runs for real. Still, we can at a minimum ensure it is parsed properly and does not throw any obvious configuration exceptions as a separate command-line action. This can be done and I am making it a feature request.

liar666 commented 8 years ago

Of course, syntactic verification does not guarantee that the code works perfectly (I've just verified that by setting a wrong CSS selector), but it helps spot problems. That's also the case with apache2ctl configtest.

I know that we can get the syntax errors when launching with "-a start", but if the file is valid, the crawler is actually started, which is not wanted in the case I was describing (just checking the validity of a minimal config file).

I do not have a specific case, but as a beginner, I like to work incrementally: e.g. write a bit of code, verify it is (almost) OK, then iterate with another piece of code (e.g. write the Crawling config part, "verify", then write the Importing part, "verify", then write the "Committing" part), instead of writing a full and complex config file in one shot and only then testing it.

Yes, just ensuring a minimum proper parsing would be OK for me, as a beginner. Even running exactly like it is today with "-a start" (trying to load classes, etc), but stopping right before the first step (Crawling), would be OK for me. It would even help debug/detect CLASSPATH problems.
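To make the request concrete, here is a minimal sketch of such a "dry run" using only JDK classes. It is not the Norconex implementation. It assumes that implementations are referenced through a `class` attribute on XML elements, and the names `ConfigDryRun` and `check` are hypothetical. It parses the XML, then tries to resolve each referenced class without running anything.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Norconex.
final class ConfigDryRun {

    /** Returns a list of problems found; an empty list means nothing obvious is wrong. */
    static List<String> check(String xml) {
        List<String> problems = new ArrayList<>();
        Document doc;
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            doc = builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            problems.add("Cannot parse XML: " + e.getMessage());
            return problems;
        }
        // Assumption: implementations are declared as class="fully.qualified.Name".
        NodeList elements = doc.getElementsByTagName("*");
        for (int i = 0; i < elements.getLength(); i++) {
            Element el = (Element) elements.item(i);
            String className = el.getAttribute("class");
            if (className.isEmpty()) {
                continue;
            }
            try {
                // Resolve the class on the classpath without initializing it.
                Class.forName(className, false, ConfigDryRun.class.getClassLoader());
            } catch (ClassNotFoundException | LinkageError e) {
                problems.add("<" + el.getTagName() + ">: cannot load class " + className);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        String xml = "<httpcollector>"
                + "<committer class=\"com.example.MissingCommitter\"/>"
                + "</httpcollector>";
        check(xml).forEach(System.out::println);
    }
}
```

A check like this catches malformed XML and CLASSPATH mistakes, but it says nothing about whether option values (such as a CSS selector) are meaningful.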

essiembre commented 8 years ago

The latest snapshot version now has this feature.

You can use -a checkcfg and it will load your config without doing anything with it (throwing errors when something is obviously wrong).

Please confirm.

liar666 commented 8 years ago

Hi,

Just a quick note to tell you that this new option just saved my [...] today!

Indeed, I did some refactoring/cleaning in my code and moved my committer class into a new directory/package. As a result, the FQN of the class I had put in the <committer> tag was now wrong. The new /checkcfg/ option gave me a hint about the problem, telling me it could not find the class, whereas the /start/ option launched code that did nothing... Saved me a lot of time (and hair!) :)

Unfortunately, I also left an unclosed XML comment that was not detected. If you leave only an opening comment "<!--", there is a good chance that even a /start/ will raise an error as soon as it encounters a "--" (e.g. when opening another comment). However, leaving only a closing "-->" might not raise anything, neither at /checkcfg/ nor at /start/. Fortunately, my editor's coloring saved my [...] here...
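The asymmetry between the two comment mistakes follows from XML well-formedness rules and can be reproduced with the JDK parser alone. The sketch below does not involve Norconex at all. The class name `CommentPitfall` and the sample element names are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of Norconex.
final class CommentPitfall {

    /** Returns true if the JDK XML parser accepts the document as well-formed. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the document
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Opening comment never closed before a second "<!--":
        // "--" is not allowed inside a comment, so parsing fails.
        String strayOpen =
                "<crawler><!-- disabled <filter/> <!-- note --></crawler>";
        // Closing marker with no opening: "-->" is just character data.
        String strayClose = "<crawler><filter/> --></crawler>";

        System.out.println("stray <!-- accepted: " + isWellFormed(strayOpen));
        System.out.println("stray --> accepted: " + isWellFormed(strayClose));
    }
}
```

Since a lone "-->" is legal text content, a well-formedness check cannot flag it. Catching it would need a stricter check, for example one that rejects unexpected text inside elements.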

essiembre commented 8 years ago

Glad to hear that new feature worked for you.

I'll close, and we can rely on https://github.com/Norconex/importer/issues/27 for possible updates to config validation.

sylvainroussy commented 7 years ago

Hi Pascal! Is there a way to check the configuration (HttpCollectorConfig) in code?

essiembre commented 7 years ago

Yes, have a look at the code for testValidation method in HttpCollectorConfigTest.

You'll see you can get a count of the errors. If you want to do something with the errors (e.g. store them somewhere), you can add your own ErrorHandler to the log4j appender.
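For readers without the test source at hand, the general idea of counting and storing validation errors can be sketched with JDK classes only. This is not the code of testValidation and it does not use the log4j ErrorHandler. It assumes schema-style validation, uses the JDK's org.xml.sax.ErrorHandler, and the schema, element names, and class name `ValidationErrorCollector` are invented for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical collector, not part of Norconex or log4j.
final class ValidationErrorCollector implements ErrorHandler {
    final List<String> errors = new ArrayList<>();

    @Override public void warning(SAXParseException e) { /* ignored in this sketch */ }

    // Recoverable validation errors are stored so validation can continue.
    @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }

    // Fatal errors are stored, then rethrown to stop processing.
    @Override public void fatalError(SAXParseException e) throws SAXParseException {
        errors.add(e.getMessage());
        throw e;
    }

    /** Validates xml against xsd and returns every error message collected. */
    static List<String> validate(String xsd, String xml) {
        ValidationErrorCollector collector = new ValidationErrorCollector();
        try {
            Schema schema = SchemaFactory
                    .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(xsd)));
            Validator validator = schema.newValidator();
            validator.setErrorHandler(collector);
            validator.validate(new StreamSource(new StringReader(xml)));
        } catch (SAXParseException e) {
            // Already recorded by fatalError above.
        } catch (SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return collector.errors;
    }

    // Invented schema: a <crawler id="..."> with an integer <maxDepth>.
    static final String XSD =
            "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
            + "<xs:element name=\"crawler\"><xs:complexType><xs:sequence>"
            + "<xs:element name=\"maxDepth\" type=\"xs:int\"/>"
            + "</xs:sequence>"
            + "<xs:attribute name=\"id\" type=\"xs:string\" use=\"required\"/>"
            + "</xs:complexType></xs:element></xs:schema>";

    public static void main(String[] args) {
        List<String> errors =
                validate(XSD, "<crawler><maxDepth>abc</maxDepth></crawler>");
        System.out.println(errors.size() + " error(s)");
        errors.forEach(System.out::println);
    }
}
```

Because the handler stores messages instead of throwing on the first recoverable error, the caller gets both a count and the details, and can log or persist them as needed.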

sylvainroussy commented 7 years ago

Ok, saw it, thanks.