Closed liar666 closed 8 years ago
What do you mean by "crashing your production website"? Your configuration may be "valid" but still cause issues for your site if, for instance, you configure it to crawl too aggressively.
Often you will see errors generated upon startup if you have major configuration issues. Do you have a problematic example that a separate validation step would have prevented? Maybe such an issue should be fixed in the code.
Because people can create their own class implementations and have their own configs for it, it is not possible to validate if all options are what they should be. Also, sometimes there might be items that are lazy loaded and/or not always possible to test until it runs for real. Still, we can at a minimum ensure it is parsed properly and does not throw any obvious configuration exceptions as a separate command-line action. This can be done and I am making it a feature request.
Of course, syntactic verification does not guarantee that the configuration works perfectly (I've just verified that by setting a wrong CSS selector), but it helps spot problems. That is also the case with the apache2ctl tool.
I know that we can get the syntax errors when launching with "-a start", but if the file is valid, the crawler is actually started, which is not wanted in the case I described (just checking that a minimal config file is valid).
I do not have a specific case, but as a beginner, I like to work incrementally: e.g. write a bit of code, verify it is (almost) OK, then iterate with another piece of code (e.g. write the Crawling config part, "verify", then write the Importing part, "verify", then write the Committing part), instead of writing a full and complex config file in one shot and only then testing it.
Yes, just ensuring a minimum proper parsing would be OK for me, as a beginner. Even running exactly like it is today with "-a start" (trying to load classes, etc), but stopping right before the first step (Crawling), would be OK for me. It would even help debug/detect CLASSPATH problems.
The latest snapshot version now has this feature.
You can use -a checkcfg and it will load your config without doing anything with it (throwing errors when something is obviously wrong).
Please confirm.
Hi,
Just a quick note to tell you that this new option just saved my [...] today!
Indeed, I did some refactoring/cleaning in my code and moved my committer class into a new directory/package. As a result, the FQN of the class I had put in the <committer> tag was now wrong. The new checkcfg option gave me a hint about the problem, telling me it could not find the class, whereas the start option launched code that did nothing...
Saved me a lot of time (and hair!) :)
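For readers who want to see this kind of "class not found" check in isolation, here is a minimal sketch in plain Java. It is not the collector's actual implementation: the ClassNameCheck helper and the com.example.old.MyCommitter name are hypothetical, and it only assumes that a class name taken from a config file can be tested with the standard Class.forName call.

```java
public class ClassNameCheck {

    /**
     * Returns true if the given fully qualified class name can be located
     * on the classpath of this class's class loader.
     */
    public static boolean isLoadable(String fqcn) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(fqcn, false, ClassNameCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Wrong package after a refactoring, missing jar, broken dependency, etc.
            return false;
        }
    }

    public static void main(String[] args) {
        // A JDK class that is always present
        System.out.println(isLoadable("java.util.ArrayList"));          // prints "true"
        // Hypothetical stale name left behind in a config file after moving a class
        System.out.println(isLoadable("com.example.old.MyCommitter"));  // prints "false"
    }
}
```

Running such a check on every class name in a config before starting anything is also a cheap way to surface CLASSPATH problems early.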
Unfortunately, I also left an unclosed XML comment that was not detected. If you leave only an opening comment "<!--", chances are that even a start will raise an error as soon as it encounters a "--" (e.g. when opening another comment). However, leaving only a closing "-->" might not raise anything, neither at checkcfg nor at start. Fortunately, my editor's coloring saved my [...] here...
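The asymmetry described above matches how XML itself works: "--" is not allowed inside a comment, while a lone "-->" in element content is just ordinary text. The sketch below illustrates this with the JDK's built-in parser only; it is not how the collector loads its configuration, and the XmlWellFormedCheck class and the sample <a> documents are made up for the illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class XmlWellFormedCheck {

    /** Returns true if the string parses as well-formed XML (no schema validation). */
    public static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The parser reports well-formedness problems as SAXException
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Properly closed comment
        System.out.println(isWellFormed("<a><!-- note --></a>"));             // prints "true"
        // Unclosed comment followed by another comment: "--" inside a comment is an error
        System.out.println(isWellFormed("<a><!-- open <!-- nested --></a>")); // prints "false"
        // Stray closing "-->" only: treated as plain text, so nothing is reported
        System.out.println(isWellFormed("<a>text --></a>"));                  // prints "true"
    }
}
```

So a stray "-->" cannot be caught by a well-formedness check alone; an editor with syntax coloring, or a check that the resulting text content is what you expect, is still needed for that case.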
Glad to hear that new feature worked for you.
I'll close, and we can rely on https://github.com/Norconex/importer/issues/27 for possible updates to config validation.
Hi Pascal! Is there a way to check the configuration (HttpCollectorConfig) in code?
Yes, have a look at the code for the testValidation method in HttpCollectorConfigTest.
You'll see you can get a count of the errors. If you want to do something with the errors (e.g. store them somewhere), you can add your own ErrorHandler to the log4j appender.
Ok, saw it, thanks.
It would be great to be able to run an action like:
./collector-http.sh -a checkcfg -c test.xml
or -a test, -a dry-run, -a check, or -a configtest.
That would just (syntactically) validate the configuration file before running an erroneous crawler, in the same way that you run apache2ctl configtest before crashing your production website :)