Context
The mechanism that loads crawler configurations runs a validator over the provided XML and warns the developer about syntax errors. Unfortunately, these warnings lack any context or location information, so in a large project it is tedious to find the exact cause of a warning.
Ideal solution
It would be great if the warning messages could be qualified by a line number, file name, or even just the ID of the crawler config in which they occur.
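To illustrate what such a qualified message could look like, here is a minimal sketch using plain JAXP rather than the actual Norconex loader. The class `LocatedXmlWarnings`, its `validate` method, and the file name `crawler-a.xml` are hypothetical names chosen for this example. It relies only on the fact that `SAXParseException` carries a system ID and a line number, which a custom `ErrorHandler` can include in its output.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical helper, not part of Norconex: shows how parser messages
// can be qualified with a file name and a line number.
final class LocatedXmlWarnings {

    /** Collects parser messages, each prefixed with system ID and line number. */
    static final class CollectingHandler implements ErrorHandler {
        final List<String> messages = new ArrayList<>();

        private void record(String level, SAXParseException e) {
            messages.add(level + " " + e.getSystemId() + ":"
                    + e.getLineNumber() + ": " + e.getMessage());
        }

        @Override public void warning(SAXParseException e) { record("WARN", e); }
        @Override public void error(SAXParseException e) { record("ERROR", e); }
        @Override public void fatalError(SAXParseException e) throws SAXParseException {
            record("FATAL", e);
            throw e; // a fatal error ends parsing
        }
    }

    /** Parses the XML and returns all messages reported by the parser. */
    static List<String> validate(String xml, String fileName) {
        CollectingHandler handler = new CollectingHandler();
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(handler);
            InputSource source = new InputSource(new StringReader(xml));
            source.setSystemId(fileName); // lets the parser report the origin
            builder.parse(source);
        } catch (Exception e) {
            // Already recorded by the handler; nothing more to do in this sketch.
        }
        return handler.messages;
    }
}
```

Calling `LocatedXmlWarnings.validate(xml, "crawler-a.xml")` on a document with an unclosed element yields a message that names both the file and the offending line. Note that the parser may expand the system ID to an absolute URI, so the prefix can be longer than the bare file name.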
Suggestion
My workaround so far is to raise the Norconex log level to DEBUG so that the message "Crawler configuration loaded: x" appears right after the warnings, which helps me locate the issue. However, the DEBUG level prints far too much noise.
(1). My first suggestion is to print the mentioned "Crawler configuration loaded: x" message (CrawlerConfigLoader.java, line 83) at the INFO level, as I find this information much more important than the other DEBUG messages.
(2). Furthermore, I suggest changing the following log messages of AbstractCrawlerConfig and the collector-http module to the DEBUG level, as they appear less important than the loading of an entire crawler configuration:
Link extractor loaded: x
HTTP document processor loaded: x
Start URLs provider loaded: x
Crawler event listener loaded: x
Reference filter loaded: x
Document filter loaded: x
Matadata filter loaded: x
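The effect of suggestions (1) and (2) together can be sketched as follows. This is only an illustration using `java.util.logging` as a stand-in for the logging framework Norconex actually uses, with `FINE` playing the role of DEBUG. The class `ProposedLogLevels` and the logger name are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Hypothetical demo, not Norconex code: component messages are logged at
// FINE (standing in for DEBUG), the crawler-level message at INFO.
final class ProposedLogLevels {

    /** Emits the messages at the proposed levels and returns those that pass the threshold. */
    static List<String> loadAndCapture(Level threshold) {
        Logger log = Logger.getLogger("ProposedLogLevels.demo");
        log.setUseParentHandlers(false); // keep the demo output off the console
        log.setLevel(threshold);

        List<String> seen = new ArrayList<>();
        Handler capture = new Handler() {
            @Override public void publish(LogRecord r) {
                if (isLoggable(r)) {
                    seen.add(r.getMessage());
                }
            }
            @Override public void flush() { }
            @Override public void close() { }
        };
        capture.setLevel(Level.ALL);
        log.addHandler(capture);
        try {
            // Suggestion (2): per-component messages move down to DEBUG.
            log.fine("Link extractor loaded: x");
            log.fine("Reference filter loaded: x");
            // Suggestion (1): the crawler-level message moves up to INFO.
            log.info("Crawler configuration loaded: x");
        } finally {
            log.removeHandler(capture);
        }
        return seen;
    }
}
```

With the threshold at `Level.INFO`, only "Crawler configuration loaded: x" is captured, which is enough to tell which crawler config the preceding warnings belong to. With `Level.FINE`, all three messages appear.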
If we agree on a solution, I can submit a PR containing the relevant changes.