Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Log4j configuration issue #449

Closed ronjakoi closed 4 years ago

ronjakoi commented 6 years ago

Apologies for commenting on a closed issue earlier.

If I try changing the line in log4j.properties to log4j.rootLogger=INFO, FILE_ONLY and run the command from the command line, the Collector doesn't output any logs to the terminal, but I still get this warning:

log4j:WARN No appenders could be found for logger (org.apache.http.client.protocol.ResponseProcessCookies).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

This is quite strange, because clearly the properties file is being read properly, as the configuration change is taking effect (no output to terminal).

Of course I could also send stderr to /dev/null from my cronjob, but being able to receive errors and warnings by mail would be nice. Assuming warnings and errors other than this log4j issue are printed to stderr and further mailed along by cron, that is.

essiembre commented 6 years ago

I cannot reproduce. These lines are not printed for me. It looks as if the log4j.properties file is not loaded properly, or maybe there is no appender for org.apache.* in your file. Can you attach your full log4j.properties?

What if you hardcode the path to the log4j file in the collector-http.sh script? Does it make a difference?

Logging will be reworked at some point in a future release to give more flexibility to people integrating the Collector in their own solution. It may help with such issues as well. This can be tracked in https://github.com/Norconex/collector-http/issues/401 and https://github.com/Norconex/jef/issues/6.

ronjakoi commented 6 years ago

Hardcoding the path in the script made no difference. If I set the path to something non-existent, I get this:

log4j:ERROR Could not read configuration file from URL [file:/foo/log4j.properties].
java.io.FileNotFoundException: /foo/log4j.properties (No such file or directory)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at java.io.FileInputStream.<init>(FileInputStream.java:93)
    at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
    at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
    at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:557)
    at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
    at org.apache.log4j.LogManager.<clinit>(LogManager.java:127)
    at com.norconex.collector.core.AbstractCollector.<clinit>(AbstractCollector.java:58)
log4j:ERROR Ignoring configuration file [file:/foo/log4j.properties].
log4j:WARN No appenders could be found for logger (com.norconex.collector.core.CollectorConfigLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

Here is my complete log4j.properties:

#------------------------------------------------------------------------------
# Logging Level
#------------------------------------------------------------------------------

# Set level of information printed in log file/console
# (DEBUG > INFO > WARN > ERROR > FATAL)
# By default, use INFO
log4j.rootLogger=INFO, FILE_ONLY

# Other loggers (override above default setting)

# Default loggers for the collector:
log4j.logger.com.norconex.collector.http=INFO
log4j.logger.com.norconex.collector.core=INFO
log4j.logger.com.norconex.importer=INFO
log4j.logger.com.norconex.committer=INFO

# The following are CrawlerEvent types normally logged as INFO by a crawler.
# To disable the logging of certain event types, set their log level to 
# something higher than INFO (i.e., WARN or ERROR).
# To log additional information for an event type, set its log level to 
# something lower than INFO (e.g., DEBUG). This list is 
# non-exhaustive as some crawlers may add more:
log4j.logger.CrawlerEvent.CRAWLER_STARTED=INFO
log4j.logger.CrawlerEvent.CRAWLER_RESUMED=INFO
log4j.logger.CrawlerEvent.CRAWLER_FINISHED=INFO
log4j.logger.CrawlerEvent.REJECTED_DUPLICATE=INFO
log4j.logger.CrawlerEvent.REJECTED_FILTER=DEBUG
log4j.logger.CrawlerEvent.REJECTED_UNMODIFIED=INFO
log4j.logger.CrawlerEvent.REJECTED_NOTFOUND=INFO
log4j.logger.CrawlerEvent.REJECTED_BAD_STATUS=DEBUG
log4j.logger.CrawlerEvent.REJECTED_IMPORT=DEBUG
log4j.logger.CrawlerEvent.REJECTED_ERROR=DEBUG
log4j.logger.CrawlerEvent.DOCUMENT_PREIMPORTED=INFO
log4j.logger.CrawlerEvent.DOCUMENT_POSTIMPORTED=INFO
log4j.logger.CrawlerEvent.DOCUMENT_COMMITTED_ADD=INFO
log4j.logger.CrawlerEvent.DOCUMENT_COMMITTED_REMOVED=INFO
log4j.logger.CrawlerEvent.DOCUMENT_IMPORTED=INFO
log4j.logger.CrawlerEvent.DOCUMENT_METADATA_FETCHED=INFO
log4j.logger.CrawlerEvent.DOCUMENT_FETCHED=INFO
log4j.logger.CrawlerEvent.DOCUMENT_SAVED=INFO

log4j.logger.CrawlerEvent.REJECTED_ROBOTS_TXT=DEBUG
log4j.logger.CrawlerEvent.CREATED_ROBOTS_META=INFO
log4j.logger.CrawlerEvent.REJECTED_ROBOTS_META_NOINDEX=INFO
log4j.logger.CrawlerEvent.REJECTED_TOO_DEEP=INFO
log4j.logger.CrawlerEvent.REJECTED_CANONICAL=DEBUG
log4j.logger.CrawlerEvent.REJECTED_REDIRECTED=DEBUG
log4j.logger.CrawlerEvent.URLS_EXTRACTED=INFO

log4j.logger.org.apache=WARN
log4j.additivity.org.apache=false
#log4j.category.org.apache.velocity=WARN

# These loggers silence non-impacting errors:
log4j.logger.org.apache.pdfbox=ERROR
log4j.logger.org.apache.pdfbox.util.operator.SetTextFont=FATAL

#------------------------------------------------------------------------------
# APPENDER: FILE_ONLY
#------------------------------------------------------------------------------
# The collector programmatically adds a file appender.  To only use that file,
# specify this "FILE_ONLY" appender as the "log4j.rootLogger" instead of 
# the default "CONSOLE" value and it will ignore the console.
#
log4j.appender.FILE_ONLY=org.apache.log4j.varia.NullAppender

#------------------------------------------------------------------------------
# APPENDER: CONSOLE
#------------------------------------------------------------------------------
# Setup and adjust format for logging to console
# (Format example: "DEBUG [Class.method]: Here is the msg. "
# This is then followed by a stack trace, if an Exception was provided)
# NOTE: Using %M can be slow - it should only be used for debugging
log4j.appender.CONSOLE=org.apache.log4j.ConsoleAppender
log4j.appender.CONSOLE.layout=org.apache.log4j.PatternLayout
#log4j.appender.CONSOLE.layout.ConversionPattern=%-5p [%C{1}.%M] %m%n
log4j.appender.CONSOLE.layout.ConversionPattern=%-5p [%C{1}] %m%n

essiembre commented 6 years ago

Sorry, I still can't reproduce this. I am afraid you'll have to live with that warning until the logging mechanism changes.

One last idea: try moving the log4j.properties file into the "classes" folder (which is on the classpath when you use the launch script).

Krishna210414 commented 5 years ago

Hi, can I prevent the collector from creating one big log file? It seems that when I add the file property, it does not create multiple log files. Can someone please share their configuration if they have already done this?

Also, I need to change the name of the log file.

essiembre commented 5 years ago

To reduce the log size, you can change the log levels in log4j.properties. The log file is backed up automatically at the beginning of the next run.
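As a rough illustration of changing the levels, the fragment below raises some of the loggers from the log4j.properties posted earlier in this thread from INFO to WARN. The logger names come from that file. Which event types actually produce the most output in a given crawl is an assumption here, so adjust the list to your own logs.

```properties
# Keep only warnings and errors from the collector components.
log4j.logger.com.norconex.collector.http=WARN
log4j.logger.com.norconex.collector.core=WARN
log4j.logger.com.norconex.importer=WARN
log4j.logger.com.norconex.committer=WARN

# Assumption: these per-document events are the noisiest ones.
# Setting them above INFO disables their logging.
log4j.logger.CrawlerEvent.DOCUMENT_FETCHED=WARN
log4j.logger.CrawlerEvent.DOCUMENT_IMPORTED=WARN
log4j.logger.CrawlerEvent.DOCUMENT_COMMITTED_ADD=WARN
log4j.logger.CrawlerEvent.URLS_EXTRACTED=WARN
```

These lines replace the corresponding INFO entries; the rootLogger line and the appender definitions stay as they are.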

The next major version (3.x) will give you more flexibility with logging (relying on SLF4J). Until then, I would recommend modifying the launch script to handle the log file(s) as you want between runs.

essiembre commented 5 years ago

You can now rely 100% on your own log4j configuration with the latest snapshot release. Have a look at: https://github.com/Norconex/collector-http/issues/593#issuecomment-485035899 to find out how.
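For the earlier question about multiple log files and a custom file name, a self-managed log4j 1.2 configuration could look roughly like the sketch below. It assumes the collector has been set up as described in the linked comment so that your own configuration is fully honored. The appender name ROLLING, the file path, and the size limits are placeholders, not values taken from this thread.

```properties
# Send everything at INFO and above to a size-limited, rotating file.
log4j.rootLogger=INFO, ROLLING

# RollingFileAppender rotates the file once it reaches MaxFileSize
# and keeps at most MaxBackupIndex older files (.1, .2, ...).
log4j.appender.ROLLING=org.apache.log4j.RollingFileAppender
# Placeholder path: choose your own location and file name.
log4j.appender.ROLLING.File=/path/to/logs/my-crawler.log
log4j.appender.ROLLING.MaxFileSize=10MB
log4j.appender.ROLLING.MaxBackupIndex=5
log4j.appender.ROLLING.layout=org.apache.log4j.PatternLayout
log4j.appender.ROLLING.layout.ConversionPattern=%d %-5p [%C{1}] %m%n
```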

jetnet commented 3 years ago

This helps for Collector 2.9:

log4j.logger.org.apache.http.client.protocol.ResponseProcessCookies=FATAL
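This presumably works because of the two org.apache lines in the log4j.properties posted earlier in this thread. log4j.additivity.org.apache=false stops org.apache.* loggers from passing events up to the root logger's appenders, while log4j.logger.org.apache=WARN attaches no appender of its own. In log4j 1.2, an event that reaches no appender at all triggers the "No appenders could be found" warning. Raising ResponseProcessCookies to FATAL drops its WARN messages before they are dispatched, so that situation never arises for this logger. As an untested sketch of an alternative, assuming that file is otherwise unchanged, giving org.apache an appender should avoid the warning for every org.apache.* logger rather than just this one:

```properties
# Assumption: FILE_ONLY is the NullAppender defined in the file above.
# Attaching it gives org.apache.* events an appender to land on, so
# log4j no longer reports that none could be found.
log4j.logger.org.apache=WARN, FILE_ONLY
log4j.additivity.org.apache=false
```

With the NullAppender those messages are still discarded. Attach a real appender instead if you want to keep them.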