Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

JEF Monitor and new HTTP Norconex Crawler (3.0.*) #801

Closed evaso closed 1 year ago

evaso commented 2 years ago

We use Norconex JEF Monitor (4.0.6-SNAPSHOT) together with the Norconex HTTP crawler (version 2.9) and are very happy with it. We are now installing Norconex version 3.0.1 on our systems and have found that the corresponding log files (*.index) under /output./progres/latest/, which are used for monitoring, are no longer generated. Since JEF Monitor is a very important tool for us for monitoring crawling processes, I would like to ask whether these log files can still be generated in the new version. If so, what adjustments would be necessary? If not, what alternatives do we have? Thanks in advance.

essiembre commented 1 year ago

Hello @evaso, it looks like your post fell through the cracks.

Logs are now written to STDOUT by default. More and more people run the crawler in containers, where STDOUT is the favored approach. To work best with JEF Monitor, we had to impose a strict format, which meant extending Log4J programmatically. In the end, that made it more difficult for many people to deal with the logs in their own way or to replace the logger implementation. V3 relies on SLF4J (shipping with a Log4J2 implementation, which one can replace).

You can have it generate logs to the file system again by using a file-based appender. To do so, modify the log4j2.xml file that should be present in your install directory. To control the precise location and format of the logs (trying to reach compatibility), have a look at the Log4J2 documentation (https://logging.apache.org/log4j/2.x/manual/configuration.html). I am not sure this will be enough, though, as JEF Monitor also relied on *.job files that may not have an equivalent.

There are all-purpose monitoring tools out there. Some focus on indexing/monitoring the logs, others monitor the processes, etc. The crawler has JMX support built in, so one option is to track progress via Prometheus, using a JMX exporter agent that you attach to the crawler (https://github.com/prometheus/jmx_exporter). To enable JMX support in the crawler, you would pass the following as a Java System Property: -DenableJMX=true
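
To get a feel for what a JMX-based monitor sees, here is a minimal sketch using only the standard `javax.management` API. It lists the MBeans matching a name pattern. The class name `JmxPeek` and the helper `listMBeans` are made up for illustration, and the default pattern `java.lang:*` only shows the JVM's own beans; the object names the crawler registers are not stated in this thread, so you would have to discover them (for example with the broad pattern `*:*`). The demo inspects its own JVM; for a running crawler you would presumably obtain the `MBeanServerConnection` remotely through `JMXConnectorFactory`, which assumes the JVM was started with remote JMX enabled.

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import java.util.TreeSet;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

public class JmxPeek {

    /** Returns the sorted canonical names of all MBeans matching the pattern. */
    static Set<String> listMBeans(MBeanServerConnection conn, String pattern)
            throws Exception {
        Set<String> names = new TreeSet<>();
        // A null query expression means "no extra filter beyond the name pattern".
        for (ObjectName name : conn.queryNames(new ObjectName(pattern), null)) {
            names.add(name.getCanonicalName());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Demo only: inspect this JVM's own platform MBean server.
        // The crawler's MBean domain is an unknown here, hence the generic default.
        MBeanServerConnection conn = ManagementFactory.getPlatformMBeanServer();
        String pattern = args.length > 0 ? args[0] : "java.lang:*";
        listMBeans(conn, pattern).forEach(System.out::println);
    }
}
```

Once the crawler's bean names are known, the same pattern can be narrowed to just those beans, which is also the kind of information a JMX exporter configuration would need.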

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

evaso commented 1 year ago

Thanks for the info, Pascal. We're not yet ready to try the recommended configuration, but I'll post feedback as soon as we do. Thanks again.
