Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Log manager misconfiguration #62

Closed AntonioAmore closed 9 years ago

AntonioAmore commented 9 years ago

I have a configuration with these lines:

 #set($workdir = "/home/crawlers/www.somesite.com/workdir")
<logsDir>${workdir}/logs</logsDir>

When I launch the crawler for the first time (i.e., no workdir has been created yet), it works perfectly, but on the second launch (-a start) it produces the following error:

com.norconex.commons.lang.config.ConfigurationException: This class could not be instantiated: "com.norconex.jef4.log.FileLogManager".
    at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:177)
    at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:321)
    at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:223)
    at com.norconex.jef4.status.JobSuiteStatusSnapshot.newSnapshot(JobSuiteStatusSnapshot.java:179)
    at com.norconex.jef4.suite.JobSuite.getJobStatus(JobSuite.java:154)
    at com.norconex.jef4.suite.JobSuite.getJobStatus(JobSuite.java:146)
    at com.norconex.jef4.suite.JobSuite.getStatus(JobSuite.java:139)
    at com.norconex.collector.core.AbstractCollector.stop(AbstractCollector.java:131)
    at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:73)
    at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:75)
Caused by: com.norconex.jef4.JEFException: Cannot create log directory: /var/www/Norconex/jef/workdir/latest/logs
    at com.norconex.jef4.log.FileLogManager.resolveDirs(FileLogManager.java:114)
    at com.norconex.jef4.log.FileLogManager.<init>(FileLogManager.java:85)
    at com.norconex.jef4.log.FileLogManager.<init>(FileLogManager.java:74)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
    at java.lang.Class.newInstance(Class.java:433)
    at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:175)
    ... 9 more
Caused by: java.io.IOException: Unable to create directory /var/www/Norconex/jef/workdir/latest/logs
    at org.apache.commons.io.FileUtils.forceMkdir(FileUtils.java:2384)
    at com.norconex.jef4.log.FileLogManager.resolveDirs(FileLogManager.java:112)
    ... 17 more

Why does it try to create the Norconex/jef directory?

essiembre commented 9 years ago

Because the collectors use Norconex JEF (Job Execution Framework). That's the directory it uses to store logs and/or progress.

I see you have set it to something other than /var/www/Norconex/. I suspect your <logsDir> value is not being considered. Where did you put it in your config?

The progress and log directories are not specific to each crawler, but to the whole collector. Can you paste your whole config? See here to find out where they belong.
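For reference, the collector-level placement looks roughly like this (the element names match those used elsewhere in this thread; the id values and paths are placeholders):

```xml
<!-- Sketch: progressDir/logsDir are direct children of <httpcollector>,
     not of an individual <crawler>. Paths and ids are placeholders. -->
<httpcollector id="my-collector">
  <progressDir>/some/workdir/progress</progressDir>
  <logsDir>/some/workdir/logs</logsDir>
  <crawlers>
    <crawler id="my-crawler">
      <!-- crawler-specific settings go here -->
    </crawler>
  </crawlers>
</httpcollector>
```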

AntonioAmore commented 9 years ago
<httpcollector id="${machinereadablename}">

  #set($http = "com.norconex.collector.http")
  #set($core = "com.norconex.collector.core")
  #set($urlNormalizer   = "${http}.url.impl.GenericURLNormalizer")

  #set($workdir = "${crawlerdir}/workdir")
  #set($configdir = "${crawlerdir}/config")

  <progressDir>${workdir}/progress</progressDir>
  <logsDir>${workdir}/logs</logsDir>

  <crawlers>
...

It seems the parameter is in place. Why does the error appear only on the 2nd run of the crawler, when ${workdir} already exists?

AntonioAmore commented 9 years ago

I found that it tries to create the directory {user home}/Norconex/jef/workdir/

essiembre commented 9 years ago

Not sure why you get this on the second run only. This is the first time something like this has been reported. Just to make sure:

JEF only falls back to the user home directory when no path has been specified. So for some reason the path is not being set.

AntonioAmore commented 9 years ago

Could JEF have problems when trying to clean or back up previous session files, and fall back to the default path because of an exception?

essiembre commented 9 years ago

Did you change some config values between the two runs? What if you clean your working directory and try again? If you attach your entire config, maybe I can try to reproduce it.

essiembre commented 9 years ago

Do you still have this problem? Can this issue be closed? FYI, Norconex HTTP Collector 2.1.0 was released. If you still have the issue, try with that version and confirm.

AntonioAmore commented 9 years ago

Since v2.1.0 has been released, I think it is better for me to switch to that version and test it. I'll provide feedback shortly. If the problem is gone with the new release, we can close the issue.

AntonioAmore commented 9 years ago

I have the same issue with the recent 2.2.0 snapshot: when I delete the workdir, it runs perfectly; otherwise I get messages like:

INFO  [JobSuite] JEF log manager is : FileLogManager
INFO  [JobSuite] JEF job status store is : FileJobStatusStore
INFO  [AbstractCollector] Suite of 1 crawler jobs created.
INFO  [JobSuite] Initialization...
FATAL [JobSuite] Job suite execution failed: www.site.com
com.norconex.commons.lang.config.ConfigurationException: This class could not be instantiated: "com.norconex.jef4.log.FileLogManager".
 at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:181)
 at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:329)
 at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:228)
...

It looks similar to the subject of this issue, so it seems to be a bug.

essiembre commented 9 years ago

When you delete the folder manually, are you using the same user account as the one used to run the collector? In other words, does the user running the collector have the right permissions to create new directories (as per the first exception reported: Caused by: java.io.IOException: Unable to create directory /var/www/Norconex/jef/workdir/latest/logs)?
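One quick way to answer that permission question is a small probe run as the same user that runs the collector; this sketch mirrors the forceMkdir call from the stack trace (the example path is taken from the exception above and should be adjusted to your machine):

```shell
# Probe whether the current user can create a given directory tree,
# mirroring what JEF's FileLogManager does when it resolves its dirs.
can_create() {
    mkdir -p "$1" 2>/dev/null && [ -w "$1" ]
}

# Example path from the reported exception; adjust as needed.
if can_create "/var/www/Norconex/jef/workdir/latest/logs"; then
    echo "directory is creatable by $(id -un)"
else
    echo "directory is NOT creatable by $(id -un)"
fi
```

Running this as the collector's user (e.g. via `sudo -u crawler sh probe.sh`) tells you whether the IOException is a plain permissions problem.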

I have not been able to reproduce it yet. What is your operating system?

AntonioAmore commented 9 years ago

I use Ubuntu Linux.

And why does it look for /var/www/Norconex? I don't use that path in any config and never planned to; it looks like a default value, hardcoded in the sources and used when the workdir already exists.

essiembre commented 9 years ago

When no working directory is specified, it uses the user home directory + Norconex. In your case, since you have given it a working directory, it should not write there. Please paste your entire config to help reproduce (or email it if sensitive).

essiembre commented 9 years ago

Assuming you did not change its default log level, if you look at the generated logs, near the beginning, do you see an INFO statement that looks like this:

INFO  [AbstractCollectorConfig] Configuration loaded: id=yourId; logsDir=/crawlerdir/workdir/logs; progressDir=/crawlerdir/workdir/progress

What is detected for the logsDir? It should tell you whether it is loaded properly from your config in the first place (unless the error you are getting occurs before this gets printed).

I doubt this could be it, but are you using a variable file? If so, is it one you are explicitly passing as an argument to the collector startup shell script? If so, do you always pass it?

martinfou commented 9 years ago

I think your problem is with file permissions. Can you give my config files a try and see if they work for you? I tested them on Linux (Lubuntu 14).

file minimum-config.xml https://gist.github.com/martinfou/a08682d98efeeecce9ad

file minimum-config.variables https://gist.github.com/martinfou/b2eb5155e60e6b9f4003

AntonioAmore commented 9 years ago

I checked with your config: it runs correctly, but delivers the following warning message on any launch with any config, including the default:

log4j:WARN No appenders could be found for logger (org.apache.velocity.runtime.log.Log4JLogChute).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
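That warning means log4j found no configuration at all for the logger in question. A minimal log4j 1.2 properties file along these lines makes it go away (the layout pattern here is illustrative, not necessarily the collector's stock file):

```properties
# Minimal log4j 1.2 configuration; pattern is illustrative.
log4j.rootLogger=INFO, CONSOLE
log4j.appender.CONSOLE=org.apache.log4j.ConsoleAppender
log4j.appender.CONSOLE.layout=org.apache.log4j.PatternLayout
log4j.appender.CONSOLE.layout.ConversionPattern=%-5p [%c{1}] %m%n
```

The stock collector zip ships a log4j.properties in its /classes/ folder; the warning suggests that file is not on the classpath at startup.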

Why does it try to create the directory /home/{username}/Norconex/jef/workdir/latest/logs when the workdir in the config is set to <workDir>${workdir}</workDir> (workdir = /home/username/projects/norconex/test/)? The directory is always empty.

In my opinion, collector-http should create its service directories under ${workdir}; it is possible someone wants /home restricted from the collector, or wants to run it under another user's account.

I should also note that the ...Norconex/jef... directory is created on the second crawler launch, when ${workdir} already exists.

essiembre commented 9 years ago

How do you start the Collector? Are you using one of the launch scripts packaged in the zip you downloaded? Out of the box, it should be pointing to the log4j.properties file found in the /classes/ folder where you unpacked the collector zip file. The warnings you get suggest it is not being picked up.

AntonioAmore commented 9 years ago

Yes, I use collector-http.sh from the downloaded zip package. I have launched it with both absolute and relative paths to the .sh script and to the config, and both from the collector's directory and from a random directory.

I'll check access rights to /classes/ and comment here.

AntonioAmore commented 9 years ago

The user can read ./classes, like any other collector-http subdirectory. The workdir specified in the crawler's config is writable.

You can try to reproduce the error with the following flow:

  1. Create a user and deny them write access to their home directory.
  2. Run the collector with a custom workdir set to a writable directory outside the user's home.
  3. Note the collector's log timestamp and take a pause.
  4. Run the collector again; the log file timestamp has not changed.
  5. Let the user write to their home directory.
  6. Run the collector again.
  7. See /home/{username}/Norconex/jef/workdir/latest/logs, and a log file in the workdir with a changed timestamp.

I believe a writable workdir should be enough for a crawler to run correctly.

If you wish, I can send you my configs by email, as I want to keep them private.

essiembre commented 9 years ago

Your config would definitely help, since the behavior you describe should not happen with a config where workdir and logsDir are defined properly. You can check my github profile for my email.

essiembre commented 9 years ago

I think I was finally able to reproduce it. It seems it attempts to create the logs directory under the default location (under the user home directory) no matter what you specify in the config.

In my case it does not write any files under that user home location; the files are instead written where they should be, as per the configuration. But that extra (empty) log directory is created when it should not be.

Marking this as a bug.

essiembre commented 9 years ago

I just made a snapshot release that should no longer create directories in the default location if you have specified otherwise. Let me know if that resolves the issue for you.

AntonioAmore commented 9 years ago

Thanks a lot! I'll test the new snapshot and provide feedback in a day.

AntonioAmore commented 9 years ago

I confirm the bug is fixed. Everything works perfectly. Thank you again for your help!

essiembre commented 9 years ago

Thanks for confirming!

essiembre commented 9 years ago

Norconex HTTP Collector 2.2.0 official release is out. It includes this fix. You can download it here.

ronjakoi commented 6 years ago

Cron mails me this warning every night:

log4j:WARN No appenders could be found for logger (org.apache.pdfbox.pdmodel.font.PDCIDFontType2).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

I am running the collector from a /etc/cron.d/norconex like this:

15 23 * * Mon-Fri crawler cd /opt/norconex/collector-http/ && ./collector-http.sh -c ../intranet/daily/conf/crawler.xml -a resume > /dev/null

The logs are produced under /opt/norconex/intranet/daily/work/logs/ just fine, though. The file /opt/norconex/collector-http/log4j.properties also exists. Both this file and the shell script are untouched from the zip file, no modifications. I also can't see any file permission problems.

I am running HTTP Collector version 2.7.1.

essiembre commented 6 years ago

Strange. What if you change this log4j line:

log4j.rootLogger=INFO, CONSOLE

to

log4j.rootLogger=INFO, FILE_ONLY

Does it make a difference?
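Note that for the FILE_ONLY setting to take effect, an appender by that name must be defined in the same properties file. If your copy does not already define one, a hypothetical log4j 1.2 definition would look like this (the file path and pattern are placeholders, not necessarily the stock definition):

```properties
# Hypothetical FILE_ONLY appender; the stock collector file may define
# its own variant. Path and pattern here are placeholders.
log4j.rootLogger=INFO, FILE_ONLY
log4j.appender.FILE_ONLY=org.apache.log4j.FileAppender
log4j.appender.FILE_ONLY.File=./norconex-collector.log
log4j.appender.FILE_ONLY.layout=org.apache.log4j.PatternLayout
log4j.appender.FILE_ONLY.layout.ConversionPattern=%d %-5p [%c{1}] %m%n
```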

If not, do you also get the error when you run it directly from the command-line?

Please note that this issue was already closed. Please create a new one, or your questions may get lost.