Because the collectors use Norconex JEF (Job Execution Framework). That's the directory it uses to store logs and/or progress.
I see you have set it to something different than /var/www/Norconex/. I suspect your <logsDir> value is not being considered. Where did you put it in your config?
The progress and log directories are not specific to each crawler, but to the whole collector. Can you copy your whole config? See here to find out where they belong.
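For reference, a minimal sketch of that layout (the id values and paths here are placeholders):
<httpcollector id="myCollector">
  <!-- These two apply to the whole collector, not to individual crawlers -->
  <progressDir>/path/to/workdir/progress</progressDir>
  <logsDir>/path/to/workdir/logs</logsDir>
  <crawlers>
    <crawler id="myCrawler">
      ...
    </crawler>
  </crawlers>
</httpcollector>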
<httpcollector id="${machinereadablename}">
#set($http = "com.norconex.collector.http")
#set($core = "com.norconex.collector.core")
#set($urlNormalizer = "${http}.url.impl.GenericURLNormalizer")
#set($workdir = "${crawlerdir}/workdir")y
#set($configdir = "${crawlerdir}/config")
<progressDir>${workdir}/progress</progressDir>
<logsDir>${workdir}/logs</logsDir>
<crawlers>
...
Seems the parameter is in place. Why does the error appear only on the 2nd run of the crawler, when ${workdir} already exists?
I noticed it tries to create the dir {user home}/Norconex/jef/workdir/
Not sure why you get this on the second run only; it is the first time something like that is being reported. Just to make sure: is your ${crawlerdir} variable defined? JEF only falls back to the user home directory when no path has been specified, so for some reason the path is not being set.
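For example, if your config is mycrawler.xml, a variables file sharing its base name and sitting next to it should be picked up when the config is loaded (names and paths below are hypothetical):
# mycrawler.properties - loaded alongside mycrawler.xml
machinereadablename = mycrawler
crawlerdir = /home/username/projects/norconex/test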
${crawlerdir} is defined in a .properties file named similarly to the config; the y is just a stray character that appeared while I was arranging the example, sorry. Could JEF have problems when trying to clean or back up previous session files, and fall back to the default path because of an exception?
Did you change some config values between the two runs? What if you clean your working directory and try again? If you attach your entire config, maybe I can try to reproduce.
Do you still have this problem? Can this issue be closed? FYI, Norconex HTTP Collector 2.1.0 was released. If you still have the issue, try with that version and confirm.
As v2.1.0 is released, I think it is better for me to switch to that version and test it. I'll provide feedback shortly. If the problem is gone with the new release, we may close the issue.
I have the same issue with the recent 2.2.0 snapshot - when I delete the workdir it runs perfectly; otherwise I get messages like:
INFO [JobSuite] JEF log manager is : FileLogManager
INFO [JobSuite] JEF job status store is : FileJobStatusStore
INFO [AbstractCollector] Suite of 1 crawler jobs created.
INFO [JobSuite] Initialization...
FATAL [JobSuite] Job suite execution failed: www.site.com
com.norconex.commons.lang.config.ConfigurationException: This class could not be instantiated: "com.norconex.jef4.log.FileLogManager".
at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:181)
at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:329)
at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:228)
...
Looks similar to the subject of this issue, so it seems to be a bug.
When you delete the folder manually, are you using the same user account as the one used to run the collector? In other words, does the user running the collector have the right permissions to create new directories (as per the first exception reported: Caused by: java.io.IOException: Unable to create directory /var/www/Norconex/jef/workdir/latest/logs)?
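A quick way to check, assuming the collector runs as a dedicated account (here called crawler, a placeholder):
# Who owns the existing directories, and are they writable by that account?
ls -ld /var/www/Norconex /var/www/Norconex/jef
# Can that account create the path the exception complains about?
sudo -u crawler mkdir -p /var/www/Norconex/jef/workdir/latest/logs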
I was not able to reproduce yet. What is your operating system?
I use Ubuntu Linux.
And why does it look in /var/www/Norconex? I don't use that path in any config, and never planned to do so - it looks like a default value, hardcoded in the sources and used when the workdir already exists.
When no working directory is specified, it uses the user home directory + Norconex. In your case, since you have given it a working directory, it should not write there. Please paste your entire config to help reproduce (or email it if sensitive).
Assuming you did not change the default log level, if you look at the generated logs, near the beginning, do you see an INFO statement that looks like this:
INFO [AbstractCollectorConfig] Configuration loaded: id=yourId; logsDir=/crawlerdir/workdir/logs; progressDir=/crawlerdir/workdir/progress
What is detected for the logsDir? It should tell you whether it is loaded properly from your config in the first place (unless the error you are getting occurs before this gets printed).
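For instance, assuming the logs end up under your configured ${workdir}/logs (path below is a placeholder), something like this would surface that line:
grep -r "Configuration loaded" /path/to/workdir/logs/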
I doubt this could be it, but are you using a variable file? If so, is it one you are explicitly passing as an argument to the collector startup shell script? If so, do you always pass it?
I think your problem is with file permissions. Can you give my config files a try and see if they work for you? I tested them on Linux (Lubuntu 14).
file minimum-config.xml https://gist.github.com/martinfou/a08682d98efeeecce9ad
file minimum-config.variables https://gist.github.com/martinfou/b2eb5155e60e6b9f4003
I checked with your config - it runs correctly but delivers the following warning message on any launch with any config, including the default:
log4j:WARN No appenders could be found for logger (org.apache.velocity.runtime.log.Log4JLogChute).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Why does it try to create the directory /home/{username}/Norconex/jef/workdir/latest/logs when workdir in the config is set to <workDir>${workdir}</workDir> (workdir = /home/username/projects/norconex/test/)? The directory is always empty.
In my opinion, collector-http should create its service directories under ${workdir} - it is possible someone wants /home restricted from the collector, or to run it under another user's account.
I should note in addition that the ...Norconex/jef... directory is created on the second crawler launch - when ${workdir} already exists.
How do you start the Collector? Are you using one of the launch scripts packaged in the zip you downloaded? Out-of-the-box it should be pointing to the log4j.properties file found in the /classes/ folder where you unpacked the collector zip file. The warnings you get suggest it is not being picked up.
Yes, I use collector-http.sh from the downloaded zip package. I launch it with both absolute and relative paths to the .sh script and to the config, both from the collector's directory and from a random dir.
I'll check access rights to /classes/ and comment here.
The user can read ./classes, like any other collector-http subdirectory. The workdir described in the crawler's config is writable.
You may try to reproduce the error with the following flow: run the crawler once so ${workdir} gets created, then run it again. You will get an empty /home/{username}/Norconex/jef/workdir/latest/logs and a log file at the workdir with a changed timestamp. I believe a writable workdir is enough for a crawler to run correctly.
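In shell terms, the flow is roughly this (paths and config name are placeholders):
# First launch: ${workdir} does not exist yet, everything works
./collector-http.sh -a start -c /path/to/crawler.xml
# Second launch: ${workdir} now exists, and the stray directory appears
./collector-http.sh -a start -c /path/to/crawler.xml
ls /home/{username}/Norconex/jef/workdir/latest/logs   # empty, but should not exist at all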
If you wish, I can send you my configs by email, as I want to keep them private.
Your config would definitely help, since the behavior you describe should not happen with a config where workdir and logsDir are defined properly. You can check my GitHub profile for my email.
I think I was finally able to reproduce. It seems like it attempts to create the logs directory under the default location (under the user home directory) no matter what you specify in the config.
In my case it does not write any files under that user-home location, and the files are instead written where they should be as per the configuration, but that extra (empty) log directory is created and should not be.
Marking this as a bug.
I just made a snapshot release that should no longer create directories in the default location if you have specified otherwise. Let me know if that resolves the issue for you.
Thanks a lot! I'll test the new snapshot and provide feedback in a day.
I confirm the bug is fixed. Everything works perfectly. Thank you again for your help!
Thanks for confirming!
Norconex HTTP Collector 2.2.0 official release is out. It includes this fix. You can download it here.
Cron mails me this warning every night:
log4j:WARN No appenders could be found for logger (org.apache.pdfbox.pdmodel.font.PDCIDFontType2).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
I am running the collector from /etc/cron.d/norconex like this:
15 23 * * Mon-Fri crawler cd /opt/norconex/collector-http/ && ./collector-http.sh -c ../intranet/daily/conf/crawler.xml -a resume > /dev/null
The logs are produced under /opt/norconex/intranet/daily/work/logs/ just fine, though. The file /opt/norconex/collector-http/log4j.properties also exists. Both this file and the shell script are untouched from the zip file, no modifications. I also can't see any file permission problems.
I am running HTTP Collector version 2.7.1.
Strange. What if you change this log4j line:
log4j.rootLogger=INFO, CONSOLE
to
log4j.rootLogger=INFO, FILE_ONLY
Does it make a difference?
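In case FILE_ONLY is not already defined in your copy of log4j.properties, a minimal log4j 1.2 appender definition would look like this (the log file path is an assumption):
# Hypothetical file-only appender sketch
log4j.appender.FILE_ONLY=org.apache.log4j.FileAppender
log4j.appender.FILE_ONLY.File=./norconex-collector.log
log4j.appender.FILE_ONLY.layout=org.apache.log4j.PatternLayout
log4j.appender.FILE_ONLY.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p [%c{1}] %m%n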
If not, do you also get the error when you run it directly from the command-line?
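As a stopgap, note that cron mails anything a job writes to stderr, which is where these log4j warnings go; redirecting stderr as well would at least stop the nightly mails (the destination file below is just an example):
15 23 * * Mon-Fri crawler cd /opt/norconex/collector-http/ && ./collector-http.sh -c ../intranet/daily/conf/crawler.xml -a resume > /dev/null 2>> /var/log/norconex-cron.err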
Please note that this issue is already closed. Please create a new one, or your questions may get lost.
I have a configuration with the line
When I launch the crawler the first time, it works perfectly (meaning no workdir has been created yet), but on the second launch (-a start) it delivers the following error:
Why does it try to create the Norconex/jef directory?