Closed jetnet closed 3 years ago
Do you have a way to reproduce? At first glance, I suspect it may be the field extraction from both the STDOUT and STDERR happening at the same time when running an external process. While it is being investigated, a possible workaround is to redirect your external process STDERR to STDOUT.
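For illustration only, here is a minimal Java sketch of what "redirect STDERR to STDOUT" amounts to on the launching side: with the two streams merged, a single reader consumes all output, so no two readers run at the same time. This is not how ExternalTransformer starts its command; the class name `MergedStreamsSketch`, the method `runMerged`, and the `sh -c` test command are assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class MergedStreamsSketch {

    // Hypothetical helper: runs a command with STDERR merged into STDOUT
    // and returns every output line, read by a single thread.
    static List<String> runMerged(String... command)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Merge STDERR into STDOUT so only one stream has to be read.
        builder.redirectErrorStream(true);
        Process process = builder.start();

        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        process.waitFor();
        return lines;
    }

    public static void main(String[] args) throws Exception {
        // Assumes a POSIX "sh" is available on the PATH.
        for (String line : runMerged("sh", "-c",
                "echo to-stdout; echo to-stderr 1>&2")) {
            System.out.println(line);
        }
    }
}
```

Inside a shell script, the equivalent is a redirect such as `exec 2>&1` near the top of the script.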
I don't know how to reproduce it, unfortunately: there are plenty of crawlers running, and I don't even know which instance was throwing that exception.
Thanks for the hint: I'll take a look at the ExternalTransformer shell scripts.
Were you able to reproduce it after redirecting STDERR? I had a deeper look and nothing jumps out at me. I cannot see what other thread may be conflicting. Have you extended the crawler, and are you spawning new threads yourself? Can you share your config in case something jumps out at me?
I added a redirect for STDERR to /dev/null at the top of the shell scripts, like this:
tty -s || exec 2> /dev/null
We'll see if that helps...
P.S. The scripts are run as follows:
<!-- generate "perceptive" hashes for images -->
<transformer class="$ExternalTransformer">
  <restrictTo caseSensitive="false" field="document.contentType">image/.*</restrictTo>
  <command>
    ${ffhome}/bin/ffImagePhash.sh -m ${INPUT_META} -o ${OUTPUT} -d ${OUTPUT_META} -t "middle:10_3 high:10_10"
  </command>
  <metadata inputFormat="properties" outputFormat="properties">
    <pattern field="dummyField">to-make-XML-validator-happy</pattern>
  </metadata>
  <tempDir>/tmp</tempDir>
</transformer>
<!-- generate thumbnails for images -->
<transformer class="$ExternalTransformer">
  <restrictTo caseSensitive="false" field="document.contentType">image/.*</restrictTo>
  <command>
    ${ffhome}/bin/ffThumbnail.sh -w 212 -q 60 -m ${INPUT_META} -o ${OUTPUT} -p ${thumbnaildir}/${domain_xn} -u ${thumbnailurlpath}/${domain_xn}
  </command>
  <metadata inputFormat="properties">
    <pattern field="thumbnailImage" caseSensitive="false">^/9j/.+</pattern>
    <pattern field="thumbnailImagePath" caseSensitive="false">^${thumbnailurlpath}.+</pattern>
  </metadata>
  <tempDir>/tmp</tempDir>
</transformer>
It didn't help: the same errors are still being logged from time to time.
I still could not reproduce, but after digging further, I think I have a solution. The code is now making sure to use a dedicated instance of the non-thread-safe map for each thread involved in the ExternalTransformer. That should eliminate any form of concurrency possible on the faulty object.
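As a rough sketch of that idea (not the actual Importer code): each worker thread fills its own `HashMap`, and the results are merged on the calling thread only after the workers have finished, so no map is ever modified by two threads at once. The names `PerThreadMapSketch`, `extractFields`, and `run`, and the two "stdout"/"stderr" tasks, are hypothetical and chosen only for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerThreadMapSketch {

    // Hypothetical stand-in for "extract fields from one output stream".
    // The HashMap is created inside the task, so it is private to the
    // thread that runs it.
    static Map<String, String> extractFields(String streamName, int count) {
        Map<String, String> fields = new HashMap<>();
        for (int i = 0; i < count; i++) {
            fields.put(streamName + ".field" + i, "value" + i);
        }
        return fields;
    }

    // Runs one extraction task per stream, then merges the per-thread maps
    // on the calling thread once each task has completed.
    static Map<String, String> run(int fieldsPerStream) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            List<Future<Map<String, String>>> results = new ArrayList<>();
            for (String name : new String[] {"stdout", "stderr"}) {
                Callable<Map<String, String>> task =
                        () -> extractFields(name, fieldsPerStream);
                results.add(pool.submit(task));
            }
            Map<String, String> merged = new HashMap<>();
            for (Future<Map<String, String>> result : results) {
                // get() blocks until the task is done, so the map is no
                // longer being written to when it is read here.
                merged.putAll(result.get());
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("merged fields: " + run(1000).size());
    }
}
```

The alternative of sharing one map behind a lock or a `ConcurrentHashMap` would also avoid the exception; giving each thread its own instance simply removes the shared state altogether.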
A new HTTP Collector snapshot was made with an updated Importer lib, which is where the fix was made.
Please try and confirm.
Installed and running; I will let you know in a few days. Thanks a lot!
Looks good: no more ConcurrentModificationException errors!
Thank you very much again! Great support!
Hello Pascal,
There are some errors from time to time while the HTTP crawler is running:
Version:
norconex-commons-lang-1.15.2-SNAPSHOT.jar
Could you please take a look? Thanks!