Norconex / commons-lang

Generic library shared between several projects.
Apache License 2.0

ConcurrentModificationException #13

Closed: jetnet closed this issue 3 years ago

jetnet commented 3 years ago

Hello Pascal,

from time to time, the following error is logged while the HTTP crawler is running:

Exception in thread "StreamConsumer-STDOUT" java.util.ConcurrentModificationException
        at java.base/java.util.HashMap$HashIterator.nextNode(HashMap.java:1493)
        at java.base/java.util.HashMap$KeyIterator.next(HashMap.java:1516)
        at com.norconex.commons.lang.map.Properties.caseResolvedKey(Properties.java:1569)
        at com.norconex.commons.lang.map.Properties.get(Properties.java:1508)
        at com.norconex.commons.lang.map.Properties.get(Properties.java:82)
        at com.norconex.commons.lang.map.ObservableMap.get(ObservableMap.java:94)
        at com.norconex.commons.lang.map.Properties.get(Properties.java:1508)
        at com.norconex.commons.lang.map.Properties.addString(Properties.java:835)
        at com.norconex.importer.util.regex.RegexFieldExtractor.extractFields(RegexFieldExtractor.java:117)
        at com.norconex.importer.util.regex.RegexUtil.extractFields(RegexUtil.java:69)
        at com.norconex.importer.handler.transformer.impl.ExternalTransformer.extractMetaFromLine(ExternalTransformer.java:725)
        at com.norconex.importer.handler.transformer.impl.ExternalTransformer.access$100(ExternalTransformer.java:299)
        at com.norconex.importer.handler.transformer.impl.ExternalTransformer$1.lineStreamed(ExternalTransformer.java:674)
        at com.norconex.commons.lang.io.InputStreamLineListener.flushBuffer(InputStreamLineListener.java:117)
        at com.norconex.commons.lang.io.InputStreamLineListener.streamed(InputStreamLineListener.java:82)
        at com.norconex.commons.lang.io.InputStreamConsumer.fireStreamed(InputStreamConsumer.java:150)
        at com.norconex.commons.lang.io.InputStreamConsumer.run(InputStreamConsumer.java:100)

Version: norconex-commons-lang-1.15.2-SNAPSHOT.jar

Could you please take a look? Thanks!

essiembre commented 3 years ago

Do you have a way to reproduce it? At first glance, I suspect that field extraction from STDOUT and STDERR is happening at the same time when running an external process, with both stream consumers writing to the same map. While this is being investigated, a possible workaround is to redirect your external process STDERR to STDOUT.
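To illustrate the suspected failure mode, here is a minimal sketch of how a HashMap's fail-fast iterator reacts when the map is modified while its keys are being walked. It is single-threaded on purpose so that it fails every time; in the crawler the modification would presumably come from a second stream-consumer thread, which is why the error is intermittent. The class and method names are hypothetical and are not taken from commons-lang.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

public class FailFastDemo {

    // Hypothetical stand-in for a lookup that walks the key set (similar in
    // spirit to the Properties.caseResolvedKey frame in the stack trace)
    // while something else adds a field to the same map.
    static boolean mutateWhileIterating() {
        Map<String, String> fields = new HashMap<>();
        fields.put("title", "a");
        fields.put("author", "b");
        try {
            for (String key : fields.keySet()) {
                // Simulates the "other" consumer adding a field mid-iteration.
                fields.put("extra-" + key, "c");
            }
        } catch (ConcurrentModificationException e) {
            // The iterator noticed a structural change it did not make itself.
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("CME raised: " + mutateWhileIterating());
    }
}
```

With two real threads the same check fires only when an insertion happens to land between two iterator steps, which matches the "from time to time" behaviour reported above.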

jetnet commented 3 years ago

Unfortunately, I don't know how to reproduce it: there are plenty of crawlers running, and I don't even know which instance threw that exception. Thanks for the hint, I'll take a look at the ExternalTransformer shell scripts.

essiembre commented 3 years ago

Were you able to reproduce it after redirecting STDERR? I had a deeper look and nothing jumps out at me. I cannot see what other thread may be conflicting. Have you extended the crawler, and are you spawning new threads yourself? Can you share your config in case something stands out?

jetnet commented 3 years ago

I added a redirect of STDERR to /dev/null at the top of the shell scripts, like this:

tty -s || exec 2> /dev/null

We'll see if that helps...

P.S. The scripts are run as follows:

<!-- generate "perceptive" hashes for images -->
<transformer class="$ExternalTransformer">
  <restrictTo caseSensitive="false" field="document.contentType">image/.*</restrictTo>
  <command>
    ${ffhome}/bin/ffImagePhash.sh -m ${INPUT_META} -o ${OUTPUT} -d ${OUTPUT_META} -t "middle:10_3 high:10_10"
  </command>
  <metadata inputFormat="properties" outputFormat="properties">
    <pattern field="dummyField">to-make-XML-validator-happy</pattern>
  </metadata>
  <tempDir>/tmp</tempDir>
</transformer>

<!-- generate thumbnails for images -->
<transformer class="$ExternalTransformer">
  <restrictTo caseSensitive="false" field="document.contentType">image/.*</restrictTo>
  <command>
    ${ffhome}/bin/ffThumbnail.sh -w 212 -q 60 -m ${INPUT_META} -o ${OUTPUT} -p ${thumbnaildir}/${domain_xn} -u ${thumbnailurlpath}/${domain_xn}
  </command>
  <metadata inputFormat="properties">
    <pattern field="thumbnailImage" caseSensitive="false">^/9j/.+</pattern>
    <pattern field="thumbnailImagePath" caseSensitive="false">^${thumbnailurlpath}.+</pattern>
  </metadata>
  <tempDir>/tmp</tempDir>
</transformer>

jetnet commented 3 years ago

It didn't help; the same errors are still being logged from time to time.

essiembre commented 3 years ago

I still could not reproduce, but after digging further, I think I have a solution. The code is now making sure to use a dedicated instance of the non-thread-safe map for each thread involved in the ExternalTransformer. That should eliminate any form of concurrency possible on the faulty object.
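For readers who want to see the general idea, the following is a rough sketch of the "one map per thread" approach described above. It is not the actual Importer code: the class, the method names, the key=value line format and the "StreamConsumer-STDERR" thread name are all assumptions made for illustration. Each consumer thread fills its own HashMap, and the two maps are merged only after both threads have finished.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PerThreadMapDemo {

    // Hypothetical extractor: parses "key=value" lines into the given map.
    static void extract(List<String> lines, Map<String, List<String>> target) {
        for (String line : lines) {
            int eq = line.indexOf('=');
            if (eq > 0) {
                target.computeIfAbsent(line.substring(0, eq), k -> new ArrayList<>())
                      .add(line.substring(eq + 1));
            }
        }
    }

    // Each thread writes only to its own map, so no map is ever shared
    // between running threads. Merging happens after join().
    static Map<String, List<String>> collect(List<String> stdoutLines, List<String> stderrLines) {
        Map<String, List<String>> outFields = new HashMap<>();
        Map<String, List<String>> errFields = new HashMap<>();

        Thread out = new Thread(() -> extract(stdoutLines, outFields), "StreamConsumer-STDOUT");
        Thread err = new Thread(() -> extract(stderrLines, errFields), "StreamConsumer-STDERR");
        out.start();
        err.start();
        try {
            // join() guarantees the writes made by each thread are visible here.
            out.join();
            err.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for consumers", e);
        }

        // Single-threaded merge: STDOUT values first, then STDERR values.
        Map<String, List<String>> merged = new HashMap<>();
        outFields.forEach((k, v) -> merged.computeIfAbsent(k, x -> new ArrayList<>()).addAll(v));
        errFields.forEach((k, v) -> merged.computeIfAbsent(k, x -> new ArrayList<>()).addAll(v));
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(collect(List.of("a=1", "b=2"), List.of("a=3")));
    }
}
```

The design choice here is to avoid sharing rather than to add locking: since no thread ever touches another thread's map, the non-thread-safe map can stay as it is.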

A new HTTP Collector snapshot was made with an updated Importer lib, which is where the fix was made.

Please try and confirm.

jetnet commented 3 years ago

Installed and running; I will let you know in a few days. Thanks a lot!

jetnet commented 3 years ago

Looks good: no more ConcurrentModificationException errors! Thank you very much again! Great support!