Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Last batch is not committed to Solr #251

Closed. V3RITAS closed this issue 8 years ago.

V3RITAS commented 8 years ago

We are using the Norconex HTTP Collector to crawl about 4,600 files and send the extracted information to a Solr server.

The crawling process itself works fine, and during the crawl all batches are correctly sent to the Solr server (queue size 500, batch size 100). Unfortunately, when the crawling process reaches 100%, the last batch of documents is not sent to the Solr server and the process seems to be stuck at this point:

Export: 2016-06-02 11:56:05 INFO - Export: 100% completed (4568 processed/4568 total)

If I check the tasks/processes on the OS, the corresponding Java process is still running and is not terminated correctly, even after several hours. During this time the process is still accessing/modifying the .job file in the progress/latest/status folder.

I've tested this on my local Windows PC and on a Unix server. It happens whether the crawler is started manually or via crontab.

Do you have any idea why this happens and how we can fix it? Unfortunately I can't find any log entries that point to the actual problem.

Here is our config just in case:

<?xml version="1.0" encoding="UTF-8"?>

<httpcollector id="HTTP Collector">

  <progressDir>${workdir}/progress</progressDir>
  <logsDir>${workdir}/logs</logsDir>

  <crawlers>
    <crawler id="Export">

      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="false">
        <url>...</url>  
      </startURLs>

      <workDir>${workdir}</workDir>
      <maxDepth>-1</maxDepth>
      <!--maxDocuments>100</maxDocuments-->
      <sitemap ignore="true" />
      <sitemapResolverFactory ignore="true" />
      <delay default="1500" />

      <httpClientFactory>

        <authMethod>basic</authMethod>
        <authUsername>...</authUsername>
        <authPassword>...</authPassword>

      </httpClientFactory>

      <referenceFilters>
          <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter" onMatch="exclude" caseSensitive="false" >
              png,jpg,jpeg,gif,ico,css,js
          </filter>        
      </referenceFilters>

      <importer>

        <parseErrorsSaveDir>${workdir}/importer/parseErrors</parseErrorsSaveDir>            

        <preParseHandlers>

          <!-- Extract last update information from body -->
          <tagger class="com.norconex.importer.handler.tagger.impl.TextBetweenTagger" inclusive="false" caseSensitive="false" >
              <textBetween name="lastupdate">
                  <start><![CDATA[<!--lastupdate:]]></start>
                  <end><![CDATA[-->]]></end>
              </textBetween>              
            <!-- only for html files -->
            <restrictTo field="document.contentType">text/html</restrictTo>
          </tagger>           

          <!-- Only consider content between startindex and stopindex comments -->
          <transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer" inclusive="true">          
            <stripBetween><start><![CDATA[<body>]]></start><end><![CDATA[<!--\s*?startindex\s*?-->]]></end></stripBetween>
            <stripBetween><start><![CDATA[<!--\s*?stopindex\s*?-->]]></start><end><![CDATA[<!--\s*?startindex\s*?-->]]></end></stripBetween>
            <stripBetween><start><![CDATA[<!--\s*?stopindex\s*?-->]]></start><end><![CDATA[</body>]]></end></stripBetween>
            <!-- only for html files -->
            <restrictTo field="document.contentType">text/html</restrictTo>
          </transformer> 

        </preParseHandlers>      

        <postParseHandlers>

          <!-- Extracts file name and extension from URL and puts them in defined fields; also sets file type according to defined mapping -->
          <tagger class="com.divae.informer.crawler.cli.tagger.FilenameTagger" filenameKey="file_name" extensionKey="file_extension">
            <extensionMappings mapKey="file_type"> 
              <extensionMapping to="word">
                <extension>doc</extension>
                <extension>dot</extension>
                <extension>rtf</extension>
                <extension>docx</extension>
                <extension>docm</extension>
                <extension>dotx</extension>
                <extension>dotm</extension>
              </extensionMapping>
              <extensionMapping to="excel">
                <extension>xls</extension>
                <extension>xlt</extension>
                <extension>xlsx</extension>
                <extension>xlsm</extension>
                <extension>xltx</extension>
              </extensionMapping>
              <extensionMapping to="pdf">
                <extension>pdf</extension>
              </extensionMapping>
              <extensionMapping to="powerpoint">
                <extension>ppt</extension>
                <extension>pptx</extension>
                <extension>pptm</extension>
                <extension>pps</extension>
                <extension>ppsx</extension>
                <extension>ppz</extension>
                <extension>ppa</extension>
                <extension>ppax</extension>
                <extension>pot</extension>
                <extension>potx</extension>
                <extension>potm</extension>                
              </extensionMapping>               
            </extensionMappings>
            <!-- only for non-html files -->            
            <restrictTo caseSensitive="false" field="document.contentType"><![CDATA[^.+\/(?!html).+$]]></restrictTo>              
          </tagger>          

          <!-- Sets the site type according to certain meta information and file type -->
          <tagger class="com.divae.informer.crawler.cli.tagger.SiteTypeTagger" 
              siteTypeFieldName="docType" keyName="site_type" fileType="file" unknownType="unknown">
            <types>
              <type mapping="seite">informer</type>
              <type mapping="meldung">news</type>                
            </types>
          </tagger>          

          <!-- Keep single values for some fields -->
          <tagger class="com.norconex.importer.handler.tagger.impl.ForceSingleValueTagger">
            <!-- Binary files can have more than one date -->
            <singleValue field="Last-Modified" action="keepFirst" />
            <!-- Only keep first last-update date as extraction with TextBetweenTagger creates multivalue field -->
            <singleValue field="lastupdate" action="keepFirst" />               
            <!-- Merge all titles if multiple exist-->
            <singleValue field="title" action="mergeWith:; " />
          </tagger>

          <!-- Format last-modified date so it matches Solr expected date format -->
          <tagger class="com.norconex.importer.handler.tagger.impl.DateFormatTagger" 
              fromField="Last-Modified" fromFormat="EEE, dd MMM yyyy HH:mm:ss z" toFormat="yyyy-MM-dd'T'HH:mm:ss'Z'" 
              overwrite="true">
          </tagger>

          <!-- Change date format of lastupdate field (is set by CMS); then copy it to Last-Modified field -->
          <tagger class="com.norconex.importer.handler.tagger.impl.DateFormatTagger" 
              fromField="lastupdate" fromFormat="yyyy-MM-dd" toFormat="yyyy-MM-dd'T'HH:mm:ss'Z'" overwrite="true">
          </tagger>
          <tagger class="com.norconex.importer.handler.tagger.impl.CopyTagger">
            <copy fromField="lastupdate" toField="Last-Modified" overwrite="true" />
            <!-- only for html files -->
            <restrictTo field="document.contentType">text/html</restrictTo>
          </tagger>             

          <!-- Set keyword field name to lower case (can be first character uppercase in some binary files) -->    
          <tagger class="com.norconex.importer.handler.tagger.impl.CharacterCaseTagger">
            <characterCase fieldName="keywords" type="lower" applyTo="field" />
          </tagger>        

          <!-- Split keywords field as it is multivalued -->
          <tagger class="com.norconex.importer.handler.tagger.impl.SplitTagger">
            <split fromField="keywords" regex="false"><separator> </separator></split>
          </tagger>

          <!-- Only keep necessary tags -->          
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>
              id,document.reference,title,description,keywords,area,subarea,Last-Modified,site_type,dbcategory,breadcrumb,file_extension,file_type,file_name
            </fields>
          </tagger>

          <!-- Rename certain fields -->
          <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
            <rename fromField="Last-Modified" toField="last_modified" overwrite="true" />      
          </tagger>        

        </postParseHandlers>

      </importer> 

      <!-- Send data to Solr; it is better to set the "commitDisabled" parameter to true -->
      <committer class="com.norconex.committer.solr.SolrCommitter">
        <solrURL>...</solrURL>
        <commitDisabled>true</commitDisabled>
        <sourceReferenceField keep="false">document.reference</sourceReferenceField>
        <targetReferenceField>id</targetReferenceField>
        <targetContentField>content</targetContentField>
        <queueDir>${workdir}/committer-queue</queueDir>
        <queueSize>500</queueSize>
        <commitBatchSize>100</commitBatchSize>
        <maxRetries>2</maxRetries>
        <maxRetryWait>5000</maxRetryWait>
      </committer>

    </crawler>
  </crawlers>

</httpcollector>
essiembre commented 8 years ago

It is hard for me to reproduce, given that I do not have your start URL and you use custom classes (which I always like to see!). Does it happen even with a low number of documents to crawl? For instance, does it happen when you uncomment the maxDocuments tag (100 docs max)?

Which version are you using? I am wondering if this problem is related to a thread-lock issue with MapDB identified in #234 (which caused the crawler to hang). Have you tried the 2.5.0 snapshot version, which uses a different "crawl store" by default? If you do not want a snapshot release, you can try explicitly setting a different crawl store. Try the following (note: changing to a new crawl store will force a "fresh" crawl):

<crawlDataStoreFactory 
          class="com.norconex.collector.core.data.store.impl.mvstore.MVStoreCrawlDataStoreFactory" />

Let me know if using MVStore instead of MapDB resolves the issue for you.

V3RITAS commented 8 years ago

Hi Pascal,

I know it's hard to reproduce without the start URL, but unfortunately we are crawling an internal website which is not available on the Internet. :(

This problem doesn't occur with a low number of documents; I've tried it successfully with 100 max docs. In fact, we used to crawl about 2,300 documents without problems and only recently increased this to 4,600 files.

The custom classes you mentioned are very simple. We just use some meta information from the files to set specific fields on the documents (e.g. we check the file extension and set a "file type" field accordingly). If you want, I can provide those classes to you.

We are already using the 2.5.0-SNAPSHOT version, but just to be sure I've added the XML snippet you provided to our config, with the same result: it still gets stuck at 100% and doesn't commit the rest of the documents to Solr.

I've noticed in the Windows Task Manager that the corresponding Java process uses 2,700 MB of RAM and keeps holding on to it after it gets stuck. I don't know if this is normal behavior or might be related to this problem. If I scroll back a bit through the console output, I also see this:

INFO  - AbstractCrawler            - Export: 98% completed (4462 processed/4523 total)
Exception in thread "pool-1-thread-2" java.lang.OutOfMemoryError: Java heap space
INFO  - DOCUMENT_FETCHED           -          DOCUMENT_FETCHED: https://our-internal-url.de/document.pdf
        at java.util.Arrays.copyOf(Unknown Source)INFO  - CREATED_ROBOTS_META        -       CREATED_ROBOTS_META: https://our-internal-url.de/document.pdf

INFO  - URLS_EXTRACTED             -            URLS_EXTRACTED: https://our-internal-url.de/document.pdf
        at java.util.ArrayList.grow(Unknown Source)
        at java.util.ArrayList.ensureExplicitCapacity(Unknown Source)
        at java.util.ArrayList.ensureCapacityInternal(Unknown Source)
        at java.util.ArrayList.addAll(Unknown Source)
        at com.norconex.commons.lang.map.Properties.get(Properties.java:1269)
        at com.norconex.commons.lang.map.Properties.get(Properties.java:70)
        at com.norconex.commons.lang.map.ObservableMap.get(ObservableMap.java:94)
        at com.norconex.commons.lang.map.Properties.get(Properties.java:1269)
        at com.norconex.commons.lang.map.Properties.addString(Properties.java:591)
        at com.norconex.importer.parser.impl.AbstractTikaParser.addTikaMetadata(AbstractTikaParser.java:214)
        at com.norconex.importer.parser.impl.AbstractTikaParser$MergeEmbeddedParser.parse(AbstractTikaParser.java:433)
        at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)WARN  - PDTrueTypeFont             - Using fallback font 'ArialMT' for 'Frutiger45Light'

        at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
        at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:129)
INFO  - DOCUMENT_IMPORTED          -         DOCUMENT_IMPORTED: https://our-internal-url.de/document.pdf
        at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:323)
        at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)INFO  - DOCUMENT_COMMITTED_ADD     -    DOCUMENT_COMMITTED_ADD: https://our-internal-url.de/document.pdf

        at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177)
        at com.norconex.importer.parser.impl.AbstractTikaParser$MergeEmbeddedParser.parse(AbstractTikaParser.java:432)
        at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:166)
INFO  - AbstractCrawler            - Export: 98% completed (4463 processed/4523 total)
        at com.norconex.importer.Importer.parseDocument(Importer.java:422)
        at com.norconex.importer.Importer.importDocument(Importer.java:318)
INFO  - RetryExec                  - I/O exception (java.net.SocketException) caught when processing request to {s}->https://our-internal-url.de:443: Software caused connection abort: socket write error
        at com.norconex.importer.Importer.doImportDocument(Importer.java:271)INFO  - RetryExec                  - Retrying request to {s}->https://our-internal-url.de:443

        at com.norconex.importer.Importer.importDocument(Importer.java:195)
        at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:37)
        at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)

But this happens at about 98%, and the crawler seems to continue after this exception. The exception is also not written to the log file.

Do you have any other ideas about what could cause this problem? Maybe it has something to do with the PDF file mentioned in the error message. I might do some testing without this file, or even without PDF files.

essiembre commented 8 years ago

The OutOfMemoryError you found is more than likely the cause. Stability goes away once you get memory errors. With the default of 2 threads, I do not consider using 2 GB of memory "normal". Either there is a memory leak somewhere triggered by something in your crawling or config, or a third-party library eats too much memory when parsing a specific file (this has happened before).

To verify whether the problem is with a specific file, could you extract the files that appear in the logs just before the OutOfMemoryError and send them to me for testing? I suspect one of them is causing the memory error. Maybe one of them has many embedded objects which, when parsed, are just too much. You can also create a new crawler for testing that points only to those few files near the error in your logs, adding them all as start URLs (see the sketch below). Hopefully that will help you narrow it down to the guilty file.
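
To make that concrete, here is a minimal sketch of such a test crawler, reusing the structure of the config posted above (the crawler id and the start URLs are placeholders; substitute the documents that appear in your logs right before the OutOfMemoryError):

<crawler id="OOM-Test">

  <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="false">
    <!-- Placeholder URLs: list the few suspect documents here. -->
    <url>https://our-internal-url.de/suspect-file-1.ppt</url>
    <url>https://our-internal-url.de/suspect-file-2.pdf</url>
  </startURLs>

  <workDir>${workdir}/oom-test</workDir>
  <!-- Depth 0: only the listed start URLs are processed, no links are followed. -->
  <maxDepth>0</maxDepth>

  <!-- Copy the httpClientFactory (authentication) and, if needed, the
       importer settings from the original "Export" crawler. -->

</crawler>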

Additionally, you can simply give the crawler more Java heap memory at startup, but it would be nice to pinpoint the exact cause (and provide a fix).

V3RITAS commented 8 years ago

P.S.

During another test I found the same stack trace, but in a clearer format:

Exception in thread "pool-1-thread-2" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3181)
    at java.util.ArrayList.grow(ArrayList.java:261)
    at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:235)
    at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:227)
    at java.util.ArrayList.addAll(ArrayList.java:579)
    at com.norconex.commons.lang.map.Properties.get(Properties.java:1269)
    at com.norconex.commons.lang.map.Properties.get(Properties.java:70)
    at com.norconex.commons.lang.map.ObservableMap.get(ObservableMap.java:94)
    at com.norconex.commons.lang.map.Properties.get(Properties.java:1269)
    at com.norconex.commons.lang.map.Properties.addString(Properties.java:591)
    at com.norconex.importer.parser.impl.AbstractTikaParser.addTikaMetadata(AbstractTikaParser.java:214)
    at com.norconex.importer.parser.impl.AbstractTikaParser$MergeEmbeddedParser.parse(AbstractTikaParser.java:433)
    at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
    at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:129)
    at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:323)
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177)
    at com.norconex.importer.parser.impl.AbstractTikaParser$MergeEmbeddedParser.parse(AbstractTikaParser.java:432)
    at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:166)
    at com.norconex.importer.Importer.parseDocument(Importer.java:422)
    at com.norconex.importer.Importer.importDocument(Importer.java:318)
    at com.norconex.importer.Importer.doImportDocument(Importer.java:271)
    at com.norconex.importer.Importer.importDocument(Importer.java:195)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:37)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)

It looks like there is a problem parsing a binary file with Tika, which leads to the OutOfMemoryError. Unfortunately I can't figure out right now which document(s) are causing the trouble.

By the way, I did another test and increased the heap size to 6 GB (-Xmx6g), but this also resulted in an OutOfMemoryError. I saw in the Task Manager that the Java process was allocating more than 4 GB of memory.

V3RITAS commented 8 years ago

Hi Pascal,

I just posted my comment at the same time you posted your feedback. :) I will try to locate the file(s) causing the trouble, but I don't know if I can send them to you, as they might be confidential. At the very least I will try to analyze them and give you some feedback!

essiembre commented 8 years ago

That's way too much memory. Another approach to try to find the guilty file is to add this setting to your crawler config:

 <keepDownloads>true</keepDownloads>

This can fill up your disk space, so be careful, but you can check which files were created most recently when you get the errors. The above flag adds a /downloads directory under your workdir (a placement sketch follows below).
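
For reference, a minimal placement sketch based on the crawler config posted above (only the keepDownloads element is new; the neighboring settings are copied from the original config):

<crawler id="Export">

  <workDir>${workdir}</workDir>
  <maxDepth>-1</maxDepth>
  <!-- Keep a copy of every downloaded document under ${workdir}/downloads
       so the most recently written files point at the parsing culprit. -->
  <keepDownloads>true</keepDownloads>
  <delay default="1500" />

  <!-- ... remaining settings as in the original config ... -->

</crawler>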

Another tip would be to run with a Java profiler to see whether the memory is leaked slowly or all eaten up at once. That would be another indication of whether it is a leak or a file-specific issue.

If you are permitted, you can send your confidential files to me directly via email instead of posting them here (my email is in my GitHub profile).

essiembre commented 8 years ago

I meant <keepDownloads>true</keepDownloads> in my previous post... I fixed the post.

V3RITAS commented 8 years ago

Hi Pascal,

Just a quick note before I'm off for the weekend: I've found the culprit. It is a PowerPoint file that causes the OutOfMemoryError. I will have a closer look at it next week and give you feedback.

Thanks for your help so far!

essiembre commented 8 years ago

Glad you found the issue. Someone else reported problems with PowerPoint before. If you do not mind coding a bit, the following proved to be a successful solution.

Create this new class:

import java.util.Map;

import org.apache.tika.extractor.DocumentSelector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.OfficeParser;

import com.norconex.commons.lang.file.ContentType;
import com.norconex.importer.parser.GenericDocumentParserFactory;
import com.norconex.importer.parser.IDocumentParser;
import com.norconex.importer.parser.impl.AbstractTikaParser;

public class CustomDocumentParserFactory extends GenericDocumentParserFactory {

    // Selector that rejects every embedded document, so embedded objects
    // (e.g. images inside PowerPoint slides) are never parsed for metadata.
    private final DocumentSelector pptSelector = new DocumentSelector() {
        @Override
        public boolean select(Metadata metadata) {
            return false;
        }
    };

    // Tika-based parser for PowerPoint that puts the selector above into
    // the parse context, so embedded documents are skipped during parsing.
    private final IDocumentParser pptParser = 
            new AbstractTikaParser(new OfficeParser()) {
        protected void modifyParseContext(ParseContext parseContext) {
            parseContext.set(DocumentSelector.class, pptSelector);
        }
    };

    // Keep the default parsers for everything else, but use the custom
    // PowerPoint parser for the MS PowerPoint content type.
    @Override
    protected Map<ContentType, IDocumentParser> createNamedParsers() {
        Map<ContentType, IDocumentParser> parsers = super.createNamedParsers();
        parsers.put(ContentType.valueOf(
                "application/vnd.ms-powerpoint"), pptParser);
        return parsers;
    }
}

Then add the above class to your <importer> config:

<documentParserFactory class="CustomDocumentParserFactory" />

Let me know if that works for you. I will consider making this a permanent solution in the Importer module.
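
For context, here is a sketch of how the <importer> section of the config posted earlier might look with this factory registered (assuming the compiled class is on the collector's classpath; if you put it in a package, use the fully qualified class name):

<importer>

  <parseErrorsSaveDir>${workdir}/importer/parseErrors</parseErrorsSaveDir>

  <preParseHandlers>
    <!-- ... existing pre-parse handlers ... -->
  </preParseHandlers>

  <!-- Parse PowerPoint files without extracting their embedded documents. -->
  <documentParserFactory class="CustomDocumentParserFactory" />

  <postParseHandlers>
    <!-- ... existing post-parse handlers ... -->
  </postParseHandlers>

</importer>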

jetnet commented 8 years ago

Hi guys, yes, it was me. :) My issue was a PPT file with a lot of embedded images. The Importer tries to extract metadata from all of them and seems to run into an OOM. Pascal could not reproduce the issue with the stand-alone Importer, and I still cannot find a couple of minutes for my own test. :)

V3RITAS commented 8 years ago

Hey guys,

I finally had the time to do some more testing. These are my results so far: with the custom document parser factory in place, the crawl completes without the OutOfMemoryError and the remaining documents are committed to Solr, so the workaround works fine for us.

@jetnet: It would be great if you could keep us informed about your own tests if you have the time. But I think this looks like a really good solution!

@essiembre: I think it would be a good idea to make this a permanent solution. Unfortunately I can't send you the PowerPoint file for further analysis as it contains confidential content. But maybe you can try to create a big PowerPoint file with lots of images in it, which will hopefully trigger the same OOM exception?

Thanks for your help guys, really appreciate it!

essiembre commented 8 years ago

@V3RITAS: I was able to test with a separate file someone sent me, and it seems related to certain image files sometimes carrying a lot of metadata (generated metadata derived from the image). That metadata may not be of much use, since in most cases it is not meant for people to read, so I will see what can be done to isolate the solution to those types of images.

V3RITAS commented 8 years ago

Thanks Pascal!

As I said, the current solution works fine for us for now, but if you find a way to isolate the fix to those specific types of images, let me know.

essiembre commented 8 years ago

The Importer module was updated and now offers more control over which embedded documents you want extracted. The latest HTTP Collector snapshot includes the updated Importer module.

Refer to the Importer ticket https://github.com/Norconex/importer/issues/19 for more details (and usage samples). I am closing this ticket, since the parsing of embedded documents and the related memory issue are now duplicates better addressed in the Importer project.