Norconex / committer-elasticsearch

Implementation of Norconex Committer for Elasticsearch.
https://opensource.norconex.com/committers/elasticsearch/
Apache License 2.0

Type name in Elasticsearch sometimes does not match the one specified in the crawler #5

Closed (tungdx closed this issue 8 years ago)

tungdx commented 8 years ago

Hi, I ran into the following problem. I have two crawlers:

1. The vnexpress crawler commits data to the vnexpress type in Elasticsearch.
2. The dantri crawler commits data to the dantri type in Elasticsearch.

However, a document from the vnexpress crawler is sometimes committed to the dantri type, and vice versa.

I used:

This is my config file:

<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="Crawler">
  <progressDir>./examples-output/minimum/progress</progressDir>
  <logsDir>./examples-output/minimum/logs</logsDir>
  <crawlerDefaults>
    <maxDepth>5</maxDepth>
    <sitemapResolverFactory ignore="true" />
    <delay default="1000" />
    <importer>
      <preParseHandlers>
        <filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
            onMatch="include" field="document.contentType">
          (text/html|text/htm)
        </filter>
      </preParseHandlers>
      <postParseHandlers>
        <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
          <fields>title,document.reference</fields>
        </tagger>
      </postParseHandlers>
    </importer>
  </crawlerDefaults>
  <crawlers>
    <crawler id="vnexpress">
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>http://vnexpress.net/</url>
      </startURLs>
      <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
        <indexName>news</indexName>
        <typeName>vnexpress</typeName>
        <queueSize>1</queueSize>
      </committer>
    </crawler>
    <crawler id="dantri">
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>http://dantri.com.vn/</url>
      </startURLs>
      <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
        <indexName>news</indexName>
        <typeName>dantri</typeName>
        <queueSize>1</queueSize>
      </committer>
    </crawler>
  </crawlers>
</httpcollector>

This is the data in Elasticsearch:

![norconex](https://cloud.githubusercontent.com/assets/1876051/18226348/fc89a9e4-7231-11e6-81f3-26643a04eb56.PNG)

essiembre commented 8 years ago

Add a different `<queueDir>...</queueDir>` tag to each of your Elasticsearch Committer configs and it should fix your issue.
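A minimal sketch of that change, assuming the two committers from the original config. The queue directory paths here are illustrative and do not come from this thread; any two distinct locations should do. The likely reason it helps is that, without an explicit `queueDir`, both committers queue documents in the same default directory, so one committer can end up sending documents that the other one queued.

```xml
<crawlers>
  <crawler id="vnexpress">
    <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
      <url>http://vnexpress.net/</url>
    </startURLs>
    <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
      <indexName>news</indexName>
      <typeName>vnexpress</typeName>
      <!-- Hypothetical path: a queue directory used only by this committer -->
      <queueDir>./committer-queue/vnexpress</queueDir>
      <queueSize>1</queueSize>
    </committer>
  </crawler>
  <crawler id="dantri">
    <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
      <url>http://dantri.com.vn/</url>
    </startURLs>
    <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
      <indexName>news</indexName>
      <typeName>dantri</typeName>
      <!-- Hypothetical path: must differ from the vnexpress queue directory -->
      <queueDir>./committer-queue/dantri</queueDir>
      <queueSize>1</queueSize>
    </committer>
  </crawler>
</crawlers>
```

Only the `<queueDir>` lines are new; the rest of each crawler definition is unchanged from the original config.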

tungdx commented 8 years ago

It worked exactly as I expected. Thank you very much!