Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

ElasticSearchCommitter said it committed documents but the index is empty #556

Closed abolotnov closed 5 years ago

abolotnov commented 5 years ago

I am learning to configure and run the crawler and use ElasticSearchCommitter. I see the following in the logs:

Crawler_1: 2019-02-05 19:53:36 INFO - Sending 10 commit operations to Elasticsearch.
Crawler_1: 2019-02-05 19:53:36 INFO - Done sending commit operations to Elasticsearch.

Elastic logs don't have much:

[2019-02-05T19:48:46,711][INFO ][o.e.c.m.MetaDataCreateIndexService] [BiURPDs] [norconex] creating index, cause [auto(bulk api)], templates [], shards [5]/[1], mappings []
[2019-02-05T19:48:47,136][INFO ][o.e.c.m.MetaDataMappingService] [BiURPDs] [norconex/ojQVO0-gQQOvThpd7FPC-Q] create_mapping [web]
[2019-02-05T19:48:47,189][INFO ][o.e.c.m.MetaDataMappingService] [BiURPDs] [norconex/ojQVO0-gQQOvThpd7FPC-Q] update_mapping [web]

But the norconex index in Elasticsearch is empty. I can see the index and its fields, but it has no content.

Is there anything I am missing in my config, or anything I need to configure on the Elasticsearch end? It's a pretty vanilla installation; I didn't change much in it.
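For anyone hitting the same symptom: a quick way to confirm whether documents actually reached Elasticsearch, independent of Kibana, is to query the index directly. The node address and index name here are the ones from the config below; adjust if yours differ.

```shell
# Count documents committed to the "norconex" index
# (node address and index name taken from the collector config below)
curl -s 'http://127.0.0.1:9200/norconex/_count?pretty'

# Fetch a few documents to inspect the committed fields,
# including the "body" targetContentField
curl -s 'http://127.0.0.1:9200/norconex/_search?size=3&pretty'
```

If `_count` is non-zero, the committer did its job and the problem is on the viewing side (e.g. a Kibana filter), not in the crawler config.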

<httpcollector id="Norconex Complex Collector">

  #set($http = "com.norconex.collector.http")
  #set($core = "com.norconex.collector.core")
  #set($urlNormalizer   = "${http}.url.impl.GenericURLNormalizer")
  #set($filterExtension = "${core}.filter.impl.ExtensionReferenceFilter")
  #set($filterRegexRef  = "${core}.filter.impl.RegexReferenceFilter")
  #set($committerClass = "com.norconex.committer.elasticsearch.ElasticsearchCommitter")
  #set($searchUrl = "http://127.0.0.1:9200")

  <progressDir>/srv/genderfair/progs/norconex/collector/progress</progressDir>
  <logsDir>/srv/genderfair/progs/norconex/collector/logs</logsDir>

  <crawlerDefaults>
    <referenceFilters>
      <filter class="$filterExtension" onMatch="exclude">jpg,gif,png,ico,css,js,txt,xml,json,jpeg,tiff,doc,docx,ppt</filter>
    </referenceFilters>
    <urlNormalizer class="$urlNormalizer">
      <normalizations>
        removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence,
        decodeUnreservedCharacters, removeDefaultPort, encodeNonURICharacters,
        removeDotSegments
      </normalizations>
    </urlNormalizer>
    <maxDepth>4</maxDepth>
    <workDir>/srv/genderfair/progs/norconex/collector</workDir>
    <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
    <sitemapResolverFactory ignore="true" />
  </crawlerDefaults>
  <crawlers>

    <crawler id="Crawler_1">
      <httpClientFactory>
        <trustAllSSLCertificates>true</trustAllSSLCertificates>
      </httpClientFactory>
      <maxDocuments>500</maxDocuments>
      <startURLs stayOnDomain="true" stayOnPort="false" stayOnProtocol="false">
        <urlsFile>/srv/genderfair/data/norconex/urls_short.txt</urlsFile>
      </startURLs>
      <committer class="$committerClass">
        <nodes>$searchUrl</nodes>
        <indexName>norconex</indexName>
        <typeName>web</typeName>
        <queueDir>committer-queue</queueDir>
        <targetContentField>body</targetContentField>
        <queueSize>10</queueSize>
        <commitBatchSize>50</commitBatchSize>
      </committer>
    </crawler>

  </crawlers>
</httpcollector>
abolotnov commented 5 years ago

My bad: I confused the Creation-Date field with the Date field, so everything got filtered out by the time filter in Kibana. The documents were in the index all along.
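For others who land here: Kibana's index-pattern time field silently hides any document that falls outside the selected time range. Checking the mapping and running a search with no time filter makes this kind of mismatch obvious (the index name is the one from this thread):

```shell
# List the fields (including date fields) Elasticsearch actually mapped
curl -s 'http://127.0.0.1:9200/norconex/_mapping?pretty'

# Match-all search with no time filter to confirm the documents exist
curl -s 'http://127.0.0.1:9200/norconex/_search?q=*&size=5&pretty'
```

If documents show up here but not in Kibana Discover, the Kibana time field or time range is the culprit.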