Closed abolotnov closed 5 years ago
I am learning to configure and run the crawler with the ElasticsearchCommitter. I see the following in the logs:
Crawler_1: 2019-02-05 19:53:36 INFO - Sending 10 commit operations to Elasticsearch.
Crawler_1: 2019-02-05 19:53:36 INFO - Done sending commit operations to Elasticsearch.
The Elasticsearch logs don't show much:
[2019-02-05T19:48:46,711][INFO ][o.e.c.m.MetaDataCreateIndexService] [BiURPDs] [norconex] creating index, cause [auto(bulk api)], templates [], shards [5]/[1], mappings []
[2019-02-05T19:48:47,136][INFO ][o.e.c.m.MetaDataMappingService] [BiURPDs] [norconex/ojQVO0-gQQOvThpd7FPC-Q] create_mapping [web]
[2019-02-05T19:48:47,189][INFO ][o.e.c.m.MetaDataMappingService] [BiURPDs] [norconex/ojQVO0-gQQOvThpd7FPC-Q] update_mapping [web]
But the norconex index in Elasticsearch appears to be empty. I can see the index and its fields, but it has no content.
Is there anything I am missing in my config, or anything I need to configure on the Elasticsearch end? It's a pretty vanilla installation; I didn't change much in it.
<httpcollector id="Norconex Complex Collector">

  #set($http = "com.norconex.collector.http")
  #set($core = "com.norconex.collector.core")
  #set($urlNormalizer = "${http}.url.impl.GenericURLNormalizer")
  #set($filterExtension = "${core}.filter.impl.ExtensionReferenceFilter")
  #set($filterRegexRef = "${core}.filter.impl.RegexReferenceFilter")
  #set($committerClass = "com.norconex.committer.elasticsearch.ElasticsearchCommitter")
  #set($searchUrl = "http://127.0.0.1:9200")

  <progressDir>/srv/genderfair/progs/norconex/collector/progress</progressDir>
  <logsDir>/srv/genderfair/progs/norconex/collector/logs</logsDir>

  <crawlerDefaults>
    <referenceFilters>
      <filter class="$filterExtension" onMatch="exclude">jpg,gif,png,ico,css,js,txt,xml,json,jpeg,tiff,doc,docx,ppt,png</filter>
    </referenceFilters>
    <urlNormalizer class="$urlNormalizer">
      <normalizations>
        removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence,
        decodeUnreservedCharacters, removeDefaultPort,
        encodeNonURICharacters, removeDotSegments
      </normalizations>
    </urlNormalizer>
    <maxDepth>4</maxDepth>
    <workDir>/srv/genderfair/progs/norconex/collector</workDir>
    <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
    <sitemapResolverFactory ignore="false" />
  </crawlerDefaults>

  <crawlers>
    <crawler id="Crawler_1">
      <httpClientFactory>
        <trustAllSSLCertificates>true</trustAllSSLCertificates>
      </httpClientFactory>
      <maxDocuments>500</maxDocuments>
      <startURLs stayOnDomain="true" stayOnPort="false" stayOnProtocol="false">
        <urlsFile>/srv/genderfair/data/norconex/urls_short.txt</urlsFile>
      </startURLs>
      <committer class="$committerClass">
        <nodes>$searchUrl</nodes>
        <indexName>norconex</indexName>
        <typeName>web</typeName>
        <queueDir>committer-queue</queueDir>
        <targetContentField>body</targetContentField>
        <queueSize>10</queueSize>
        <commitBatchSize>50</commitBatchSize>
      </committer>
    </crawler>
  </crawlers>

</httpcollector>
My bad: I confused the Creation-Date field with the Date field, so everything got filtered out in Kibana. The documents were in the index all along.
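For anyone hitting the same symptom, one way to rule out a Kibana time filter is to ask Elasticsearch for the document count directly. The snippet below is only a rough sketch: it assumes the node URL and index name from the config above (http://127.0.0.1:9200 and norconex) and uses the standard `_count` endpoint. The helper names (`count_url`, `parse_count`, `document_count`) are made up for illustration.

```python
import json
from urllib.request import urlopen


def count_url(base_url, index):
    """Build the URL of the Elasticsearch _count endpoint for one index."""
    return f"{base_url.rstrip('/')}/{index}/_count"


def parse_count(body):
    """Extract the document count from a _count JSON response body."""
    return json.loads(body)["count"]


def document_count(base_url, index):
    """Ask Elasticsearch how many documents the index holds."""
    with urlopen(count_url(base_url, index), timeout=10) as resp:
        return parse_count(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Assumed values, taken from the committer config in this thread.
    print(document_count("http://127.0.0.1:9200", "norconex"))
```

If this prints a non-zero number while Kibana shows nothing, the documents are there and the problem is most likely on the Kibana side, for example the time field chosen for the index pattern or the selected time range.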