Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Committing Only Different File Type #377

Closed zgjonbalaj closed 6 years ago

zgjonbalaj commented 7 years ago

I am having issues isolating different crawlers to different types of documents so I can commit them to Elasticsearch. I want to use a different crawler for PDF, XML, HTML, images, etc. What I would like to do is crawl all HTML pages and commit those under one index, crawl all PDF documents and store them under another index, and so on. For some reason the following configuration is not working. Can anyone help?

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>

<httpcollector id="Wallach Configuration">

  <!-- Decide where to store generated files. -->
  <progressDir>./crawled-sites/example.com/progress</progressDir>
  <logsDir>./crawled-sites/example.com/logs</logsDir>

  <crawlerDefaults>

    <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
      <url>http://www.example.com</url>
    </startURLs>

    <userAgent>ElasticSearch Crawler</userAgent>
    <numThreads>4</numThreads>
    <maxDepth>10</maxDepth>
    <sitemapResolverFactory ignore="false" />
    <delay default="250" />
    <workDir>./crawled-sites/example.com</workDir>

  </crawlerDefaults>

  <crawlers>
    <crawler id="Eample Website HTML Crawler">

      <referenceFilters>
        <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter" onMatch="exclude">jpg,gif,png,ico,css,js,pdf,xml,doc,docx,txt</filter>
      </referenceFilters>

      <workDir>./crawled-sites/example.com_html</workDir>

      <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
        <nodes>*:9200</nodes>
        <indexName>wallach-html</indexName>
        <typeName>HTML</typeName>
        <ignoreResponseErrors>false</ignoreResponseErrors>
        <discoverNodes>false</discoverNodes>
        <dotReplacement></dotReplacement>
        <username>elastic</username>
        <password>changeme</password>
        <!-- <sourceReferenceField keep="true">sourceRef</sourceReferenceField> -->
        <!-- <sourceContentField keep="true">sourceContent</sourceContentField> -->
        <targetContentField>content</targetContentField>
        <queueDir>./committer-queue</queueDir>
        <queueSize>1000</queueSize>
        <commitBatchSize>100</commitBatchSize>
        <maxRetries>0</maxRetries>
        <maxRetryWait>0</maxRetryWait>
      </committer>

    </crawler>

    <crawler id="Eample Website PDF Crawler">

      <referenceFilters>
        <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter">pdf</filter>
      </referenceFilters>

      <workDir>./crawled-sites/example.com_pdf</workDir>

      <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
        <nodes>*:9200</nodes>
        <indexName>wallach-pdf</indexName>
        <typeName>PDF</typeName>
        <ignoreResponseErrors>false</ignoreResponseErrors>
        <discoverNodes>false</discoverNodes>
        <dotReplacement></dotReplacement>
        <username>elastic</username>
        <password>changeme</password>
        <!-- <sourceReferenceField keep="true">sourceRef</sourceReferenceField> -->
        <!-- <sourceContentField keep="true">sourceContent</sourceContentField> -->
        <targetContentField>content</targetContentField>
        <queueDir>./committer-queue</queueDir>
        <queueSize>1000</queueSize>
        <commitBatchSize>100</commitBatchSize>
        <maxRetries>0</maxRetries>
        <maxRetryWait>0</maxRetryWait>
      </committer>
    </crawler>

    <crawler id="Eample Website Word Document Crawler">

      <referenceFilters>
        <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter">doc,docx</filter>
      </referenceFilters>

      <workDir>./crawled-sites/example.com_documents</workDir>

      <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
        <nodes>*:9200</nodes>
        <indexName>wallach-word-documents</indexName>
        <typeName>Word Document</typeName>
        <ignoreResponseErrors>false</ignoreResponseErrors>
        <discoverNodes>false</discoverNodes>
        <dotReplacement></dotReplacement>
        <username>elastic</username>
        <password>changeme</password>
        <!-- <sourceReferenceField keep="true">sourceRef</sourceReferenceField> -->
        <!-- <sourceContentField keep="true">sourceContent</sourceContentField> -->
        <targetContentField>content</targetContentField>
        <queueDir>./committer-queue</queueDir>
        <queueSize>1000</queueSize>
        <commitBatchSize>100</commitBatchSize>
        <maxRetries>0</maxRetries>
        <maxRetryWait>0</maxRetryWait>
      </committer>
    </crawler>

    <crawler id="Eample Website XML Crawler">

      <referenceFilters>
        <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter">xml</filter>
      </referenceFilters>

      <workDir>./crawled-sites/example.com_xml</workDir>

      <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
        <nodes>*:9200</nodes>
        <indexName>wallach-xml</indexName>
        <typeName>XML</typeName>
        <ignoreResponseErrors>false</ignoreResponseErrors>
        <discoverNodes>false</discoverNodes>
        <dotReplacement></dotReplacement>
        <username>elastic</username>
        <password>changeme</password>
        <!-- <sourceReferenceField keep="true">sourceRef</sourceReferenceField> -->
        <!-- <sourceContentField keep="true">sourceContent</sourceContentField> -->
        <targetContentField>content</targetContentField>
        <queueDir>./committer-queue</queueDir>
        <queueSize>1000</queueSize>
        <commitBatchSize>100</commitBatchSize>
        <maxRetries>0</maxRetries>
        <maxRetryWait>0</maxRetryWait>
      </committer>
    </crawler>

    <crawler id="Eample Website Image Crawler">

      <referenceFilters>
        <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter">jpg,gif,png,ico</filter>
      </referenceFilters>

      <workDir>./crawled-sites/example.com_images</workDir>

      <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
        <nodes>*:9200</nodes>
        <indexName>wallach-images</indexName>
        <typeName>Image</typeName>
        <ignoreResponseErrors>false</ignoreResponseErrors>
        <discoverNodes>false</discoverNodes>
        <dotReplacement></dotReplacement>
        <username>elastic</username>
        <password>changeme</password>
        <!-- <sourceReferenceField keep="true">sourceRef</sourceReferenceField> -->
        <!-- <sourceContentField keep="true">sourceContent</sourceContentField> -->
        <targetContentField>content</targetContentField>
        <queueDir>./committer-queue</queueDir>
        <queueSize>1000</queueSize>
        <commitBatchSize>100</commitBatchSize>
        <maxRetries>0</maxRetries>
        <maxRetryWait>0</maxRetryWait>
      </committer>
    </crawler>

  </crawlers>
</httpcollector>
zgjonbalaj commented 7 years ago

I've managed to get the right files crawled, but I'm still getting this error for all PDFs when committing to a new index:

{ "_index": "website-pdf", "_type": "HTML", "_id": "http://www.***.com/site/assets/files/1653/***.pdf", "error": { "reason": "mapper [Date] of different type, current_type [date], merged_type [text]", "type": "illegal_argument_exception" }, "status": 400 },

essiembre commented 7 years ago

This probably occurs because you have a field defined as a date in Elasticsearch, and a document is passing a string for that field that is not recognized as a date. Either create a new index with that field mapped as a string, or format your date fields to match a date pattern expected by Elasticsearch. You can have a look at DateFormatTagger from the Importer module as an option.
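
For illustration, here is a rough sketch of what such a post-parse handler could look like, based on the Importer 2.x DateFormatTagger syntax; the field name and date formats below are examples only, not taken from this issue:

<importer>
  <postParseHandlers>
    <!-- Illustrative only: reformat an HTTP-style date into a pattern
         Elasticsearch recognizes. Adjust the field name and formats as needed. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.DateFormatTagger"
        fromField="Last-Modified" toField="Last-Modified"
        toFormat="yyyy-MM-dd'T'HH:mm:ss'Z'" overwrite="true">
      <fromFormat>EEE, dd MMM yyyy HH:mm:ss zzz</fromFormat>
    </tagger>
  </postParseHandlers>
</importer>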

essiembre commented 7 years ago

@zgjonbalaj, did you resolve your date issue? Can we close?

lemmikens commented 6 years ago

Just to further elaborate on this in case anyone runs into this issue: create your index with specific mappings before committing to Elasticsearch, like so:

curl -XPUT 'localhost:9200/index-name?pretty' -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "my_type": {
      "properties": {
        "Date": {
          "type": "text"
        },
        "Last-Modified": {
          "type": "text"
        }
      }
    }
  }
}
'

Where "my_type" is the "typeName" and "index-name" is the "indexName" from your XML config.
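
If you want to verify the result, you can inspect the mapping afterwards (replace index-name with the index you created):

curl -XGET 'localhost:9200/index-name/_mapping?pretty'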

wolverline commented 6 years ago

I've faced the same issue when committing PDF files. I think it comes from the following date differences. The crawler collects only one date field from HTML pages, but there is an additional date field for PDFs. Because of this, Elasticsearch treats the date as an array? The strange part is that some PDFs go through the committer without any issues. Is there any way I can drop one of the values (leave only one record)?

Last-Modified = Tue, 23 Jan 2018 01:23:07 GMT
Last-Modified = 2017-12-12T17:50:08Z
...
Date = Tue, 23 Jan 2018 17:49:35 GMT
Date = 2017-12-12T17:50:08Z
essiembre commented 6 years ago

@wolverline: Definitely, have a look at KeepOnlyTagger, to be used late in your post-parse handlers.
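
As a rough sketch, assuming the Importer 2.x syntax (the field list below is illustrative, not from this issue):

<importer>
  <postParseHandlers>
    <!-- Illustrative only: keep just the listed fields and drop all others. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
      <fields>title, description, content, Last-Modified</fields>
    </tagger>
  </postParseHandlers>
</importer>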

wolverline commented 6 years ago

@essiembre: My question was how to remove one of the dates, not one of the tags. In Elasticsearch, the date field looks like the following because of the differences. For PDFs:

"Date": [
  "Thu, 01 Feb 2018 22:40:27 GMT",
  "2017-12-12T17:50:08Z"
],

For HTML pages:

"Date": "Thu, 01 Feb 2018 22:29:13 GMT",

I replaced the checksummer with one using combined fields, but I am not sure which fields produce more accurate results.

<documentChecksummer class="$MD5Checksummer">
  <sourceFields>content, title, description</sourceFields>
  <!-- Or, this field?
  <sourceFields>Last-Modified</sourceFields>
  -->
</documentChecksummer>
zgjonbalaj commented 6 years ago

@essiembre I apologize for the delay. Yes, the issue was resolved.

essiembre commented 6 years ago

@wolverline, it looks like I gave you the wrong one then. Please try with: ForceSingleValueTagger.

You probably have two dates because there is one in your HTML and one in the HTTP response headers. As an alternative, you can also configure the metadataFetcher in your collector config to add a headers prefix. This will differentiate the fields coming from the HTTP headers from the other ones discovered.
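
For example, something along these lines, assuming the GenericMetadataFetcher options of the 2.x HTTP Collector (the prefix value is just an example):

<metadataFetcher class="com.norconex.collector.http.fetch.impl.GenericMetadataFetcher">
  <!-- Example prefix: fields obtained from HTTP response headers would be
       stored as http.headers.Date, http.headers.Last-Modified, etc. -->
  <headersPrefix>http.headers.</headersPrefix>
</metadataFetcher>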

Please confirm.

wolverline commented 6 years ago

@essiembre I was going to add my comment here about ForceSingleValueTagger before you answered. Yes, it resolved the issue. Thanks.

<importer>
  <postParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.ForceSingleValueTagger">
       <singleValue field="Date" action="keepFirst" />
       <singleValue field="Last-Modified" action="keepFirst" />
    </tagger>
  </postParseHandlers>
</importer>
essiembre commented 6 years ago

Thanks for confirming.