Norconex / crawlers

Norconex Crawlers (or spiders) are flexible crawlers for collecting, parsing, and manipulating data from the web or filesystems and storing it in various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Crawler not pulling page content #739

Closed jacksonp2008 closed 3 years ago

jacksonp2008 commented 3 years ago

The crawler doesn't seem to be pulling content for one of my sites. I can see every other field in the data, but not the content. The only material difference between this config and my other working sites is that this one uses the sitemap.

It's a public site, so I don't mind posting the config.

Open to any ideas, many thanks.


<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>

  <!-- This is the transition crawler to the new index -->

<httpcollector id="FS-doc-Collector">
  <logsDir>./forescout/update/docs-output/logs</logsDir>

  <crawlers>
    <!-- you can have multiple crawlers -->
    <crawler id="FS-doc-Crawler">
      <userAgent>"FS HTTP Client"</userAgent>
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>https://docs.forescout.com</url>
      </startURLs>

    <robotsTxt ignore="true"/>
    <workDir>./forescout/update/docs-output</workDir>

    <!-- Put a maximum depth to avoid infinite crawling  -->
    <maxDepth>24</maxDepth>

    <sitemapResolverFactory ignore="false" />

    <!-- Be as nice as you can to sites you crawl. Default=5000 -->
    <delay default="500" />

    <!-- Document Filtering -->
    <documentFilters>
    <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter" onMatch="exclude">
      jpg,jpeg,gif,png
    </filter>
    </documentFilters>

    <!-- Document importing -->
    <importer>
      <preParseHandlers>
      <!-- Pre parsing taggers can go here -->

    <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger" logLevel="INFO" />

      </preParseHandlers>

      <postParseHandlers>
      <!-- post parsing taggers can go here -->
      <!-- Rename fields with a prefix for the search engine, the document can be renamed in the committer -->
      <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
          <restrictTo caseSensitive="false"
                  field="title">
          </restrictTo>
          <rename fromField="title" toField="fs_title" overwrite="true" />
      </tagger>

      <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
          <restrictTo caseSensitive="false"
                  field="document.reference">
          </restrictTo>
          <rename fromField="document.reference" toField="fs_reference" overwrite="true" />
      </tagger>

      <tagger class="com.norconex.importer.handler.tagger.impl.CurrentDateTagger"
        field="@timestamp" format="yyyy-MM-dd'T'HH:mm:ss.SSS'Z'" />

      <tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
        <constant name="search_title">Docs Portal</constant>
      </tagger>

      </postParseHandlers>

    </importer>

    <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
      <!-- elastic dev site -->
      <nodes>https://search-seryryryryryryryryryryryryryryrya.us-east-1.es.amazonaws.com:443</nodes>
      <indexName>docs</indexName>
      <typeName>docs</typeName>
      <targetContentField>fs_content</targetContentField>
      <fixBadIds>true</fixBadIds>
      <queueSize>500</queueSize>
    </committer>

    </crawler>
  </crawlers>
</httpcollector>
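
One way to narrow a problem like this down (a sketch, not part of the original thread): temporarily swap the Elasticsearch committer for the file-based XMLFileCommitter from Norconex Committer Core 2.x, so you can inspect on disk whether the importer extracted any content before it ever reaches the index. The `<directory>` and `<pretty>` option names below are recalled from the 2.x documentation; verify them against your version.

    <!-- Hypothetical debugging step: write commit batches to local XML
         files instead of Elasticsearch, to check whether the document
         content field is empty at commit time. -->
    <committer class="com.norconex.committer.core.impl.XMLFileCommitter">
      <directory>./forescout/update/docs-output/committer-debug</directory>
      <pretty>true</pretty>
    </committer>

If the content is also missing from the XML files, the problem is upstream (fetching or importing), not in the Elasticsearch committer.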
essiembre commented 3 years ago

I had a look at a few pages defined in your sitemap and they are JavaScript-generated. Your options are:
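
(The list of options was not captured in this extract.) For JavaScript-generated pages, Norconex HTTP Collector 3.x added a Selenium-based fetcher that renders pages in a real browser before importing. The following is a hedged sketch only; the class and option names are taken from memory of the v3 documentation and should be verified against your release:

    <!-- Sketch: renders pages in a headless browser so JavaScript-built
         content is visible to the importer. Assumes a matching WebDriver
         binary (e.g., chromedriver) is installed on the crawler host. -->
    <httpFetchers>
      <fetcher class="com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher">
        <browser>chrome</browser>
        <driverPath>/path/to/chromedriver</driverPath>
      </fetcher>
    </httpFetchers>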

jacksonp2008 commented 3 years ago

Thank you, Pascal. Trying 3.0.0.