Norconex / crawlers

Norconex Crawlers (or spiders) are flexible crawlers for collecting, parsing, and manipulating data from the web or filesystems and storing it in various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Crawler not pulling page content #739

Closed jacksonp2008 closed 3 years ago

jacksonp2008 commented 3 years ago

The crawler doesn't seem to be pulling content for one of my sites. I can see every other field in the data, but not the content. The only material difference between this config and my other working sites is that this one uses the sitemap.

It's a public site, so I don't mind posting the config.

Open to any ideas, many thanks.


<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>

  <!-- This is the transition crawler to the new index -->

<httpcollector id="FS-doc-Collector">
  <logsDir>./forescout/update/docs-output/logs</logsDir>

  <crawlers>
    <!-- you can have multiple crawlers -->
    <crawler id="FS-doc-Crawler">
      <userAgent>"FS HTTP Client"</userAgent>
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>https://docs.forescout.com</url>
      </startURLs>

    <robotsTxt ignore="true"/>
    <workDir>./forescout/update/docs-output</workDir>

    <!-- Put a maximum depth to avoid infinite crawling  -->
    <maxDepth>24</maxDepth>

    <sitemapResolverFactory ignore="false" />

    <!-- Be as nice as you can to sites you crawl. Default=5000 -->
    <delay default="500" />

    <!-- Document Filtering -->
    <documentFilters>
    <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter" onMatch="exclude">
      jpg,jpeg,gif,png
    </filter>
    </documentFilters>

    <!-- Document importing -->
    <importer>
      <preParseHandlers>
      <!-- Pre parsing taggers can go here -->

    <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger" logLevel="INFO" />

      </preParseHandlers>

      <postParseHandlers>
      <!-- post parsing taggers can go here -->
      <!-- Rename fields with a prefix for the search engine, the document can be renamed in the committer -->
      <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
          <restrictTo caseSensitive="false"
                  field="title">
          </restrictTo>
          <rename fromField="title" toField="fs_title" overwrite="true" />
      </tagger>

      <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
          <restrictTo caseSensitive="false"
                  field="document.reference">
          </restrictTo>
          <rename fromField="document.reference" toField="fs_reference" overwrite="true" />
      </tagger>

      <tagger class="com.norconex.importer.handler.tagger.impl.CurrentDateTagger"
        field="@timestamp" format="yyyy-MM-dd'T'HH:mm:ss.SSS'Z'" />

      <tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
        <constant name="search_title">Docs Portal</constant>
      </tagger>

      </postParseHandlers>

    </importer>

    <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
      <!-- elastic dev site -->
      <nodes>https://search-seryryryryryryryryryryryryryryrya.us-east-1.es.amazonaws.com:443</nodes>
      <indexName>docs</indexName>
      <typeName>docs</typeName>
      <targetContentField>fs_content</targetContentField>
      <fixBadIds>true</fixBadIds>
      <queueSize>500</queueSize>
    </committer>

    </crawler>
  </crawlers>
</httpcollector>
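
One way to narrow a problem like this down (a sketch, not part of the original thread): temporarily swap the Elasticsearch committer for the file-based XMLFileCommitter from Norconex Committer Core 2.x, so you can inspect on disk whether the importer extracted any content before it ever reaches the index. The `<directory>` and `<pretty>` option names below are recalled from the 2.x documentation; verify them against your version.

    <!-- Hypothetical debugging step: write commit batches to local XML
         files instead of Elasticsearch, to check whether the document
         content field is empty at commit time. -->
    <committer class="com.norconex.committer.core.impl.XMLFileCommitter">
      <directory>./forescout/update/docs-output/committer-debug</directory>
      <pretty>true</pretty>
    </committer>

If the content is also missing from the XML files, the problem is upstream (fetching or importing), not in the Elasticsearch committer.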
essiembre commented 3 years ago

I had a look at a few pages defined in your sitemap and they are JavaScript-generated. Your options are:
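
(The list of options was not captured in this extract.) For JavaScript-generated pages, Norconex HTTP Collector 3.x added a Selenium-based fetcher that renders pages in a real browser before importing. The following is a hedged sketch only; the class and option names are taken from memory of the v3 documentation and should be verified against your release:

    <!-- Sketch: renders pages in a headless browser so JavaScript-built
         content is visible to the importer. Assumes a matching WebDriver
         binary (e.g., chromedriver) is installed on the crawler host. -->
    <httpFetchers>
      <fetcher class="com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher">
        <browser>chrome</browser>
        <driverPath>/path/to/chromedriver</driverPath>
      </fetcher>
    </httpFetchers>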

jacksonp2008 commented 3 years ago

Thank you, Pascal. Trying 3.0.0.