Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

preParse / postParse script filter weirdness #86

Closed Pittiplatsch closed 5 years ago

Pittiplatsch commented 5 years ago

Hello Pascal,

I'm struggling with a weird script filter issue.

Even the most trivial script filter that just returns true results in a REJECTED_IMPORT when used as a postParseHandler, whereas the very same filter works as intended when used as a preParseHandler.

This is my complete <importer> config:

<importer>
  <preParseHandlers>
    <filter
      class="com.norconex.importer.handler.filter.impl.ScriptFilter"
      engineName="javascript"
      onMatch="include"
    >
    <script>
      <![CDATA[
        /*return*/ true;
      ]]></script>
    </filter>
  </preParseHandlers>

  <postParseHandlers>
    <filter
      class="com.norconex.importer.handler.filter.impl.ScriptFilter"
      engineName="javascript"
      onMatch="include"
    >
    <script>
      <![CDATA[
        /*return*/ true;
      ]]></script>
    </filter>
  </postParseHandlers>
</importer>

It seems I'm just missing something obvious here...

Thanks a lot.

essiembre commented 5 years ago

The only way I can reproduce this is if the "content" is empty (which is always considered false/non-matching). If it happens for you in the "postParseHandlers" only, I suspect parsing or one of your other handlers wipes out the content before it reaches the ScriptFilter. You can use the DebugTagger to help you figure that out.
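
The rejection logic described above can be sketched as a toy model. This is not Norconex code: the function names `script_filter_matches` and `import_status` are made up for illustration, and the only behavior taken from this thread is that empty content never matches and that a document is rejected when no onMatch="include" filter matches.

```python
def script_filter_matches(content, script):
    """Toy model of a script filter (hypothetical, not the Norconex API)."""
    # Per the thread: empty content is always treated as non-matching,
    # so the script's own return value never comes into play.
    if not content:
        return False
    return bool(script(content))


def import_status(content, include_scripts):
    """Return "REJECTED" when include filters exist and none of them match."""
    if include_scripts and not any(
        script_filter_matches(content, script) for script in include_scripts
    ):
        return "REJECTED"
    return "IMPORTED"


always_true = lambda content: True

print(import_status("<html>...</html>", [always_true]))  # non-empty content
print(import_status("", [always_true]))                  # emptied content
```

Under this model, a script that always returns true still cannot rescue a document whose content was emptied earlier in the pipeline.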

Pittiplatsch commented 5 years ago

I reduced my configuration to the bare minimum, using a page with obviously non-empty content:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>
<httpcollector id="Test Collector">
  #set($workdir = "./output_debug/")

  <progressDir>${workdir}/progress</progressDir>
  <logsDir>${workdir}/logs</logsDir>

  <crawlers>

    <crawler id="script-test">
      <userAgent>Test user agent</userAgent>
      <workDir>$workdir</workDir>
      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>${workdir}/crawledFiles</directory>
      </committer>
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="false">
        <url>http://www.example.com/</url>
      </startURLs>

      <importer>
        <preParseHandlers>
          <filter
            class="com.norconex.importer.handler.filter.impl.ScriptFilter"
            engineName="javascript"
            onMatch="include"
          >
            <script>
              <![CDATA[
              /*return*/ true;
            ]]></script>
          </filter>
        </preParseHandlers>

        <postParseHandlers>
          <filter
            class="com.norconex.importer.handler.filter.impl.ScriptFilter"
            engineName="javascript"
            onMatch="include"
          >
            <script>
              <![CDATA[
              /*return*/ true;
            ]]></script>
          </filter>
        </postParseHandlers>
      </importer>

    </crawler>
  </crawlers>
</httpcollector>

This results in:

[...]
INFO  [HttpCrawler] 1 start URLs identified.
INFO  [CrawlerEventManager]           CRAWLER_STARTED
INFO  [AbstractCrawler] script-test: Crawling references...
INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: http://www.example.com/
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: http://www.example.com/
INFO  [CrawlerEventManager]            URLS_EXTRACTED: http://www.example.com/
INFO  [CrawlerEventManager]           REJECTED_IMPORT: http://www.example.com/ (ImporterResponse[reference=http://www.example.com/,status=ImporterStatus[status=REJECTED,filter=<null>,exception=<null>,description=None of the filters with onMatch being INCLUDE got matched.],doc=<null>,nestedResponses=[]])
INFO  [AbstractCrawler] script-test: Reprocessing any cached/orphan references...
INFO  [AbstractCrawler] script-test: Crawler finishing: committing documents.
INFO  [AbstractCrawler] script-test: 1 reference(s) processed.
INFO  [CrawlerEventManager]          CRAWLER_FINISHED
INFO  [AbstractCrawler] script-test: Crawler completed.
INFO  [AbstractCrawler] script-test: Crawler executed in 2 seconds.
[...]

When commenting out the entire postParseHandler, the crawl succeeds:

[...]
INFO  [HttpCrawler] 1 start URLs identified.
INFO  [CrawlerEventManager]           CRAWLER_STARTED
INFO  [AbstractCrawler] script-test: Crawling references...
INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: http://www.example.com/
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: http://www.example.com/
INFO  [CrawlerEventManager]            URLS_EXTRACTED: http://www.example.com/
INFO  [CrawlerEventManager]         DOCUMENT_IMPORTED: http://www.example.com/
INFO  [CrawlerEventManager]    DOCUMENT_COMMITTED_ADD: http://www.example.com/
INFO  [AbstractCrawler] script-test: Reprocessing any cached/orphan references...
INFO  [AbstractCrawler] script-test: Crawler finishing: committing documents.
INFO  [AbstractCrawler] script-test: 1 reference(s) processed.
INFO  [CrawlerEventManager]          CRAWLER_FINISHED
INFO  [AbstractCrawler] script-test: Crawler completed.
INFO  [AbstractCrawler] script-test: Crawler executed in 2 seconds.
[...]

and the following resulting meta file 1544559170759000000-add.meta:

#
#Tue Dec 11 20:12:50 UTC 2018
document.contentType=text/html
Date=Tue, 11 Dec 2018 20\:12\:49 GMT
X-Parsed-By=org.apache.tika.parser.DefaultParser^|~org.apache.tika.parser.html.HtmlParser
Content-Location=http\://www.example.com/
Cache-Control=max-age\=604800
Etag="1541025663+gzip"
Content-Encoding=UTF-8
collector.depth=0
collector.is-crawl-new=true
Expires=Tue, 18 Dec 2018 20\:12\:49 GMT
Content-Type=text/html; charset\=UTF-8
Server=ECS (dca/24E0)
Last-Modified=Fri, 09 Aug 2013 23\:54\:35 GMT
document.reference=http\://www.example.com/
collector.content-encoding=UTF-8
Vary=Accept-Encoding
collector.content-type=text/html
document.contentFamily=html
Content-Length=1270
document.contentEncoding=UTF-8
X-Cache=HIT

I completely ran out of ideas...

essiembre commented 5 years ago

Can you share the document that is being rejected?

Also, what do you get if you put this just before your second ScriptFilter?

<tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"
          logContent="true" logLevel="WARN" />

Is any content printed?

Pittiplatsch commented 5 years ago

I put the DebugTagger in twice: once right after the <preParseHandlers> tag, the other right after <postParseHandlers>:


WARN  [DebugTagger] collector.content-type=text/html
WARN  [DebugTagger] X-Cache=HIT
WARN  [DebugTagger] document.contentFamily=html
WARN  [DebugTagger] Server=ECS (dca/532C)
WARN  [DebugTagger] collector.content-encoding=UTF-8
WARN  [DebugTagger] document.contentEncoding=UTF-8
WARN  [DebugTagger] Last-Modified=Fri, 09 Aug 2013 23:54:35 GMT
WARN  [DebugTagger] Date=Wed, 12 Dec 2018 05:30:11 GMT
WARN  [DebugTagger] document.reference=http://www.example.com/
WARN  [DebugTagger] Accept-Ranges=bytes
WARN  [DebugTagger] Cache-Control=max-age=604800
WARN  [DebugTagger] Etag="1541025663+ident"
WARN  [DebugTagger] collector.is-crawl-new=true
WARN  [DebugTagger] document.contentType=text/html
WARN  [DebugTagger] Vary=Accept-Encoding
WARN  [DebugTagger] collector.depth=0
WARN  [DebugTagger] Expires=Wed, 19 Dec 2018 05:30:11 GMT
WARN  [DebugTagger] Content-Type=text/html; charset=UTF-8
WARN  [DebugTagger] CONTENT=<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;

    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 50px;
        background-color: #fff;
        border-radius: 1em;
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        body {
            background-color: #fff;
        }
        div {
            width: auto;
            margin: 0 auto;
            border-radius: 0;
            padding: 1em;
        }
    }
    </style>
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is established to be used for illustrative examples in documents. You may use this
    domain in examples without prior coordination or asking for permission.</p>
    <p><a href="http://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>

WARN  [DebugTagger] collector.content-type=text/html
WARN  [DebugTagger] X-Cache=HIT
WARN  [DebugTagger] document.contentFamily=html
WARN  [DebugTagger] X-Parsed-By=org.apache.tika.parser.DefaultParser, org.apache.tika.parser.html.HtmlParser
WARN  [DebugTagger] Server=ECS (dca/532C)
WARN  [DebugTagger] collector.content-encoding=UTF-8
WARN  [DebugTagger] Content-Location=http://www.example.com/
WARN  [DebugTagger] document.contentEncoding=UTF-8
WARN  [DebugTagger] Last-Modified=Fri, 09 Aug 2013 23:54:35 GMT
WARN  [DebugTagger] Date=Wed, 12 Dec 2018 05:30:11 GMT
WARN  [DebugTagger] document.reference=http://www.example.com/
WARN  [DebugTagger] Accept-Ranges=bytes
WARN  [DebugTagger] Cache-Control=max-age=604800
WARN  [DebugTagger] Etag="1541025663+ident"
WARN  [DebugTagger] collector.is-crawl-new=true
WARN  [DebugTagger] document.contentType=text/html
WARN  [DebugTagger] Content-Encoding=UTF-8
WARN  [DebugTagger] Vary=Accept-Encoding
WARN  [DebugTagger] collector.depth=0
WARN  [DebugTagger] Expires=Wed, 19 Dec 2018 05:30:11 GMT
WARN  [DebugTagger] Content-Length=1270
WARN  [DebugTagger] Content-Type=text/html; charset=UTF-8
WARN  [DebugTagger] CONTENT=

The crawled page is actually http://www.example.com. I also tried some other pages, starting with various production sites, with the same result. That's why I reduced the config to the minimum above.

Your assumption about the content being empty seems right; however, I have no idea where the content could get cleared. Additionally, Content-Length=1270 shows that the fetched page is not empty. I'm not aware of any additional config or automatism... The config file stands alone in its own directory without any *.variables file or similar. Should my minimal config be OK?

Fetching the same page with wget from the very same console succeeds. And, as seen in the log above, the content is available within <preParseHandlers>.
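
When the DebugTagger output gets long, a small script can point out which dump has empty content. The following is a rough sketch that assumes the log layout shown above (a `CONTENT=` line followed by the content, ending at the next line that starts with a log level and a bracketed logger name). The function names are hypothetical and not part of Norconex.

```python
import re

# A new log entry starts with a level and a bracketed logger name (assumed layout).
LOG_LINE = re.compile(r"^(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)\s+\[")
# The DebugTagger line that introduces the document content.
CONTENT_LINE = re.compile(r"^\w+\s+\[DebugTagger\] CONTENT=(.*)$")


def debug_tagger_contents(log_text):
    """Return one content string per DebugTagger CONTENT= entry, in log order."""
    contents = []
    current = None  # lines of the content block being collected, if any
    for line in log_text.splitlines():
        match = CONTENT_LINE.match(line)
        if match:
            if current is not None:
                contents.append("\n".join(current))
            current = [match.group(1)]
        elif current is not None:
            if LOG_LINE.match(line):
                # The next log entry ends the content block.
                contents.append("\n".join(current))
                current = None
            else:
                current.append(line)
    if current is not None:
        contents.append("\n".join(current))
    return contents


def empty_content_dumps(log_text):
    """Return the 0-based positions of DebugTagger dumps whose content is blank."""
    return [
        index
        for index, content in enumerate(debug_tagger_contents(log_text))
        if not content.strip()
    ]
```

With one DebugTagger in preParseHandlers and one in postParseHandlers, a result of `[1]` would mean the content was present before parsing and gone afterwards, which is the situation in the log above.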

Pittiplatsch commented 5 years ago

Addition: I have now used the minimum config example and just added the script filters, with the same result.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>
<httpcollector id="Minimum Config HTTP Collector">

  <!-- Decide where to store generated files. -->
  <progressDir>./output_debug/progress</progressDir>
  <logsDir>./output_debug/logs</logsDir>

  <crawlers>
    <crawler id="Norconex Minimum Test Page">

      <!-- Requires at least one start URL (or urlsFile). 
           Optionally limit crawling to same protocol/domain/port as 
           start URLs. -->
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>https://www.norconex.com/product/collector-http-test/minimum.php</url>
      </startURLs>

      <!-- Specify a crawler default directory where to generate files. -->
      <workDir>./output_debug</workDir>

      <!-- Trust insecure certificates -->
      <!-- §§§ global-worth setting? §§§ -->
      <httpClientFactory>
        <trustAllSSLCertificates>true</trustAllSSLCertificates>
      </httpClientFactory>

      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
      <maxDepth>0</maxDepth>

      <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
      <sitemapResolverFactory ignore="true" />

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="5000" />

      <!-- Document importing -->
      <importer>
        <preParseHandlers>
          <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"
          logContent="true" logLevel="WARN" />

          <filter
            class="com.norconex.importer.handler.filter.impl.ScriptFilter"
            engineName="javascript"
            onMatch="include"
          >
            <script>
              <![CDATA[
              /*return*/ true;
            ]]></script>
          </filter>

        </preParseHandlers>
        <postParseHandlers>
          <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"
          logContent="true" logLevel="WARN" />

          <filter
            class="com.norconex.importer.handler.filter.impl.ScriptFilter"
            engineName="javascript"
            onMatch="include"
          >
            <script>
              <![CDATA[
              /*return*/ true;
            ]]></script>
          </filter>

        </postParseHandlers>
      </importer> 

      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>./output_debug/crawledFiles</directory>
      </committer>

    </crawler>
  </crawlers>

</httpcollector>

user@host:/opt/norconex/current$ rm -r output_debug & clear & ./collector-http.sh -a start -k -c config_debug/minimum-config-simple.xml
[1] 15267
[2] 15268
log4j:WARN No appenders could be found for logger (org.apache.velocity.runtime.log.Log4JLogChute).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Dec 12, 2018 6:47:10 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: JBIG2ImageReader not loaded. jbig2 files will be ignored
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

INFO  [AbstractCollectorConfig] Configuration loaded: id=Minimum Config HTTP Collector; logsDir=./output_debug/logs; progressDir=./output_debug/progress
INFO  [AbstractCollectorLauncher] No XML configuration errors.
INFO  [JobSuite] JEF work directory is: ./output_debug/progress
INFO  [JobSuite] JEF log manager is : FileLogManager
INFO  [JobSuite] JEF job status store is : FileJobStatusStore
INFO  [AbstractCollector] Suite of 1 crawler jobs created.
INFO  [JobSuite] Initialization...
INFO  [JobSuite] No previous execution detected.
INFO  [JobSuite] Starting execution.
INFO  [AbstractCollector] Version: Norconex HTTP Collector 2.8.1 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Collector Core 1.9.1 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Importer 2.9.0 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex JEF 4.1.0 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Committer Core 2.1.0 (Norconex Inc.)
INFO  [JobSuite] Running Norconex Minimum Test Page: BEGIN (Wed Dec 12 06:47:11 UTC 2018)
INFO  [HttpCrawler] Norconex Minimum Test Page: RobotsTxt support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: RobotsMeta support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: Sitemap support: false
INFO  [HttpCrawler] Norconex Minimum Test Page: Canonical links support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: User-Agent: <None specified>
INFO  [GenericHttpClientFactory] SSL: Trusting all certificates.
INFO  [SitemapStore] Norconex Minimum Test Page: Initializing sitemap store...
INFO  [SitemapStore] Norconex Minimum Test Page: Done initializing sitemap store.
INFO  [HttpCrawler] 1 start URLs identified.
INFO  [CrawlerEventManager]           CRAWLER_STARTED
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawling references...
INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: https://www.norconex.com/product/collector-http-test/minimum.php
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: https://www.norconex.com/product/collector-http-test/minimum.php
INFO  [CrawlerEventManager]            URLS_EXTRACTED: https://www.norconex.com/product/collector-http-test/minimum.php
WARN  [DebugTagger] collector.content-type=text/html
WARN  [DebugTagger] document.contentFamily=html
WARN  [DebugTagger] Server=nginx
WARN  [DebugTagger] Connection=keep-alive
WARN  [DebugTagger] MS-Author-Via=DAV
WARN  [DebugTagger] Date=Wed, 12 Dec 2018 06:47:13 GMT
WARN  [DebugTagger] document.reference=https://www.norconex.com/product/collector-http-test/minimum.php
WARN  [DebugTagger] collector.is-crawl-new=true
WARN  [DebugTagger] document.contentType=text/html
WARN  [DebugTagger] collector.depth=0
WARN  [DebugTagger] Content-Length=3592
WARN  [DebugTagger] Content-Type=text/html
WARN  [DebugTagger] X-Powered-By=PleskLin
WARN  [DebugTagger] CONTENT=
<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <meta name="author" content="Norconex Inc." />
    <title>Norconex HTTP Collector Test Page</title>
  </head>
  <body style="background-color: white; font-family:Arial,Verdana">
    <div style="max-width:970px; padding:20px; margin:auto;">

      <p style="color:#666666; font-size: smaller;">
        <b>Congratulations!</b> If you read this text from your target repository (e.g. file system, search engine, ...)
        it means that you successfully ran the Norconex HTTP Collector <b style="color:red;">minimum</b> example.
      </p>

            <div style="text-align: center;">
        <img src="http://www.norconex.com/collectors/img/collector-http.png" alt="" />
        <h1>Norconex HTTP Collector Test Page</h1>
      </div>

      <div style="border: 1px solid #dddddd; background-color: #f6f6f6; padding:20px;">
            <h3>We are excited that you are trying the Norconex HTTP Collector.</h3>
            <p>This standalone web page was created to help you test your installation is running properly.</p>
            <p>Once you're done working with this document, make sure to familiarize yourself with the many configuration options available to you
               on the Norconex HTTP Collector web site:</p>
            <p style="font-size:1.2em; font-family: monospace;">
              <script>document.write('<a h' + 'ref="http' + '://www.norconex' + '.com/collectors/collector-http/configuration' + '">');</script>
              http://www.norconex.com/collectors/collector-http/configuration
              <script>document.write('</a>');</script>
            </p>
      </div>

      <div>
        <h2>The Next Steps</h2>
        <p>The next logical step is probably to put in a different URL to crawl in the <code>startURLs</code> section of your configuration.
           The process of changing the start URL is an easy 2 steps process.</p>
        <p>First step: modify the URL between the following tags<p>
<pre style="color: #6666aa;">
  &lt;startURLs&gt;
    &lt;url&gt;<b>http://www.YourOwnUrl.com/</b>&lt;/url&gt;
  &lt;/startURLs&gt;
</pre>

        <p>Second step: Add or update regular expression to let the crawler know which URL patterns you are now accepting.</p>
<pre style="color: #6666aa;">
  &lt;referenceFilters&gt;
    &lt;filter class=&quot;com.norconex.collector.core.filter.impl.RegexReferenceFilter&quot; onMatch=&quot;include&quot;&gt;
      <b>http://www.YourOwnUrl.com/onlyThisSubset/.*</b>
    &lt;/filter&gt;
  &lt;/referenceFilters&gt;
 </pre>
      </div>

      <div style="margin-bottom: 50px;">
        <h2>Now What?</h2>
        <p>There obviously are tons of options available to you now.  You probably want to crawl more than one page,
        filter out some files such as CSS or Javascript, and much more.  You also want to install a "Committer" for your
        search engine (or other target repository). Learn how to do all this and more magic using the Norconex HTTP Collector
        site documentation (above URL).
        </p>
      </div>
      <div style="font-size: 28px; text-align: center; margin-bottom: 50px;">
        Thank you for using Norconex HTTP Collector!
      </div>
      <hr size="1">
      <div style="margin-bottom: 200px;">
        <p style="float:left; color:#999999;">Copyright © 2009-2014 Norconex Inc.. All Rights Reserved.</p>
        <p style="float:right;"><img src="http://www.norconex.com/collectors/img/norconex-logo-blue-241x51.png"></p>
      </div>
    </div>

    </div>

  </body>
</html>

WARN  [DebugTagger] collector.content-type=text/html
WARN  [DebugTagger] document.contentFamily=html
WARN  [DebugTagger] X-Parsed-By=org.apache.tika.parser.DefaultParser, org.apache.tika.parser.html.HtmlParser
WARN  [DebugTagger] Server=nginx
WARN  [DebugTagger] Content-Location=https://www.norconex.com/product/collector-http-test/minimum.php
WARN  [DebugTagger] Connection=keep-alive
WARN  [DebugTagger] MS-Author-Via=DAV
WARN  [DebugTagger] Date=Wed, 12 Dec 2018 06:47:13 GMT
WARN  [DebugTagger] document.reference=https://www.norconex.com/product/collector-http-test/minimum.php
WARN  [DebugTagger] collector.is-crawl-new=true
WARN  [DebugTagger] document.contentType=text/html
WARN  [DebugTagger] Content-Encoding=UTF-8
WARN  [DebugTagger] collector.depth=0
WARN  [DebugTagger] Content-Length=3592
WARN  [DebugTagger] Content-Type=text/html, text/html; charset=UTF-8
WARN  [DebugTagger] X-Powered-By=PleskLin
WARN  [DebugTagger] CONTENT=
INFO  [CrawlerEventManager]           REJECTED_IMPORT: https://www.norconex.com/product/collector-http-test/minimum.php (ImporterResponse[reference=https://www.norconex.com/product/collector-http-test/minimum.php,status=ImporterStatus[status=REJECTED,filter=<null>,exception=<null>,description=None of the filters with onMatch being INCLUDE got matched.],doc=<null>,nestedResponses=[]])
INFO  [AbstractCrawler] Norconex Minimum Test Page: Reprocessing any cached/orphan references...
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawler finishing: committing documents.
INFO  [AbstractCrawler] Norconex Minimum Test Page: 1 reference(s) processed.
INFO  [CrawlerEventManager]          CRAWLER_FINISHED
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawler completed.
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawler executed in 3 seconds.
INFO  [SitemapStore] Norconex Minimum Test Page: Closing sitemap store...
INFO  [JobSuite] Running Norconex Minimum Test Page: END (Wed Dec 12 06:47:11 UTC 2018)
[1]-  Done                    rm -r output_debug
[2]+  Done                    clear

essiembre commented 5 years ago

I have to admit I am scratching my head about that one. I used the exact same config as you (only the paths are different). It works just fine:

INFO  [AbstractCollectorConfig] Configuration loaded: id=Minimum Config HTTP Collector; logsDir=./workdir/importer-issue86/logs; progressDir=./workdir/importer-issue86/progress
INFO  [JobSuite] JEF work directory is: .\workdir\importer-issue86\progress
INFO  [JobSuite] JEF log manager is : FileLogManager
INFO  [JobSuite] JEF job status store is : FileJobStatusStore
INFO  [AbstractCollector] Suite of 1 crawler jobs created.
INFO  [JobSuite] Initialization...
INFO  [JobSuite] No previous execution detected.
INFO  [JobSuite] Starting execution.
INFO  [AbstractCollector] Version: "Collector" version is undefined.
INFO  [AbstractCollector] Version: "Collector Core" version is undefined.
INFO  [AbstractCollector] Version: Norconex Importer 2.9.0 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex JEF 4.1.0 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Committer Core 2.1.2 (Norconex Inc.)
INFO  [JobSuite] Running Norconex Minimum Test Page: BEGIN (Thu Dec 13 23:55:00 EST 2018)
INFO  [HttpCrawler] Norconex Minimum Test Page: RobotsTxt support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: RobotsMeta support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: Sitemap support: false
INFO  [HttpCrawler] Norconex Minimum Test Page: Canonical links support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: User-Agent: <None specified>
INFO  [GenericHttpClientFactory] SSL: Trusting all certificates.
INFO  [SitemapStore] Norconex Minimum Test Page: Initializing sitemap store...
INFO  [SitemapStore] Norconex Minimum Test Page: Done initializing sitemap store.
INFO  [HttpCrawler] 1 start URLs identified.
INFO  [CrawlerEventManager]           CRAWLER_STARTED
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawling references...
WARN  [DebugTagger] collector.content-type=text/html
WARN  [DebugTagger] document.contentFamily=html
WARN  [DebugTagger] Server=nginx
WARN  [DebugTagger] Connection=keep-alive
WARN  [DebugTagger] MS-Author-Via=DAV
WARN  [DebugTagger] Date=Fri, 14 Dec 2018 04:55:02 GMT
WARN  [DebugTagger] document.reference=https://www.norconex.com/product/collector-http-test/minimum.php
WARN  [DebugTagger] collector.is-crawl-new=true
WARN  [DebugTagger] document.contentType=text/html
WARN  [DebugTagger] collector.depth=0
WARN  [DebugTagger] Content-Length=3592
WARN  [DebugTagger] Content-Type=text/html
WARN  [DebugTagger] X-Powered-By=PleskLin
WARN  [DebugTagger] CONTENT=
<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <meta name="author" content="Norconex Inc." />
    <title>Norconex HTTP Collector Test Page</title>
  </head>
  <body style="background-color: white; font-family:Arial,Verdana">
    <div style="max-width:970px; padding:20px; margin:auto;">

      <p style="color:#666666; font-size: smaller;">
        <b>Congratulations!</b> If you read this text from your target repository (e.g. file system, search engine, ...)
        it means that you successfully ran the Norconex HTTP Collector <b style="color:red;">minimum</b> example.
      </p>

            <div style="text-align: center;">
        <img src="http://www.norconex.com/collectors/img/collector-http.png" alt="" /> 
        <h1>Norconex HTTP Collector Test Page</h1>
      </div>

      <div style="border: 1px solid #dddddd; background-color: #f6f6f6; padding:20px;">
            <h3>We are excited that you are trying the Norconex HTTP Collector.</h3>
            <p>This standalone web page was created to help you test your installation is running properly.</p>
            <p>Once you're done working with this document, make sure to familiarize yourself with the many configuration options available to you
               on the Norconex HTTP Collector web site:</p>
            <p style="font-size:1.2em; font-family: monospace;">
              <script>document.write('<a h' + 'ref="http' + '://www.norconex' + '.com/collectors/collector-http/configuration' + '">');</script>
              http://www.norconex.com/collectors/collector-http/configuration
              <script>document.write('</a>');</script>
            </p>
      </div>

      <div>
        <h2>The Next Steps</h2>
        <p>The next logical step is probably to put in a different URL to crawl in the <code>startURLs</code> section of your configuration. 
           The process of changing the start URL is an easy 2 steps process.</p>
        <p>First step: modify the URL between the following tags<p>
<pre style="color: #6666aa;">
  &lt;startURLs&gt;
    &lt;url&gt;<b>http://www.YourOwnUrl.com/</b>&lt;/url&gt;
  &lt;/startURLs&gt;
</pre>

        <p>Second step: Add or update regular expression to let the crawler know which URL patterns you are now accepting.</p>
<pre style="color: #6666aa;">
  &lt;referenceFilters&gt;
    &lt;filter class=&quot;com.norconex.collector.core.filter.impl.RegexReferenceFilter&quot; onMatch=&quot;include&quot;&gt;
      <b>http://www.YourOwnUrl.com/onlyThisSubset/.*</b>
    &lt;/filter&gt;
  &lt;/referenceFilters&gt;
 </pre>
      </div>

      <div style="margin-bottom: 50px;">
        <h2>Now What?</h2>
        <p>There obviously are tons of options available to you now.  You probably want to crawl more than one page, 
        filter out some files such as CSS or Javascript, and much more.  You also want to install a "Committer" for your
        search engine (or other target repository). Learn how to do all this and more magic using the Norconex HTTP Collector
        site documentation (above URL).
        </p>
      </div>
      <div style="font-size: 28px; text-align: center; margin-bottom: 50px;">
        Thank you for using Norconex HTTP Collector!
      </div>
      <hr size="1">
      <div style="margin-bottom: 200px;">
        <p style="float:left; color:#999999;">Copyright © 2009-2014 Norconex Inc.. All Rights Reserved.</p>
        <p style="float:right;"><img src="http://www.norconex.com/collectors/img/norconex-logo-blue-241x51.png"></p>
      </div>
    </div>

    </div>

  </body>
</html>

WARN  [DebugTagger] collector.content-type=text/html
WARN  [DebugTagger] document.contentFamily=html
WARN  [DebugTagger] X-Parsed-By=org.apache.tika.parser.DefaultParser, org.apache.tika.parser.html.HtmlParser
WARN  [DebugTagger] Server=nginx
WARN  [DebugTagger] Content-Location=https://www.norconex.com/product/collector-http-test/minimum.php
WARN  [DebugTagger] author=Norconex Inc.
WARN  [DebugTagger] Connection=keep-alive
WARN  [DebugTagger] MS-Author-Via=DAV
WARN  [DebugTagger] title=Norconex HTTP Collector Test Page
WARN  [DebugTagger] Date=Fri, 14 Dec 2018 04:55:02 GMT
WARN  [DebugTagger] document.reference=https://www.norconex.com/product/collector-http-test/minimum.php
WARN  [DebugTagger] dc:title=Norconex HTTP Collector Test Page
WARN  [DebugTagger] collector.is-crawl-new=true
WARN  [DebugTagger] document.contentType=text/html
WARN  [DebugTagger] Content-Encoding=UTF-8
WARN  [DebugTagger] collector.depth=0
WARN  [DebugTagger] Content-Length=3592
WARN  [DebugTagger] Content-Type=text/html, text/html; charset=UTF-8
WARN  [DebugTagger] X-Powered-By=PleskLin
WARN  [DebugTagger] CONTENT=

        Congratulations! If you read this text from your target repository (e.g. file system, search engine, ...)
        it means that you successfully ran the Norconex HTTP Collector minimum example.

        Norconex HTTP Collector Test Page

            We are excited that you are trying the Norconex HTTP Collector.

            This standalone web page was created to help you test your installation is running properly.

            Once you're done working with this document, make sure to familiarize yourself with the many configuration options available to you
               on the Norconex HTTP Collector web site:

              http://www.norconex.com/collectors/collector-http/configuration

        The Next Steps

        The next logical step is probably to put in a different URL to crawl in the startURLs section of your configuration. 
           The process of changing the start URL is an easy 2 steps process.

        First step: modify the URL between the following tags

  <startURLs>
    <url>http://www.YourOwnUrl.com/</url>
  </startURLs>

        Second step: Add or update regular expression to let the crawler know which URL patterns you are now accepting.

  <referenceFilters>
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
      http://www.YourOwnUrl.com/onlyThisSubset/.*
    </filter>
  </referenceFilters>

        Now What?

        There obviously are tons of options available to you now.  You probably want to crawl more than one page, 
        filter out some files such as CSS or Javascript, and much more.  You also want to install a "Committer" for your
        search engine (or other target repository). Learn how to do all this and more magic using the Norconex HTTP Collector
        site documentation (above URL).

        Thank you for using Norconex HTTP Collector!

        Copyright © 2009-2014 Norconex Inc.. All Rights Reserved.

INFO  [CrawlerEventManager]    DOCUMENT_COMMITTED_ADD: https://www.norconex.com/product/collector-http-test/minimum.php
INFO  [AbstractCrawler] Norconex Minimum Test Page: Reprocessing any cached/orphan references...
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawler finishing: committing documents.
INFO  [AbstractCrawler] Norconex Minimum Test Page: 1 reference(s) processed.
INFO  [CrawlerEventManager]          CRAWLER_FINISHED
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawler completed.
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawler executed in 4 seconds.
INFO  [SitemapStore] Norconex Minimum Test Page: Closing sitemap store...
INFO  [JobSuite] Running Norconex Minimum Test Page: END (Thu Dec 13 23:55:00 EST 2018)

The only thing I can think of is that the parser wipes out the file content in your case, but I'm not sure why, since we are using the same parser. To test that theory, can you try skipping document parsing and see if it still fails? Something like this in your <importer> section:

  <documentParserFactory class="com.norconex.importer.parser.GenericDocumentParserFactory">
      <ignoredContentTypes>.*</ignoredContentTypes>
  </documentParserFactory>

Could it be a permission issue? If the above confirms the theory, I recommend you set the log level in log4j.properties to TRACE for everything. You will get tons of output, but see if anything suspicious comes up.

Pittiplatsch commented 5 years ago

Phew, that was a brain twister. Skipping the parser actually preserves the content. That result raised the idea of an inconsistent installation. So I "reinstalled" the Norconex stack into another, independent directory and copied over the overly trivial config example. Know what? It works 👍

Further investigation revealed that the lib directories differed a little bit (lib -> failing version, lib.orig -> working version):

Only in lib: commons-lang3-3.5.jar
Only in lib.orig/: commons-lang3-3.6.jar
Only in lib: httpclient-4.5.2.jar
Only in lib.orig/: httpclient-4.5.3.jar
Only in lib: httpcore-4.4.5.jar
Only in lib.orig/: httpcore-4.4.6.jar
Only in lib: httpmime-4.5.2.jar
Only in lib.orig/: httpmime-4.5.3.jar
Only in lib.orig/: jcl-over-slf4j-1.7.24.jar
Only in lib: jcl-over-slf4j-1.7.7.jar
Only in lib.orig/: json-1.8.jar.bak-20181214080137
Only in lib: noggit-0.6.jar.bak-20181007195814
Only in lib: norconex-committer-core-2.1.0.jar
Only in lib.orig/: norconex-committer-core-2.1.2.jar
Only in lib: norconex-commons-lang-1.13.0.jar
Only in lib.orig/: norconex-commons-lang-1.15.0.jar
Only in lib: slf4j-api-1.7.12.jar
Only in lib.orig/: slf4j-api-1.7.24.jar

Right from my start with Norconex, I additionally installed the Solr committer v2.3.0.

Long story short: Solr committer 2.3.0 contains older versions of some files than the plain HTTP collector 2.8.1, and obviously (perhaps as a beginner's mistake) I didn't follow the recommended installation option of going with the highest version.

Sorry for the hassle :-( and thank you for your efforts.
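
A lib directory mix-up like the one above can be checked with a short script that compares jar versions between two directories. This is only a sketch: it assumes jar files are named `<artifact>-<version>.jar`, compares versions by their numeric parts alone, and uses made-up function names. The `lib` and `lib.orig` arguments mirror the directory names from the diff.

```python
import re
import sys
from pathlib import Path

# Assumed naming scheme: <artifact>-<version>.jar, version starting with a digit.
JAR_PATTERN = re.compile(r"(?P<name>.+?)-(?P<version>\d[\w.\-]*)\.jar")


def split_jar_name(filename):
    """Split 'slf4j-api-1.7.12.jar' into ('slf4j-api', '1.7.12'); None if no match."""
    match = JAR_PATTERN.fullmatch(filename)
    return (match.group("name"), match.group("version")) if match else None


def version_key(version):
    """Sort key built from the numeric parts only (qualifiers are ignored)."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def index_jars(filenames):
    """Map each artifact name to the set of versions found for it."""
    index = {}
    for filename in filenames:
        parsed = split_jar_name(filename)
        if parsed:
            name, version = parsed
            index.setdefault(name, set()).add(version)
    return index


def older_jars(current, reference):
    """Return {artifact: (current_version, reference_version)} where current is older."""
    current_index, reference_index = index_jars(current), index_jars(reference)
    result = {}
    for name in sorted(current_index.keys() & reference_index.keys()):
        newest_current = max(current_index[name], key=version_key)
        newest_reference = max(reference_index[name], key=version_key)
        if version_key(newest_current) < version_key(newest_reference):
            result[name] = (newest_current, newest_reference)
    return result


def list_files(directory):
    """List plain file names in a directory."""
    return [path.name for path in Path(directory).iterdir() if path.is_file()]


if __name__ == "__main__" and len(sys.argv) == 3:
    # Example: python compare_libs.py lib lib.orig  (script name is hypothetical)
    findings = older_jars(list_files(sys.argv[1]), list_files(sys.argv[2]))
    for name, (old, new) in findings.items():
        print(f"{name}: {old} is older than {new}")
```

Backup files such as `json-1.8.jar.bak-20181214080137` do not match the assumed naming scheme and are skipped.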