Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

StripBetweenTransformer parsing too literally? #78

Closed kengher closed 6 years ago

kengher commented 6 years ago

Hello! In reference to #370, I am trying to eliminate the MENU section of my HTML code; however, I am experiencing issues using the example provided in the documentation:

The following will strip all text between (and including) these two HTML comments: <!-- SIDENAV_START --> and <!-- SIDENAV_END -->.

  <transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer"
          inclusive="true" >
      <stripBetween>
          <start><![CDATA[<!-- SIDENAV_START -->]]></start>
          <end><![CDATA[<!-- SIDENAV_END -->]]></end>
      </stripBetween>
  </transformer>

My HTML:

<!-- SIDENAV_START -->
      <div class='menu'>
           <a>Home</a>
           <a>About</a>
           ...
      </div>
<!-- SIDENAV_END -->

The data between <!-- SIDENAV_START --> and <!-- SIDENAV_END --> is still passing through. Now, when I type this literal text into my HTML...

<![CDATA[<!-- SIDENAV_START -->]]>
      <div class='menu'>
           <a>Home</a>
           <a>About</a>
           ...
      </div>
<![CDATA[<!-- SIDENAV_END -->]]>

...it works. "Home" and "About" are now ignored, except that <![CDATA[ is now rendered as text in the browser before and after the menu.

It is as if all HTML tags have been omitted before the Importer gets the chance to regex the data.

Does the Importer/Transformer have to be set in a strict order before something else happens? Does <![CDATA[<!-- SIDENAV_START -->]]> not work in the latest version?

I am using HTTP Collector v2.8.0 + Elasticsearch Committer v4.1.0. Here is my config:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>
<!-- 
   Copyright 2010-2017 Norconex Inc.

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<!-- This configuration shows the minimum required and basic recommendations
     to run a crawler.  
     -->
<httpcollector id="Del Rey Config HTTP Collector">
  #set($http = "com.norconex.collector.http")
  #set($core = "com.norconex.collector.core")
  #set($urlNormalizer   = "${http}.url.impl.GenericURLNormalizer")
  #set($filterExtension = "${core}.filter.impl.ExtensionReferenceFilter")
  #set($filterRegexRef  = "${core}.filter.impl.RegexReferenceFilter")

  <!-- Decide where to store generated files. -->
  <progressDir>./examples-output/ulti/progress</progressDir>
  <logsDir>./examples-output/ulti/logs</logsDir>

  <crawlers>
    <crawler id="Ulti">

      <!-- Requires at least one start URL (or urlsFile). 
           Optionally limit crawling to same protocol/domain/port as 
           start URLs. -->
      <startURLs stayOnDomain="false" stayOnPort="true" stayOnProtocol="true">
        <url>http://localhost/</url>
      </startURLs>

      <userAgent>Norconex</userAgent>

      <referenceFilters>
        <filter class="$filterExtension" onMatch="exclude">jpg,gif,png,ico,css,js</filter>

      </referenceFilters>

      <!-- === Recommendations: ============================================ -->

      <!-- Specify a crawler default directory where to generate files. -->
      <workDir>./examples-output/ulti</workDir>

      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
      <maxDepth>3</maxDepth>

      <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
      <sitemapResolverFactory ignore="true" />

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="5000" />

      <!-- Document importing -->
      <importer>
        <postParseHandlers>
          <!-- If your target repository does not support arbitrary fields,
               make sure you only keep the fields you need. -->
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>title,keywords,description,body,document.reference</fields>
          </tagger>

          <transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer" inclusive="true">
            <stripBetween>
              <start><![CDATA[<!-- SIDENAV_START -->]]></start>
              <end><![CDATA[<!-- SIDENAV_END -->]]></end>
            </stripBetween>
          </transformer>

        </postParseHandlers>
      </importer> 

      <!-- Decide what to do with your files by specifying a Committer. -->
    <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
      <nodes>http://localhost:9200</nodes>
      <indexName>ulti</indexName>
      <typeName>page</typeName>
      <ignoreResponseErrors>false</ignoreResponseErrors>
      <discoverNodes>false</discoverNodes>
      <connectionTimeout>1000</connectionTimeout>
      <socketTimeout>30000</socketTimeout>
      <maxRetryTimeout>30000</maxRetryTimeout>
      <fixBadIds>false</fixBadIds>
      <queueDir>./commiter-queue</queueDir>
      <queueSize>1000</queueSize>
      <commitBatchSize>100</commitBatchSize>
      <maxRetries>0</maxRetries>
      <maxRetryWait>0</maxRetryWait>
    </committer>

    </crawler>
  </crawlers>

</httpcollector>
essiembre commented 6 years ago

I can see from your config that you have defined your StripBetweenTransformer under postParseHandlers. That is likely your problem: once your HTML gets parsed by the Importer, you only have plain text left (the markup is gone), so there are no comment markers left to match. Move it under preParseHandlers instead. Just be careful not to run this transformer on other files (especially binaries, like PDFs); consider using restrictTo within the transformer tag to apply it only to HTML.
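
For example, the relocated transformer could look like this (a sketch only; double-check the restrictTo syntax against the StripBetweenTransformer documentation for your version, where handlers can match on the document.contentType field):

  <preParseHandlers>
      <!-- Runs on the raw HTML, before parsing strips the markup away. -->
      <transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer"
              inclusive="true">
          <!-- Apply only to HTML so binaries such as PDFs are left untouched. -->
          <restrictTo caseSensitive="false" field="document.contentType">text/html</restrictTo>
          <stripBetween>
              <start><![CDATA[<!-- SIDENAV_START -->]]></start>
              <end><![CDATA[<!-- SIDENAV_END -->]]></end>
          </stripBetween>
      </transformer>
  </preParseHandlers>

Note that the CDATA wrappers are XML escaping for the configuration file only; your page itself should contain just the plain <!-- SIDENAV_START --> and <!-- SIDENAV_END --> comments.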

kengher commented 6 years ago

Thanks for the quick reply! Yes, I had completely missed the preParse handler. It is working as intended now after relocating the transformer, and thanks for the restrictTo tip. This issue can be closed.