Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystems into various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Strip text from HTML body - best practice #237

Closed. V3RITAS closed this issue 8 years ago.

V3RITAS commented 8 years ago

I'm using the Norconex HTTP Collector to crawl HTML files and send certain meta-fields and the content text to a Solr server.

What I now want to do is send only the text between certain HTML comments to the Solr server and ignore the rest. My HTML code looks like this:

<body>
  ...
  <!-- START -->
     Some text to be extracted
  <!-- END -->
   ...
     Ignore this text
   ...
  <!-- START -->
     Some other text to be extracted
  <!-- END -->
  ...
</body>

As you can see, there can be multiple start-end chunks to extract in one file.

I already solved this by using multiple StripBetweenTransformer handlers in the pre-parse handlers:

<preParseHandlers> 

  <transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer" inclusive="true" sourceCharset="ISO-8859-1" >
    <stripBetween>
      <start><![CDATA[<body>]]></start>
      <end><![CDATA[<!-- START -->]]></end>
    </stripBetween>
  </transformer> 

  <transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer" inclusive="true" sourceCharset="ISO-8859-1" >
    <stripBetween>
      <start><![CDATA[<!-- END -->]]></start>
      <end><![CDATA[<!-- START -->]]></end>
    </stripBetween>
  </transformer>           

  <transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer" inclusive="true" sourceCharset="ISO-8859-1" >
    <stripBetween>
      <start><![CDATA[<!-- END -->]]></start>
      <end><![CDATA[</body>]]></end>
    </stripBetween>
  </transformer>

</preParseHandlers> 

But I find this quite unwieldy, especially because I also have to define a sourceCharset. I have to do this because the HTML files also link to other document types (such as PDF). If I don't define a sourceCharset, I get a NullPointerException while parsing these documents:

java.lang.NullPointerException: charsetName

Is there a more elegant way to achieve this?

I've tried using the TextBetweenTagger in the post-parse handlers, but I couldn't figure out how to get the text of the HTML body, modify it with this tagger, and send it to the Solr server as the content.

essiembre commented 8 years ago

The NPE is obviously wrong, so I am marking it as a bug. You should not have to specify the charset unless you know you have charset issues with your crawling.

All the "taggers" manipulate metadata only, without modifying the content. To manipulate the content, you have to rely on "transformers". The one I think is best suited to achieve what you are after is ReplaceTransformer.
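For illustration, a single ReplaceTransformer handler could replace one of the StripBetweenTransformer entries above with a regex-based replacement (you would still need similar replaces for the leading and trailing sections). Treat this as a sketch only: the <replace>/<fromValue>/<toValue> element names and the regex behaviour are assumptions here, so verify the exact syntax against the ReplaceTransformer documentation.

```xml
<!-- Sketch only: element names and regex support are assumptions;
     check the ReplaceTransformer documentation for the exact syntax. -->
<transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
  <replace>
    <!-- Drop everything from an END marker up to (but not including)
         the next START marker, keeping only the marked chunks. -->
    <fromValue><![CDATA[<!-- END -->[\s\S]*?(?=<!-- START -->)]]></fromValue>
    <toValue></toValue>
  </replace>
</transformer>
```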

That being said, you can keep using TextBetweenTagger and store the results in a new metadata field (via the name attribute). Let's say you call that new field "myContentField"; then, when using the Solr Committer, you can override the default by specifying which field to use as the source content instead of the actual document body:

<committer class="com.norconex.committer.solr.SolrCommitter">
     ... 
     <sourceContentField keep="false">myContentField</sourceContentField>
     ...
</committer>

Please report if that does not work for you.

V3RITAS commented 8 years ago

Hi Pascal,

Thank you very much for your fast response! :)

Please note that I also get the same NPE when parsing documents with the TextBetweenTagger without the sourceCharset parameter. It seems to be the same problem as with the StripBetweenTransformer.

So what you suggest is to use the TextBetweenTagger as a pre-parse handler and store the parsed text in a new field (e.g. contentNew), correct? I've changed my config like this:

...
<preParseHandlers>
  <tagger class="com.norconex.importer.handler.tagger.impl.TextBetweenTagger" inclusive="false" sourceCharset="ISO-8859-1" >
      <textBetween name="contentNew">
          <start><![CDATA[<!-- START -->]]></start>
          <end><![CDATA[<!-- END -->]]></end>
      </textBetween>
  </tagger>
</preParseHandlers>  
...
<committer class="com.norconex.committer.solr.SolrCommitter">
  ...
  <sourceContentField keep="false">contentNew</sourceContentField>
  <targetContentField>text</targetContentField>
  ...
</committer>

This works fine: it now commits only the content between my start and end tags to Solr. Unfortunately, I now lose the content of the non-HTML files (such as PDFs), as they naturally don't include these tags.

Is there a way to only apply the TextBetweenTagger to HTML files and keep the content of all other files as it is?

Otherwise my initial solution might not be as unwieldy as I thought, provided I no longer have to specify the encoding in a future release. ;)

essiembre commented 8 years ago

Can you provide a copy of your config or the URL of the page causing the NullPointerException with StripBetweenTransformer? I cannot reproduce it. A full stacktrace could also help.

V3RITAS commented 8 years ago

I'm sorry, but I can't give you the URL of the page as it is on our intranet and not accessible from outside.

But here is the content of my config xml file:

<?xml version="1.0" encoding="UTF-8"?>
<!-- 
   Copyright 2010-2015 Norconex Inc.

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<!-- This configuration shows the minimum required and basic recommendations
     to run a crawler.  
     -->
<httpcollector id="Minimum Config HTTP Collector">

  <progressDir>${workdir}/progress</progressDir>
  <logsDir>${workdir}/logs</logsDir>

  <crawlers>
    <crawler id="Norconex Minimum Test Page">

      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>http://localhost/test/index.html</url>
      </startURLs>

      <numThreads>3</numThreads>      
      <orphansStrategy>DELETE</orphansStrategy>
      <workDir>${workdir}</workDir>
      <maxDepth>2</maxDepth>
      <sitemapResolverFactory ignore="true" />
      <delay default="5000" />

      <importer>

        <parseErrorsSaveDir>${workdir}/importer/parseErrors</parseErrorsSaveDir>      

        <preParseHandlers>          

          <transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer" inclusive="true">
            <stripBetween>
              <start><![CDATA[<body>]]></start>
              <end><![CDATA[<!-- START -->]]></end>
            </stripBetween>
          </transformer> 

          <transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer" inclusive="true">
            <stripBetween>
              <start><![CDATA[<!-- END -->]]></start>
              <end><![CDATA[<!-- START -->]]></end>
            </stripBetween>
          </transformer>           

          <transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer" inclusive="true">
            <stripBetween>
              <start><![CDATA[<!-- END -->]]></start>
              <end><![CDATA[</body>]]></end>
            </stripBetween>
          </transformer>

        </preParseHandlers>           

        <postParseHandlers>

          <!-- Set "keywords" field name to lowercase -->
          <tagger class="com.norconex.importer.handler.tagger.impl.CharacterCaseTagger">
            <characterCase fieldName="keywords" type="lower" applyTo="field" />
          </tagger>        

          <!-- Only keep necessary tags -->          
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>id,document.reference,document.contentType,title,keywords,area,subarea,Last-Modified,Last-Save-Date</fields>
          </tagger>

          <!-- Rename certain fields -->
          <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
            <rename fromField="Last-Modified" toField="last_modified" overwrite="true" />      
            <rename fromField="Last-Save-Date" toField="last_saved" overwrite="true" />      
            <rename fromField="document.contentType" toField="contentType" overwrite="true" />      
          </tagger>        

        </postParseHandlers>

      </importer> 

      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>${workdir}/crawledFiles</directory>
      </committer>

    </crawler>
  </crawlers>

</httpcollector>

When parsing a simple PDF file it returns the following stack trace:

Norconex Minimum Test Page: 2016-03-21 15:50:18 ERROR - Norconex Minimum Test Page: Could not process document: http://localhost/test/docs/test.pdf (charsetName)
java.lang.NullPointerException: charsetName
    at java.io.InputStreamReader.<init>(Unknown Source)
    at com.norconex.importer.handler.transformer.AbstractCharStreamTransformer.transformApplicableDocument(AbstractCharStreamTransformer.java:106)
    at com.norconex.importer.handler.transformer.AbstractDocumentTransformer.transformDocument(AbstractDocumentTransformer.java:56)
    at com.norconex.importer.Importer.transformDocument(Importer.java:552)
    at com.norconex.importer.Importer.executeHandlers(Importer.java:352)
    at com.norconex.importer.Importer.importDocument(Importer.java:309)
    at com.norconex.importer.Importer.doImportDocument(Importer.java:271)
    at com.norconex.importer.Importer.importDocument(Importer.java:195)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:35)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:298)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:487)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:377)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:723)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

If I use sourceCharset="ISO-8859-1" everything works fine.

essiembre commented 8 years ago

Haaa... I see what causes this. When using pre-parse handlers, you have to be careful to account for files that are not text-based. In your case, since you want to deal with HTML files only, I suggest you restrict the handler to just those files using the restrictTo tag. I did find a real bug, though: <stripBetween> tags sharing the same <start> tag as another one were being dropped. This is fixed in an Importer snapshot release I just made.

All in all, this should now work for you (a bit more elegant, and only dealing with HTML):

    <preParseHandlers>
      <transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer" inclusive="true">
        <stripBetween>
          <start><![CDATA[<body>]]></start>
          <end><![CDATA[<!-- START -->]]></end>
        </stripBetween>
        <stripBetween>
          <start><![CDATA[<!-- END -->]]></start>
          <end><![CDATA[<!-- START -->]]></end>
        </stripBetween>
        <stripBetween>
          <start><![CDATA[<!-- END -->]]></start>
          <end><![CDATA[</body>]]></end>
        </stripBetween>
        <restrictTo field="document.contentType">text/html</restrictTo>
      </transformer> 

    </preParseHandlers>

Please give it a try by replacing the norconex-importerXXX.jar you have under lib with the snapshot found in the link above, and confirm. If all works fine for you, I will make a formal Importer release with the fix.

essiembre commented 8 years ago

FYI, I just updated the Importer snapshot release to log a warning when the character encoding cannot be detected by character-based transformers. With the Importer module, if a charset cannot be detected, it more than likely means you are dealing with a binary file. The warning will highlight this and suggest options, making it easier to troubleshoot.

V3RITAS commented 8 years ago

Hi Pascal,

I've changed my pre parse handler config according to your proposal and replaced the norconex-importerXXX.jar with the one from the snapshot. Works perfectly and exactly how I need it. :)

I also checked your warning message by removing the restrictTo tag and it also works fine.

I guess this ticket can be marked as closed now. Thank you very much for your great support!

essiembre commented 8 years ago

Thanks for confirming! Importer 2.5.1 has just been formally released.