Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

I am trying to configure its importer module to strip what's between headers, rightnavs and footers, but it does not seem to be stripping what is between these known tags. #250

Closed 8 years ago

mitchelljj commented 8 years ago

I am trying to use Norconex HTTP Collector and configure its importer module to strip what's between headers, rightnavs and footers, but it does not seem to be stripping what is between these known tags.

Below are the divs within the html documents

<div id="header">........</div>

<div id="rightnav" class="col-sm-3">........</div>

<div id="footer">........</div>

Below is part of the XML file

*(The XML snippet was garbled in the original post.)*
OkkeKlein commented 8 years ago

You might want to try using this in the <preParseHandlers> stage. Once the content is parsed, these tags are gone.

mitchelljj commented 8 years ago

what does "on the stage" mean?

OkkeKlein commented 8 years ago

Text dropped, so edited post.

You need to move the config from postParseHandlers to preParseHandlers stage.
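To make the suggestion above concrete, here is a minimal sketch of where the two stages sit in the importer configuration (element names as used elsewhere in this thread; the comments describe the behavior OkkeKlein is pointing out):

```xml
<importer>
  <preParseHandlers>
    <!-- Handlers here see the raw downloaded HTML, so markup such as
         <div id="header"> is still present and can be matched. -->
  </preParseHandlers>
  <postParseHandlers>
    <!-- Handlers here run after parsing, on extracted plain text:
         the HTML tags are gone, so tag-based stripping cannot match. -->
  </postParseHandlers>
</importer>
```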

mitchelljj commented 8 years ago

I just switched to preParseHandlers and reran, and it still does not strip them out.

essiembre commented 8 years ago

Which version of the HTTP Collector are you using? Please attach your full config to reproduce.

mitchelljj commented 8 years ago

HTTP Collector 2.2.1 and I attached the full config file via an email to pascal.essiembre@norconex.com

essiembre commented 8 years ago

OK, I just tried with the latest HTTP Collector snapshot version and it works fine. Text from your header/footer/sidenav could not be found in the extracted text.

If you do not want a snapshot release, there is a fix to the StripBetweenTagger that was made to the Importer module 2.5.1 which you can download here.

Replace the norconex-importer-x.x.x.jar you have under your installation "lib" directory with the newer jar of the same name you'll get from the above download.

essiembre commented 8 years ago

Is this working now with the latest version of the Importer?

mitchelljj commented 8 years ago

Using the transformer (see the config file section below) I am able to successfully ignore sections within crawled web pages. The problem is that other crawled content, such as PDFs, does not contain these sections, and those PDFs are not committed at all. If I take the stripBetween tags out of the config file, the PDFs are committed, but then, as expected, the other crawled web pages no longer have these sections stripped.

<crawler id="Norconex OCR Search">
  <importer>
    <preParseHandlers>
      <transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer"
        inclusive="true">
        <stripBetween>
          <start><![CDATA[<div id="header">]]></start>
          <end><![CDATA[<div id="columns" class="row">]]></end>
        </stripBetween>
        <stripBetween>
          <start><![CDATA[<div id="rightnav" class="col-sm-3">]]></start>
          <end><![CDATA[<!-- end right hand nav area -->]]></end>
        </stripBetween>
        <stripBetween>
          <start><![CDATA[<div id="footer-columns" class="row">]]></start>
          <end><![CDATA[<!--END Content Zone "L"  -->]]></end>
        </stripBetween>
      </transformer>
    </preParseHandlers>
  </importer> 

  <!-- === Minimum required: =========================================== -->

  <!-- Requires at least one start URL. -->
  <startURLs>
    <url>http://www2.ed.gov/about/offices/list/ocr/docs/investigations/</url>
  </startURLs>

  <!-- At a minimum make sure you stay on your domain. -->
  <referenceFilters>
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">http://www2\.ed\.gov/about/offices/list/ocr/docs/investigations/.*</filter>
  </referenceFilters>
</crawler>