mitchelljj closed 8 years ago
You might want to try using this in the <preParseHandlers> stage. Once the content is parsed, these tags are no longer present.
What does "on the stage" mean?
Some text was dropped, so I edited the post.
You need to move the config from postParseHandlers to preParseHandlers stage.
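As a rough sketch of that placement, reusing the transformer class and the first stripBetween entry from the config posted in this thread (the comments are explanatory assumptions, not official documentation):

```xml
<importer>
  <!-- Pre-parse handlers see the raw HTML, so the <div> markers can still be matched. -->
  <preParseHandlers>
    <transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer"
        inclusive="true">
      <stripBetween>
        <start><![CDATA[<div id="header">]]></start>
        <end><![CDATA[<div id="columns" class="row">]]></end>
      </stripBetween>
    </transformer>
  </preParseHandlers>
  <!-- Under postParseHandlers the same handler would only see extracted text,
       where these tags no longer exist, so nothing would match. -->
</importer>
```

The other stripBetween entries would go inside the same transformer element, alongside the first one.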
I just switched to preParseHandlers and reran, and it still does not strip the content out.
Which version of the HTTP Collector are you using? Please attach your full config to reproduce.
HTTP Collector 2.2.1 and I attached the full config file via an email to pascal.essiembre@norconex.com
OK, I just tried with the latest HTTP Collector snapshot version and it works fine. Text from your header/footer/sidenav could not be found in the extracted text.
If you do not want a snapshot release, a fix to the StripBetweenTagger was made in Importer module 2.5.1, which you can download here.
Replace the norconex-importer-x.x.x.jar you have under your installation "lib" directory with the newer jar of the same name that you'll get from the above download.
Is this working now with the latest version of the Importer?
Using the
<crawler id="Norconex OCR Search">
  <importer>
    <preParseHandlers>
      <transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer"
          inclusive="true">
        <stripBetween>
          <start><![CDATA[<div id="header">]]></start>
          <end><![CDATA[<div id="columns" class="row">]]></end>
        </stripBetween>
        <stripBetween>
          <start><![CDATA[<div id="rightnav" class="col-sm-3">]]></start>
          <end><![CDATA[<!-- end right hand nav area -->]]></end>
        </stripBetween>
        <stripBetween>
          <start><![CDATA[<div id="footer-columns" class="row">]]></start>
          <end><![CDATA[<!--END Content Zone "L" -->]]></end>
        </stripBetween>
      </transformer>
    </preParseHandlers>
  </importer>
  <!-- === Minimum required: =========================================== -->
  <!-- Requires at least one start URL. -->
  <startURLs>
    <url>http://www2.ed.gov/about/offices/list/ocr/docs/investigations/</url>
  </startURLs>
  <!-- At a minimum make sure you stay on your domain. -->
  <referenceFilters>
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">http://www2\.ed\.gov/about/offices/list/ocr/docs/investigations/.*</filter>
  </referenceFilters>
</crawler>
I am trying to configure the importer module of Norconex HTTP Collector to strip what is between headers, rightnavs and footers, but it does not seem to be stripping what is between these known tags.
Below are the divs within the HTML documents:
<div id="header">........</div>
<div id="rightnav" class="col-sm-3">........</div>
<div id="footer">........</div>
Below is part of the XML file