Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

HTTP Collector 3.0.0 - ReplaceTransformer has no effect on content #773

adesso-thomas-lippitsch closed this issue 2 years ago

adesso-thomas-lippitsch commented 2 years ago

Hi,

I can't get the ReplaceTransformer to work in the Norconex HTTP Collector 3.0.0-SNAPSHOT (2021-12-20). Either I am missing something in the configuration, or it simply has no effect on the content field. It makes no difference whether I use the 'basic' or the 'regex' method.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>
<!-- 
   Copyright 2010-2020 Norconex Inc.

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<httpcollector id="Minimum Config HTTP Collector">

  <workDir>./workdir/example</workDir>
  <maxConcurrentCrawlers>1</maxConcurrentCrawlers>

  <crawlerDefaults>

    <robotsTxt ignore="true" />
    <robotsMeta ignore="true" />
    <maxDepth>1</maxDepth>
    <sitemapResolver ignore="true" />
    <delay default="5 seconds" />

    <committers>
      <committer class="XMLFileCommitter">
        <indent>4</indent>
      </committer>
    </committers>

    <importer>
      <preParseHandlers>
        <handler class="ReplaceTransformer">
          <replace>
            <valueMatcher method="basic" replaceAll="true" ignoreCase="true">domain</valueMatcher>
            <toValue>XXX</toValue>
          </replace>
          <replace>
            <valueMatcher method="regex" replaceAll="true">\s+</valueMatcher>
            <toValue> </toValue>
          </replace>
        </handler>
      </preParseHandlers>
    </importer>

  </crawlerDefaults>

  <crawlers>
    <crawler id="Example 1">
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>https://www.example.com</url>
      </startURLs>
    </crawler>
  </crawlers>

</httpcollector>

The content field just remains untouched:

<content>
    Example Domain

        This domain is for use in illustrative examples in documents. You may use this
        domain in literature without prior coordination or asking for permission.

        More information...
</content>

Best regards, Tom

essiembre commented 2 years ago

By default, such text matching tries to match the entire string. You may try adding partial="true" to your <valueMatcher ... >.
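The distinction between whole-value and partial matching can be sketched outside of Norconex. The snippet below is only an analogy in Python (the collector itself is Java, and this is not its actual implementation): `re.fullmatch` stands in for the default behavior and `re.search` for what `partial="true"` is assumed to enable.

```python
import re

text = "Example Domain"

# Whole-value matching: the pattern must cover the entire string,
# so "domain" alone does not match "Example Domain".
whole = re.fullmatch(r"domain", text, flags=re.IGNORECASE)

# Partial matching: the pattern may match anywhere inside the string.
partial = re.search(r"domain", text, flags=re.IGNORECASE)

print(whole)            # None
print(partial.group())  # Domain
```

With whole-value matching, a replacement only fires when the pattern describes the complete field value, which would explain why the content looked untouched.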

Also, if you want special multi-line handling, you can put regex flags at the very start of your pattern. For instance, (?s)myregex enables "single-line" (DOTALL) mode, in which . also matches line terminators. Other such flags are described in the Java Pattern documentation.
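As a rough illustration of what such an inline flag changes, here is a small Python sketch. Python's `(?s)` flag behaves like the Java one for this purpose, but the sample text and pattern are made up for the example.

```python
import re

text = "Example Domain\n\nThis domain is for use in illustrative examples."

# Without the flag, "." does not match "\n", so a whole-string match fails.
without_flag = re.fullmatch(r".*examples\.", text)

# With (?s), "." also matches line breaks and the whole string matches.
with_flag = re.fullmatch(r"(?s).*examples\.", text)

print(without_flag is None)   # True
print(with_flag is not None)  # True
```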

adesso-thomas-lippitsch commented 2 years ago

Thanks for your quick reply!

My goal is to collapse every run of \s, \n, \r and \t characters in the content string below into a single space:

"content": "Example Domain\n\n    This domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission.\n\n    More information..."

I'd like to end up with a string like this:

"content": "Example Domain This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission. More information..."

In Python, re.sub(r'[\s\n]+', ' ', s) yields exactly the result shown above, but I cannot reproduce it with the ReplaceTransformer.
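For reference, the Python behavior described here can be reproduced with the sample content from this thread. The variable names are arbitrary; note that `\s` already covers `\n`, `\r` and `\t`, so the shorter pattern gives the same result.

```python
import re

s = (
    "Example Domain\n\n    This domain is for use in illustrative examples "
    "in documents. You may use this\n    domain in literature without prior "
    "coordination or asking for permission.\n\n    More information..."
)

# Collapse every run of whitespace (spaces, tabs, line breaks) into one space.
cleaned = re.sub(r"\s+", " ", s)

# The character class from the post is equivalent, since \s includes \n.
same = re.sub(r"[\s\n]+", " ", s)

print(cleaned)
print(cleaned == same)  # True
```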

With partial="true" and \s+ I achieve this result:

"content": "Example Domain\n This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.\n More information..."

But no matter what I try, I can't get rid of the \n characters, even though \s+ seems to replace one \n of each \n\n pair.

essiembre commented 2 years ago

Can you attach the document associated with your test? Is it an HTML page?

The issue could be related to your ReplaceTransformer being defined under preParseHandlers. This means the replacement is applied before the text is extracted, so if, for instance, your content is mixed with HTML markup, it is applied to the whole raw document and not just to the extracted content as you see it under "content".

Try moving it under postParseHandlers to see if it makes any difference.
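Why the order matters can be shown with a toy model. The sketch below is a hypothetical stand-in, not the importer's real parser: `ToyTextExtractor` simply joins text nodes with newlines, on the assumption that a real extractor also inserts line breaks between block elements. Under that assumption, whitespace collapsed before parsing comes back during extraction.

```python
import re
from html.parser import HTMLParser


class ToyTextExtractor(HTMLParser):
    """Hypothetical parser stand-in: collects text nodes, one per line."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep only text nodes that contain something besides whitespace.
        if data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    parser = ToyTextExtractor()
    parser.feed(html)
    parser.close()
    # The extractor itself adds the line breaks between text blocks.
    return "\n".join(parser.chunks)


def collapse(text):
    return re.sub(r"\s+", " ", text)


raw_html = "<h1>Example Domain</h1>\n\n<p>This domain is for use\n    in examples.</p>"

pre_parse = extract_text(collapse(raw_html))   # replace first, then extract
post_parse = collapse(extract_text(raw_html))  # extract first, then replace

print(repr(pre_parse))   # still contains "\n" added by the extractor
print(repr(post_parse))  # single-spaced text
```

In this model only the post-parse order yields fully single-spaced text, which is consistent with the leftover \n characters reported above.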

adesso-thomas-lippitsch commented 2 years ago

Thanks a lot! The difference between preParse and postParse handlers was the missing link, and I finally understand it now. :)

When using postParse and partial="true" it works as desired:

<importer>
  <postParseHandlers>
    <handler class="ReplaceTransformer">
      <replace>
        <valueMatcher method="regex" replaceAll="true" partial="true">[\n\s\r\t]+</valueMatcher>
        <toValue xml:space="preserve"> </toValue>
      </replace>
    </handler>
  </postParseHandlers>
</importer>