Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystems and sending it to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Filter usage question #108

Closed. essiembre closed this issue 9 years ago.

essiembre commented 9 years ago

From @madsbrydegaard, moved from https://github.com/Norconex/collector-http/issues/48#issuecomment-101662531:

I tried implementing the filter option:

<postParseHandlers>
  <filter class="com.norconex.importer.handler.filter.impl.RegexContentFilter"
      onMatch="include"
      caseSensitive="false">
    .*\bkeyword\b.*
  </filter>
</postParseHandlers>

However, pages without the "keyword" still get committed. What is the correct way to use this feature?

Thnx

essiembre commented 9 years ago

@madsbrydegaard, what you have seems just fine. What version are you using? Is the code snippet you provided within an importer tag? Can you paste a copy of your config?

madsbrydegaard commented 9 years ago

Version is latest - just downloaded today...

Here is the full config. Both seed sites get committed...

<?xml version="1.0" encoding="UTF-8"?>
<!-- 
   Copyright 2010-2014 Norconex Inc.

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<!-- This configuration shows the minimum required and minimum recommended to 
     run a crawler.  
     -->
<httpcollector id="Minimum Config HTTP Collector">

  <!-- Decide where to store generated files. -->
  <progressDir>./examples-output/minimum/progress</progressDir>
  <logsDir>./examples-output/minimum/logs</logsDir>

  <crawlers>
    <crawler id="Norconex Minimum Test Page">

      <!-- === Minimum required: =========================================== -->

      <!-- Requires at least one start URL. -->
      <startURLs>
        <url>http://www.vidcore.com</url>
        <url>http://www.lagocph.dk</url>
      </startURLs>

      <!-- === Minimum recommended: ======================================== -->

      <!-- The crawler's default directory for generated files. -->
      <workDir>./examples-output/minimum</workDir>

      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
      <maxDepth>0</maxDepth>

      <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
      <sitemap ignore="true" /> 

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="1000" />

      <!-- At a minimum make sure you stay on your domain. -->
      <referenceFilters>
        <filter 
            class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"
            onMatch="exclude" >
          jpg,gif,png,ico,css,js
        </filter>
      <!--
        <filter 
            class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
            onMatch="include" >
          http://www\.norconex\.com/product/collector-http-test/.*
        </filter>
        -->
      </referenceFilters>

      <importer>

        <postParseHandlers>
            <filter class="com.norconex.importer.handler.filter.impl.RegexContentFilter"
              onMatch="include" 
              caseSensitive="false">
              .*\byoga\b.*
            </filter>
            <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
                <fields>title,keywords,description,document.reference</fields>
            </tagger>
        </postParseHandlers>
      </importer> 

      <!-- Decide what to do with your files by specifying a Committer. -->
      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>./examples-output/minimum/crawledFiles</directory>
      </committer>

    </crawler>
  </crawlers>

</httpcollector>
madsbrydegaard commented 9 years ago

Pascal, did you have a chance to look at my config?

martinfou commented 9 years ago

Hello Mads,

I'll give your config a try and get back to you.

martinfou commented 9 years ago

I was able to reproduce the bug. Let me check the source code.

madsbrydegaard commented 9 years ago

Any news on the subject?

martinfou commented 9 years ago

I looked at the source code and I cannot find the bug for the life of me. I will need Pascal's help on this one.

essiembre commented 9 years ago

I just had a chance to try your config and found out what is going on. For the onMatch operation to take effect, there has to be a match in the first place. A document that does not trigger a match won't have the onMatch operation applied to it, so the filter has no effect on it and it goes through normally.
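
In other words, with that pre-fix behavior, an onMatch="include" filter effectively acts like this minimal Java sketch (a hypothetical illustration only, not actual Norconex code; the class and pattern here are made up for the example):

import java.util.regex.Pattern;

// Hypothetical sketch of the behavior described above: the include
// action only fires on a match, so a non-matching document is left
// untouched and continues down the pipeline to the committer.
public class IncludeFilterBehaviorSketch {

    private static final Pattern CONTENT_PATTERN = Pattern.compile(
            ".*\\byoga\\b.*", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns true if the document keeps flowing through the pipeline.
    static boolean accept(String content) {
        boolean matched = CONTENT_PATTERN.matcher(content).matches();
        if (matched) {
            // onMatch="include": a matching document is explicitly kept.
            return true;
        }
        // No match: the onMatch action never applies, so the filter has no
        // effect and the document passes through anyway.
        return true;
    }

    public static void main(String[] args) {
        System.out.println(accept("Our studio offers yoga classes."));  // true (match, included)
        System.out.println(accept("Our studio offers video hosting.")); // true (no match, unfiltered)
    }
}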

So to achieve what you want, you have to match the documents you do not want and specify onMatch="exclude". This has been tested to work with the latest snapshot version:

<filter class="com.norconex.importer.handler.filter.impl.RegexContentFilter"
    onMatch="exclude"
    caseSensitive="false">
  ^((?!\byoga\b)[\s\S])*$
</filter>

The regex says to match all docs not containing the word "yoga" (and exclude them).
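
As a quick sanity check (plain Java, outside of Norconex), the regex behaves as described; the sample strings are made up for the example:

import java.util.regex.Pattern;

// Standalone check of the exclude regex above: it matches only content
// that contains no occurrence of the word "yoga", which is what
// onMatch="exclude" then drops.
public class ExcludeRegexDemo {
    public static void main(String[] args) {
        Pattern p = Pattern.compile(
                "^((?!\\byoga\\b)[\\s\\S])*$", Pattern.CASE_INSENSITIVE);

        System.out.println(p.matcher("We offer yoga classes daily.").matches());    // false -> not excluded, kept
        System.out.println(p.matcher("We offer pilates classes daily.").matches()); // true  -> excluded
    }
}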

If this is not intuitive enough, the logic may have to be revisited. I am now wondering whether the current logic should be considered a bug, since it prevents "include" from ever working properly (it may need to change to match what you expected). I am open to suggestions to make the filtering easier to use. Ideally, I would like to gather a big list of use cases and document how each should behave, to help both the implementation and general usage.

Let me know how the above goes.

essiembre commented 9 years ago

After further investigation and testing, it turns out there was a bug with onMatch="include". The way you have it should have worked. I created a new snapshot release of the Importer module with the fix in it. With that release, what you have will work. I also documented the expected behavior for different use cases here.
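
For comparison with the earlier sketch, the fixed onMatch="include" semantics amount to the following (again a hypothetical Java illustration of the described behavior, not the actual implementation):

import java.util.regex.Pattern;

// Hypothetical sketch of the post-fix onMatch="include" semantics:
// only documents whose content matches the pattern are kept; everything
// else is rejected, which is what the original config expected.
public class IncludeFilterFixedSketch {

    private static final Pattern CONTENT_PATTERN = Pattern.compile(
            ".*\\byoga\\b.*", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static boolean accept(String content) {
        // With the fix, a non-match means the document is rejected.
        return CONTENT_PATTERN.matcher(content).matches();
    }

    public static void main(String[] args) {
        System.out.println(accept("Our studio offers yoga classes."));  // true  -> committed
        System.out.println(accept("Our studio offers video hosting.")); // false -> dropped
    }
}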

essiembre commented 9 years ago

FYI, I also just made a new snapshot release of the HTTP Collector.

madsbrydegaard commented 9 years ago

Great thanks.

essiembre commented 9 years ago

The official Norconex HTTP Collector 2.2.0 release is out. It includes this fix. You can download it here.