Closed essiembre closed 9 years ago
@madsbrydegaard, what you have seems just fine. What version are you using? Is the code snippet you provided within an importer
tag? Can you paste a copy of your config?
Version is latest - just downloaded today...
Here goes the full config. Both seed sites get committed...
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright 2010-2014 Norconex Inc.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<!-- This configuration shows the minimum required and minimum recommended to
run a crawler.
-->
<httpcollector id="Minimum Config HTTP Collector">
  <!-- Decide where to store generated files. -->
  <progressDir>./examples-output/minimum/progress</progressDir>
  <logsDir>./examples-output/minimum/logs</logsDir>
  <crawlers>
    <crawler id="Norconex Minimum Test Page">
      <!-- === Minimum required: =========================================== -->
      <!-- Requires at least one start URL. -->
      <startURLs>
        <url>http://www.vidcore.com</url>
        <url>http://www.lagocph.dk</url>
      </startURLs>
      <!-- === Minimum recommended: ======================================== -->
      <!-- Where the crawler default directory to generate files is. -->
      <workDir>./examples-output/minimum</workDir>
      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
      <maxDepth>0</maxDepth>
      <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
      <sitemap ignore="true" />
      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="1000" />
      <!-- At a minimum make sure you stay on your domain. -->
      <referenceFilters>
        <filter
            class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"
            onMatch="exclude" >
          jpg,gif,png,ico,css,js
        </filter>
        <!--
        <filter
            class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
            onMatch="include" >
          http://www\.norconex\.com/product/collector-http-test/.*
        </filter>
        -->
      </referenceFilters>
      <importer>
        <postParseHandlers>
          <filter class="com.norconex.importer.handler.filter.impl.RegexContentFilter"
              onMatch="include"
              caseSensitive="false">
            .*\byoga\b.*
          </filter>
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>title,keywords,description,document.reference</fields>
          </tagger>
        </postParseHandlers>
      </importer>
      <!-- Decide what to do with your files by specifying a Committer. -->
      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>./examples-output/minimum/crawledFiles</directory>
      </committer>
    </crawler>
  </crawlers>
</httpcollector>
Pascal did you have a chance to look at my config?
Hello Mads,
I'll give your config a try and get back to you.
I was able to reproduce the bug. Let me check the source code.
Any news on the subject?
I looked at the source code and I cannot find the bug for the life of me. I will need Pascal's help on this one.
I just had a chance to try your config and found out what is going on. For the onMatch operation to take effect, there has to be a match in the first place. A document that does not trigger a match won't have the onMatch operation applied to it, so the filter has no effect on it and it goes through normally.
So to achieve what you want, you have to match the documents you do not want and specify onMatch="exclude". This has been tested to work with the latest snapshot version:
<filter class="com.norconex.importer.handler.filter.impl.RegexContentFilter"
    onMatch="exclude" caseSensitive="false">
  ^((?!\byoga\b)[\s\S])*$
</filter>
The regex says to match all docs not containing the word "yoga" (and exclude them).
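To see what that expression matches outside the crawler, here is a small standalone Java sketch. It only uses java.util.regex; the class and method names are made up for illustration, and Pattern.CASE_INSENSITIVE is assumed to be a fair stand-in for caseSensitive="false". It is not how the Importer applies the filter internally.

```java
import java.util.regex.Pattern;

// Illustrative only: checks which texts the "exclude" expression would match.
class YogaExcludePatternDemo {

    // Same expression as in the filter above.
    // Assumption: CASE_INSENSITIVE approximates caseSensitive="false".
    static final Pattern NO_YOGA = Pattern.compile(
            "^((?!\\byoga\\b)[\\s\\S])*$", Pattern.CASE_INSENSITIVE);

    // True when the entire text matches, i.e. the whole word "yoga"
    // does not start at any position in the text.
    static boolean lacksYoga(String text) {
        return NO_YOGA.matcher(text).matches();
    }

    public static void main(String[] args) {
        // No "yoga" anywhere: the expression matches, so the doc would be excluded.
        System.out.println(lacksYoga("Online video\nMobile video"));
        // Contains the word "Yoga": no match, so the doc would go through.
        System.out.println(lacksYoga("Join our Yoga class today"));
        // "yogalates" is not the whole word "yoga", so the expression still matches.
        System.out.println(lacksYoga("Try our yogalates class"));
    }
}
```

The negative lookahead is checked before every character, which is why a single occurrence of the word anywhere in the text is enough to make the whole match fail.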
If not intuitive enough, this logic may have to be revisited. I am now wondering if the current logic should be considered a bug, preventing the "include" from ever working properly (it may need to change to match what you expected). I am open to suggestions to make the filtering easier to use. I would ideally like to have a big list of use cases and document how each case should behave, helping both for the implementation and general usage.
Let me know how the above goes.
After further investigation and testing, it turns out there was a bug with onMatch=include. The way you have it should have worked. I created a new snapshot release of the Importer module with the fix in it. With that release, what you have will work. I also documented the expected behavior according to different use cases here.
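For anyone reading this later, the difference between the behavior first described in this thread and the behavior the reporter expected can be modeled roughly as below. This is a hypothetical model of a single content filter written for this thread, not Norconex source code. The names are invented, and it says nothing about how several filters combine.

```java
// Hypothetical model of one content filter; not taken from the Importer code base.
class OnMatchModel {

    enum OnMatch { INCLUDE, EXCLUDE }

    // Behavior as first described in the thread: onMatch is only applied
    // when the regex matches, so a non-matching document always goes through.
    static boolean acceptedAsFirstDescribed(boolean matched, OnMatch onMatch) {
        if (!matched) {
            return true;
        }
        return onMatch == OnMatch.INCLUDE;
    }

    // Behavior the reporter expected: "include" keeps only matching documents,
    // "exclude" keeps only non-matching ones.
    static boolean acceptedAsExpected(boolean matched, OnMatch onMatch) {
        return onMatch == OnMatch.INCLUDE ? matched : !matched;
    }

    public static void main(String[] args) {
        // A page without "yoga" run through an include filter:
        System.out.println(acceptedAsFirstDescribed(false, OnMatch.INCLUDE));
        System.out.println(acceptedAsExpected(false, OnMatch.INCLUDE));
    }
}
```

In this model the two versions only disagree for a non-matching document under "include", which is the case reported here (pages without the keyword still being committed).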
FYI, I also just made a new snapshot release of the HTTP Collector.
Great thanks.
From @madsbrydegaard, moved from https://github.com/Norconex/collector-http/issues/48#issuecomment-101662531:
I tried implementing the filter option:
However, pages without the "keyword" still get committed. What is the correct way to use this feature?
Thnx