Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Unexpected REJECTED_FILTER #269

Closed: olgapshen closed this issue 8 years ago

olgapshen commented 8 years ago

Hello, I have the following configuration (file is attached): config.txt

Now I run the job, and as a result I see:

INFO [CrawlerEventManager] DOCUMENT_FETCHED: https://en.wikipedia.org/wiki/Tiger
...
INFO [CrawlerEventManager] URLS_EXTRACTED: https://en.wikipedia.org/wiki/Tiger
INFO [CrawlerEventManager] REJECTED_FILTER: https://en.wikipedia.org/wiki/Tiger

In mvstore I have:

processedInvalid https://en.wikipedia.org/wiki/Tiger: https://en.wikipedia.org/wiki/Tiger - REJECTED

This is the code which generates that line from the MVStore:

    import org.h2.mvstore.MVMap;
    import org.h2.mvstore.MVStore;
    import com.norconex.collector.core.data.CrawlState;
    import com.norconex.collector.core.data.ICrawlData;

    // Open the crawl store (adjust the path to your workdir) and dump
    // the "processedInvalid" map.
    MVStore s = MVStore.open("./tigrsoft/crawlstore/mvstore/tigrscrawler/mvstore");
    String name = "processedInvalid";
    MVMap<String, ICrawlData> map = s.openMap(name);

    System.out.println(name);

    for (String key : map.keySet()) {
      ICrawlData data = map.get(key);
      String ref = data.getReference();
      CrawlState state = data.getState();
      System.out.println(key + ": " + ref + " - " + state);
    }

So why, despite this filter:

      <documentFilters>
        <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
                onMatch="include"
                caseSensitive="false">.*</filter>
      </documentFilters>

is it rejected? I tried with the regexp "tiger" too.

essiembre commented 8 years ago

To find out exactly why it was rejected, modify the log4j.properties file and set the log level to DEBUG, especially for REJECTED_FILTER:

log4j.logger.CrawlerEvent.REJECTED_FILTER=DEBUG

What do you get in the logs after you do so?

olgapshen commented 8 years ago

Hi, thank you for such a quick reply! Now it gives:

INFO [CrawlerEventManager] REJECTED_UNMODIFIED: https://en.wikipedia.org/wiki/Tiger (Subject: LastModifiedMetadataChecksummer[keep=false,targetField=collector.checksum-metadata,disabled=false])

I see that I need to do some kind of "clean"...

essiembre commented 8 years ago

Regardless of the exact cause, there is no point in creating a filter that accepts ALL, since that's the default behavior. I would simply remove that portion from your config.

Which version are you using? Your attached config does not seem to be the one you actually use. You have a maxDepth of 0, which will not crawl much beyond the first page, so when I tried to reproduce, I could not get the REJECTED_FILTER and your start URL was crawled properly.

REJECTED_UNMODIFIED means the page has not changed since it was last crawled. If you want to start a "fresh" crawl and not perform "incremental" crawls (which pick up only new/modified/deleted pages), delete your workdir before you start it again (especially the "crawlstore" directory).
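
For reference, here is a minimal sketch of another way to avoid that: disabling the last-modified check in the crawler configuration. The <metadataChecksummer> tag, the class package, and the disabled attribute are assumptions based on the checksummer shown in your log, so verify them against the configuration reference for your version.

<!-- Sketch only (assumed tag, package, and attribute): with the last-modified
     check disabled, pages are no longer skipped before download based on
     their Last-Modified header. -->
<metadataChecksummer
    class="com.norconex.collector.http.checksum.impl.LastModifiedMetadataChecksummer"
    disabled="true" />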

olgapshen commented 8 years ago

About the config: yes, sorry, it may be different; I added the latest version of the config at the bottom of the post (I cannot attach an .xml file, unfortunately... I need to rename it to .txt). Now I use just the "tiger" expression. I understood your point about deleting the workdir. Version: I don't know where to check it... there is no README file. I downloaded the latest version yesterday.

Now I have another problem: I get the page in processedValid:

processedValid https://en.wikipedia.org/wiki/Tiger: https://en.wikipedia.org/wiki/Tiger - UNMODIFIED

I changed the string to "tiger1" and it still remains there...

<?xml version="1.0" encoding="UTF-8"?>
<!--
   Copyright 2010-2015 Norconex Inc.

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<!-- This configuration shows the minimum required and basic recommendations
     to run a crawler.
     -->
<httpcollector id="tigrcollector">

  <!-- Decide where to store generated files. -->
  <progressDir>./tigrsoft/progress</progressDir>
  <logsDir>./tigrsoft/logs</logsDir>

  <crawlers>
    <crawler id="tigrscrawler">

      <!-- Requires at least one start URL (or urlsFile).
           Optionally limit crawling to same protocol/domain/port as
           start URLs. -->
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>https://en.wikipedia.org/wiki/Tiger</url>
      </startURLs>

      <!-- === Recommendations: ============================================ -->

      <!-- Specify a crawler default directory where to generate files. -->
      <workDir>./tigrsoft</workDir>

      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
      <maxDepth>0</maxDepth>

      <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
      <!-- Before 2.3.0: -->
      <sitemap ignore="true" />
      <!-- Since 2.3.0: -->
      <sitemapResolverFactory ignore="true" />

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="5000" />

      <documentFilters>
        <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
                onMatch="include"
                caseSensitive="false">tiger</filter>
      </documentFilters>

      <!-- Decide what to do with your files by specifying a Committer. -->
      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>./tigrsoft/crawledFiles</directory>
      </committer>

    </crawler>
  </crawlers>

</httpcollector>

olgapshen commented 8 years ago

I mean, where do I see the list of URLs that passed the check? Isn't it the processedValid map?

essiembre commented 8 years ago

The version is in the filename you downloaded (e.g., norconex-collector-http-2.5.0.zip), and if you run it from the command prompt, the first few lines will have version information as well:

[non-job]: 2016-06-27 14:10:57 INFO - Version: "Collector" version is X.X.X.

But I gather you are using a recent version if you got it yesterday. :-)

The URLs that are processed just fine will be "committed" (sent to your committer). In your case, as you are using the FileSystemCommitter, files that were processed properly will end up under the location you specified: ./tigrsoft/crawledFiles. That's fine for testing, but in the end, you probably want to use your own Committer to send crawled files where/how you want them.

If you want to see a trace of a document that was not rejected and was sent for committing, you can look for something like this in your logs:

INFO  [CrawlerEventManager]    DOCUMENT_COMMITTED_ADD: https://en.wikipedia.org/wiki/Tiger

If you get UNMODIFIED, it is because you have not cleared your workdir/crawlstore before running, and it remembers documents previously crawled.

I am afraid I am not fully understanding what is not working for you. Can you give a concrete use case of what you are getting or not getting that you suspect is not working right?

olgapshen commented 8 years ago

OK, see, I added the config file and the log files at the bottom. I have been running it for 5 minutes already, and it still hasn't created the crawledFiles folder.

Here is folder tree:

olga@olgacomp:~/norconex/tigrsoft$ tree
.
├── crawlstore
│   └── mvstore
│       └── tigrscrawler
│           └── mvstore
├── logs
│   └── latest
│       └── logs
│           └── tigrscrawler.log
├── progress
│   └── latest
│       ├── status
│       │   └── tigrscrawler__tigrscrawler.job
│       └── tigrscrawler.index
└── sitemaps
    └── tigrscrawler
        ├── mapdb
        ├── mapdb.p
        └── mapdb.t

Here is the log: tigrscrawler.txt

I run the command in this way: ./collector-http.sh -a start -c config.xml

And here is log4j properties: log4j.properties.txt

Tell me what else you need.

<?xml version="1.0" encoding="UTF-8"?>

<httpcollector id="tigrcollector">

  <progressDir>./tigrsoft/progress</progressDir>
  <logsDir>./tigrsoft/logs</logsDir>

  <crawlers>
    <crawler id="tigrscrawler">

      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>https://en.wikipedia.org/wiki/Tiger</url>
      </startURLs>

      <workDir>./tigrsoft</workDir>

      <maxDepth>3</maxDepth>

      <sitemap ignore="true" />

      <sitemapResolverFactory ignore="true" />

      <delay default="5000" />

      <documentFilters>
        <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
                onMatch="include"
                caseSensitive="false">about</filter>
      </documentFilters>

      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>./tigrsoft/crawledFiles</directory>
      </committer>

    </crawler>
  </crawlers>
</httpcollector>

essiembre commented 8 years ago

According to your logs, it is normal that you have no "committed" documents, since all URLs are being rejected. Your document filter says you want to keep only URLs that match the regular expression "about". None in your log match it, so the behavior is correct.

olgapshen commented 8 years ago

But there are MANY "about"s there... https://en.wikipedia.org/wiki/Tiger

essiembre commented 8 years ago

A document "reference" is a document unique identifier. In the case of the HTTP Collector you are using, it is the document URL. So the RegexReferenceFilter job is to filter documents based on URL matching your regular expression, not content.

The reference filter takes place before the document is downloaded, to save unnecessary downloads. If you want to filter documents based on patterns matching their content, it means they will have to be downloaded first, and your best option is to rely on filters found in the Importer module, like this one:

<importer>
  <postParseHandlers>
    <filter class="com.norconex.importer.handler.filter.impl.RegexContentFilter" onMatch="include">
      .*about.*
    </filter>
  </postParseHandlers>
</importer>

But it is always best to filter as much as you can before downloading, for performance reasons (granted, it is not always possible).
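
For example, a pre-download filter keeping the crawl within English Wikipedia article URLs could look like this sketch. It assumes your version supports the <referenceFilters> (pre-download) section, so double-check that against the configuration reference for your release.

<!-- Sketch only (assumes the <referenceFilters> section): URLs rejected here
     are never fetched. -->
<referenceFilters>
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
          onMatch="include"
          caseSensitive="false">https://en\.wikipedia\.org/wiki/.*</filter>
</referenceFilters>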

Maybe it would help if you explained exactly what you are trying to accomplish? Grab all Wikipedia pages that contain the word "about"? Then the above example will do it, but it will generate quite a massive crawl.

olgapshen commented 8 years ago

OK! I see! It works!!! Thank you very much! Sorry for the late answer :)