Closed olgapshen closed 8 years ago
To find out exactly why it was rejected, modify the log4j.properties
file and set the log level to DEBUG. Especially for REJECTED_FILTER
:
log4j.logger.CrawlerEvent.REJECTED_FILTER=DEBUG
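For example, a minimal fragment (a sketch based on the stock log4j.properties shipped with the collector; both event names below appear elsewhere in this thread, but double-check your own file for the exact logger names):

```properties
# Log rejection events at DEBUG to see why documents are dropped.
log4j.logger.CrawlerEvent.REJECTED_FILTER=DEBUG
log4j.logger.CrawlerEvent.REJECTED_UNMODIFIED=DEBUG
```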
What do you get in the logs after you do so?
Hi, thank you for such a quick reply! Now it gives:
INFO [CrawlerEventManager] REJECTED_UNMODIFIED: https://en.wikipedia.org/wiki/Tiger (Subject: LastModifiedMetadataChecksummer[keep=false,targetField=collector.checksum-metadata,disabled=false])
I see that I need to do some kind of "clean"...
Regardless of the exact cause, there is no point in creating a filter that says to accept ALL, since that is the default behavior. I would simply remove that portion from your config.
Which version are you using? Your attached config does not seem to be the one you use. You have a maxDepth of 0, which will not crawl much beyond the first page, so when I tried to reproduce, I could not get the REJECTED_FILTER event and your start URL was crawled properly.
REJECTED_UNMODIFIED means the page has not changed since last crawled. If you want to start a "fresh" crawl and not perform "incremental" crawls (which picks up only new/modified/deleted pages), delete your workdir before you start it again (especially the "crawlstore" directory).
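For example, assuming the workDir used in the configs in this thread (./tigrsoft), a fresh crawl can be forced like this (a sketch; rm -rf is destructive, so double-check the path first):

```shell
# Remove the crawl store so previously crawled pages are forgotten;
# the next run will then treat every page as new instead of UNMODIFIED.
rm -rf ./tigrsoft/crawlstore
```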
About the config: yes, sorry, it may be different; I added the latest version of the config at the bottom of the post (I cannot attach an .xml file, unfortunately... I need to rename it to .txt). Now I use just the "tiger" expression. I understood your point about deleting the workdir. Version: I don't know where to check it... there is no README file. I downloaded the latest version yesterday.
Now I have another problem; I get the page in processedValid:
processedValid https://en.wikipedia.org/wiki/Tiger: https://en.wikipedia.org/wiki/Tiger - UNMODIFIED
I changed the string to "tiger1" and it still remains there...
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright 2010-2015 Norconex Inc.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<!-- This configuration shows the minimum required and basic recommendations
to run a crawler.
-->
<httpcollector id="tigrcollector">
<!-- Decide where to store generated files. -->
<progressDir>./tigrsoft/progress</progressDir>
<logsDir>./tigrsoft/logs</logsDir>
<crawlers>
<crawler id="tigrscrawler">
<!-- Requires at least one start URL (or urlsFile).
Optionally limit crawling to same protocol/domain/port as
start URLs. -->
<startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
<url>https://en.wikipedia.org/wiki/Tiger</url>
</startURLs>
<!-- === Recommendations: ============================================ -->
<!-- Specify a crawler default directory where to generate files. -->
<workDir>./tigrsoft</workDir>
<!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
<maxDepth>0</maxDepth>
<!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
<!-- Before 2.3.0: -->
<sitemap ignore="true" />
<!-- Since 2.3.0: -->
<sitemapResolverFactory ignore="true" />
<!-- Be as nice as you can to sites you crawl. -->
<delay default="5000" />
<documentFilters>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
onMatch="include"
caseSensitive="false">tiger</filter>
</documentFilters>
<!-- Decide what to do with your files by specifying a Committer. -->
<committer class="com.norconex.committer.core.impl.FileSystemCommitter">
<directory>./tigrsoft/crawledFiles</directory>
</committer>
</crawler>
</crawlers>
</httpcollector>
I mean, where do I see the list of URLs that passed the check? Isn't it the processedValid map?
The version is in the filename you downloaded (e.g. norconex-collector-http-2.5.0.zip), and if you run it from the command prompt, the first few lines will have version information as well:
[non-job]: 2016-06-27 14:10:57 INFO - Version: "Collector" version is X.X.X.
But I gather you are using a recent version if you got it yesterday. :-)
The URLs that are processed just fine will be "committed" (sent to your committer). In your case, since you are using the FileSystemCommitter, files that were processed properly will end up under the location you specified: ./tigrsoft/crawledFiles. That's fine for testing, but in the end you probably want to use your own Committer to send crawled files where/how you want them.
In your logs, if you want to see a trace of a document that was not rejected and was sent for committing, you can look for something like this:
INFO [CrawlerEventManager] DOCUMENT_COMMITTED_ADD: https://en.wikipedia.org/wiki/Tiger
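A quick way to check is to grep the crawler log for that event (the path below assumes the default layout under your workDir; adjust it to your own setup):

```shell
# List committed documents in the crawler log; the "|| echo" keeps
# the command from failing when nothing has been committed yet.
LOG=./tigrsoft/logs/latest/logs/tigrscrawler.log
grep "DOCUMENT_COMMITTED_ADD" "$LOG" || echo "no committed documents yet"
```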
If you get UNMODIFIED, it is because you have not cleared your workdir/crawlstore before running, and it remembers documents previously crawled.
I am afraid I am not fully understanding what is not working for you. Can you give a concrete use case of what you are getting or not getting that you suspect is not working right?
OK, see, I added the config file at the bottom, plus the log files. I have been running it for 5 minutes already, and it still has not created the crawledFiles folder.
Here is folder tree:
olga@olgacomp:~/norconex/tigrsoft$ tree
.
├── crawlstore
│ └── mvstore
│ └── tigrscrawler
│ └── mvstore
├── logs
│ └── latest
│ └── logs
│ └── tigrscrawler.log
├── progress
│ └── latest
│ ├── status
│ │ └── tigrscrawler__tigrscrawler.job
│ └── tigrscrawler.index
└── sitemaps
└── tigrscrawler
├── mapdb
├── mapdb.p
└── mapdb.t
Here is the log: tigrscrawler.txt
I run the command like this: ./collector-http.sh -a start -c config.xml
And here is log4j properties: log4j.properties.txt
Tell me what else you need.
<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="tigrcollector">
<progressDir>./tigrsoft/progress</progressDir>
<logsDir>./tigrsoft/logs</logsDir>
<crawlers>
<crawler id="tigrscrawler">
<startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
<url>https://en.wikipedia.org/wiki/Tiger</url>
</startURLs>
<workDir>./tigrsoft</workDir>
<maxDepth>3</maxDepth>
<sitemap ignore="true" />
<sitemapResolverFactory ignore="true" />
<delay default="5000" />
<documentFilters>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
onMatch="include"
caseSensitive="false">about</filter>
</documentFilters>
<committer class="com.norconex.committer.core.impl.FileSystemCommitter">
<directory>./tigrsoft/crawledFiles</directory>
</committer>
</crawler>
</crawlers>
</httpcollector>
According to your logs, it is normal that you have no "committed" documents, since all URLs are being rejected. Your document filter says you want to keep only URLs that match the regular expression "about". None match it in your log, so the behavior is correct.
But there are MANY "about"s on https://en.wikipedia.org/wiki/Tiger...
A document "reference" is a document's unique identifier. In the case of the HTTP Collector you are using, it is the document URL. So the RegexReferenceFilter's job is to filter documents based on their URL matching your regular expression, not their content.
The reference filter takes place before the document is downloaded, to save unnecessary downloads. If you want to filter documents based on patterns matching their content, they will have to be downloaded first, and your best option is to rely on filters found in the Importer module, like this one:
<importer>
  <postParseHandlers>
    <filter class="com.norconex.importer.handler.filter.impl.RegexContentFilter"
        onMatch="include">
      .*about.*
    </filter>
  </postParseHandlers>
</importer>
But it is always best to filter as much as you can before download, for performance considerations (granted, it is not always possible).
Maybe it would help if you explained what you are trying to accomplish exactly? Grab all Wikipedia pages that contain the word "about"? Then the above example will do it, but it will generate quite a massive crawl.
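To make the URL-versus-content distinction concrete, here is a sketch (assuming the same 2.x config schema as the configs in this thread; the regular expressions are illustrative only) that keeps the cheap URL-based filter for what the URL can express, and uses the content filter only for the rest:

```xml
<!-- Cheap: matched against the URL, before download. -->
<documentFilters>
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
      onMatch="include" caseSensitive="false">.*wikipedia\.org.*</filter>
</documentFilters>

<!-- Expensive: matched against the parsed content, after download. -->
<importer>
  <postParseHandlers>
    <filter class="com.norconex.importer.handler.filter.impl.RegexContentFilter"
        onMatch="include">.*about.*</filter>
  </postParseHandlers>
</importer>
```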
OK! I see! It works!!! Thank you very much! Sorry for the late answer :)
Hello, I have the following configuration (file attached): config.txt
Now I run the job, and as a result I see:
INFO [CrawlerEventManager] DOCUMENT_FETCHED: https://en.wikipedia.org/wiki/Tiger ... INFO [CrawlerEventManager] URLS_EXTRACTED: https://en.wikipedia.org/wiki/Tiger INFO [CrawlerEventManager] REJECTED_FILTER: https://en.wikipedia.org/wiki/Tiger
In mvstore I have:
processedInvalid https://en.wikipedia.org/wiki/Tiger: https://en.wikipedia.org/wiki/Tiger - REJECTED
This is the code which generates the line from MVStore:
So why, despite the line:
is it rejected? I tried with the regexp "tiger" too.