Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Importer did not kick in #301

Closed by tungdx 8 years ago

tungdx commented 8 years ago

In the AbstractCrawler.java class, my crawler went into the processNextQueuedCrawlData() method and reached the branch that has your TODO message "Fire an event here? If we get here, the importer did not kick in". It happens with only some websites; others work fine.

Here is my config for crawlers:

<crawler>
    <maxDepth>2</maxDepth>
    <numThreads>1</numThreads>
    <sitemapResolverFactory ignore="true" />
    <robotsTxt ignore="true" />
    <robotsMeta ignore="true" />
    <canonicalLinkDetector ignore="true" /> 

    <crawlDataStoreFactory
        class="com.norconex.collector.http.data.store.impl.mongo.MongoCrawlDataStoreFactory">
        <host>localhost</host>
        <port>27017</port>
    </crawlDataStoreFactory>

    <committer
        class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
        <indexName>test</indexName>
        <targetContentField>content</targetContentField>
        <queueSize>1</queueSize>
        <maxRetries>2</maxRetries>
        <maxRetryWait>5000</maxRetryWait>
    </committer>
</crawler>        

How can I correct it? Thanks in advance!

essiembre commented 8 years ago

What is it you want to correct? What is your issue/problem? That TODO is not an issue. :-) It is merely questioning whether an extra event should be fired for those having registered event listeners with the crawler.

If you suspect some documents were rejected for invalid reasons, check the logs for the exact cause. You can also set the log level to DEBUG in the log4j.properties file for the rejection types you are interested in, which will (sometimes) give you more information. E.g.:

log4j.logger.CrawlerEvent.REJECTED_FILTER=DEBUG
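
The same pattern can be applied to other rejection types. The fragment below is only a sketch: it assumes the event names REJECTED_IMPORT, REJECTED_ERROR and REJECTED_BAD_STATUS exist in your collector version and use the same CrawlerEvent.<NAME> logger naming as the line above. Check them against the log4j.properties shipped with your installation before relying on them.

```properties
# Assumed event names; verify against your collector's bundled log4j.properties.
# Documents rejected by the importer (closest to the "importer did not kick in" case).
log4j.logger.CrawlerEvent.REJECTED_IMPORT=DEBUG
# Documents rejected because an error occurred while processing them.
log4j.logger.CrawlerEvent.REJECTED_ERROR=DEBUG
# Documents rejected because of an unexpected HTTP status code.
log4j.logger.CrawlerEvent.REJECTED_BAD_STATUS=DEBUG
```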

tungdx commented 8 years ago

Sorry for this question. I will debug it more carefully. Thanks.