Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Sitemap (Read timed out/Connection Reset), how to stop the crawler if this happen? #465

Closed nFaradiana closed 3 years ago

nFaradiana commented 6 years ago

Hi,

I'm using Norconex to crawl a sitemap into Solr. But when there is a Read Timed Out or Connection Reset (and a few other scenarios), the crawler still proceeds, and I end up with all my data removed from Solr.

Is there any way to stop the crawler when this happens?

essiembre commented 6 years ago

You can have it ignore such problems. Have a look at GenericSpoiledReferenceStrategizer, in combination with setting <orphansStrategy> to IGNORE.
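Put together, the crawler configuration could look like the following sketch (class path, mapping states, and strategies are taken from the 2.x documentation; verify them against your version):

    <orphansStrategy>IGNORE</orphansStrategy>
    <spoiledReferenceStrategizer
        class="com.norconex.collector.core.spoil.impl.GenericSpoiledReferenceStrategizer">
      <!-- Do not delete documents whose fetch failed; ignore them instead. -->
      <mapping state="NOT_FOUND"  strategy="IGNORE"/>
      <mapping state="BAD_STATUS" strategy="IGNORE"/>
      <mapping state="ERROR"      strategy="IGNORE"/>
    </spoiledReferenceStrategizer>

With this in place, fetch failures and orphaned references are left untouched in the repository instead of being deleted.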

nFaradiana commented 6 years ago

I did try this one.

1st crawl: it does DOCUMENT_FETCHED, CREATED_ROBOTS_META, URLS_EXTRACTED, DOCUMENT_IMPORTED, and DOCUMENT_COMMITTED_ADD.

2nd crawl: REJECTED_PREMATURE, because the document is already there.

3rd crawl (sitemap error): it does IGNORE and does not delete or process any record.

4th crawl (sitemap resolved): it behaves like the 1st crawl. Would it be possible, once the sitemap is resolved, for it to behave like the 2nd crawl instead?

nFaradiana commented 6 years ago

And is it possible to use DELETE (delete records that are no longer in the sitemap), but stop the crawl if there is a sitemap ERROR (cannot resolve, timeout, etc.)?

essiembre commented 6 years ago

The REJECTED_PREMATURE is not related to the issue you were facing. It is the result of using a "Recrawlable Resolver" when not enough time has elapsed between your two crawls.

For your connection timeout, was an exception thrown in the logs? If so, you can tell the crawler to stop when that error is encountered with:

    <stopOnExceptions>
        <exception>com.whatever.ExampleException</exception>
    </stopOnExceptions>
nFaradiana commented 6 years ago

Hi, I tried it as below. Did I miss anything?

    <stopOnExceptions>
        <exception>com.norconex.committer.core.CommitterException</exception>
    </stopOnExceptions>

    Exception in thread "main" java.lang.NoSuchMethodError: com.norconex.commons.lang.config.XMLConfigurationUtil.getNullableClass(Lorg/apache/commons/configuration/HierarchicalConfiguration;Ljava/lang/String;Ljava/lang/Class;)Ljava/lang/Class;
        at com.norconex.collector.core.crawler.AbstractCrawlerConfig.loadStopOnExceptions(AbstractCrawlerConfig.java:407)
        at com.norconex.collector.core.crawler.AbstractCrawlerConfig.loadFromXML(AbstractCrawlerConfig.java:340)
        at com.norconex.commons.lang.config.XMLConfigurationUtil.loadFromXML(XMLConfigurationUtil.java:445)
        at com.norconex.collector.core.crawler.CrawlerConfigLoader.loadCrawlerConfig(CrawlerConfigLoader.java:120)
        at com.norconex.collector.core.crawler.CrawlerConfigLoader.loadCrawlerConfigs(CrawlerConfigLoader.java:80)
        at com.norconex.collector.core.AbstractCollectorConfig.loadFromXML(AbstractCollectorConfig.java:304)
        at com.norconex.collector.core.CollectorConfigLoader.loadCollectorConfig(CollectorConfigLoader.java:78)
        at com.norconex.collector.core.AbstractCollectorLauncher.loadCommandLineConfig(AbstractCollectorLauncher.java:141)
        at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:92)
        at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:74)

--> No more errors. My norconex-commons-lang was 1.13; after upgrading to 1.14 it works.

nFaradiana commented 6 years ago

> The REJECTED_PREMATURE is not related to the issue you were facing. It is the result of using a "Recrawlable Resolver" and not enough elapsed time has passed between your two crawls.

For the above, is there a place to set the elapsed time between crawls?

essiembre commented 6 years ago

If you have not set it up yourself already, it likely takes its instructions from the site's sitemap.xml. You can override those and/or set your own using the GenericRecrawlableResolver.
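A minimal sketch of such a resolver entry in the crawler configuration (class path, attribute names, and values are based on the 2.x documentation; verify them against your version):

    <recrawlableResolver
        class="com.norconex.collector.http.recrawl.impl.GenericRecrawlableResolver"
        sitemapSupport="never">
      <!-- Recrawl any matching URL at most once a day,
           ignoring sitemap frequency/lastmod hints. -->
      <minFrequency applyTo="reference" value="daily">.*</minFrequency>
    </recrawlableResolver>

Setting sitemapSupport to "never" means only your own minFrequency rules decide when a document is recrawlable, rather than the sitemap's changefreq/lastmod values.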

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.