Closed nFaradiana closed 3 years ago
You can have it ignore such problems. Have a look at GenericSpoiledReferenceStrategizer, in combination with setting <orphansStrategy> to IGNORE.
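A minimal sketch of how those two settings might look inside the <crawler> section of a Norconex HTTP Collector 2.x configuration. The specific state-to-strategy mappings below are illustrative assumptions, not taken from this thread:

```xml
<!-- Inside your <crawler> configuration (Norconex HTTP Collector 2.x). -->

<!-- Do not delete or reprocess documents missing from the current crawl. -->
<orphansStrategy>IGNORE</orphansStrategy>

<!-- Decide what to do with previously committed documents that can no
     longer be fetched successfully. -->
<spoiledReferenceStrategizer
    class="com.norconex.collector.core.spoil.impl.GenericSpoiledReferenceStrategizer"
    fallbackStrategy="IGNORE">
  <!-- Example mappings (assumptions): only delete on a real 404;
       keep documents when the failure is a transient error. -->
  <mapping state="NOT_FOUND"  strategy="DELETE"/>
  <mapping state="BAD_STATUS" strategy="IGNORE"/>
  <mapping state="ERROR"      strategy="IGNORE"/>
</spoiledReferenceStrategizer>
```

With mappings like these, a crawl that hits timeouts or other errors should leave the already-committed documents in place instead of issuing deletions.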
I did try this one.
1st crawl: it does DOCUMENT_FETCHED, CREATED_ROBOTS_META, URLS_EXTRACTED, DOCUMENT_IMPORTED, and DOCUMENT_COMMITTED_ADD.
2nd crawl: it does REJECTED_PREMATURE, because the documents are already there.
3rd crawl (sitemap error): it does IGNORE and does not delete or process any record.
4th crawl (sitemap resolved): it behaves like the 1st crawl. Would it be possible, once the sitemap is resolved, for it to behave like the 2nd crawl instead?
And is it possible if I want to use
The REJECTED_PREMATURE is not related to the issue you were facing. It is the result of using a "Recrawlable Resolver" when not enough time has elapsed between your two crawls.
For your connection timeout, was an exception thrown in the logs? If so, you can tell the crawler to stop if that error is encountered with:
<stopOnExceptions>
<exception>com.whatever.ExampleException</exception>
</stopOnExceptions>
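For the read-timeout and connection-reset cases described in this thread, the relevant classes would be the standard JDK network exceptions. The element is the same one shown above; the specific exception classes listed here are my assumption about what a timeout/reset would surface as, not something confirmed in the thread:

```xml
<stopOnExceptions>
  <!-- Read timeouts -->
  <exception>java.net.SocketTimeoutException</exception>
  <!-- Connection reset and similar low-level socket failures -->
  <exception>java.net.SocketException</exception>
</stopOnExceptions>
```

Check your crawler logs for the exact exception class thrown, and list that class here.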
Hi, I tried it as below. Is there anything I missed adding?
Exception in thread "main" java.lang.NoSuchMethodError: com.norconex.commons.lang.config.XMLConfigurationUtil.getNullableClass(Lorg/apache/commons/configuration/HierarchicalConfiguration;Ljava/lang/String;Ljava/lang/Class;)Ljava/lang/Class;
    at com.norconex.collector.core.crawler.AbstractCrawlerConfig.loadStopOnExceptions(AbstractCrawlerConfig.java:407)
    at com.norconex.collector.core.crawler.AbstractCrawlerConfig.loadFromXML(AbstractCrawlerConfig.java:340)
    at com.norconex.commons.lang.config.XMLConfigurationUtil.loadFromXML(XMLConfigurationUtil.java:445)
    at com.norconex.collector.core.crawler.CrawlerConfigLoader.loadCrawlerConfig(CrawlerConfigLoader.java:120)
    at com.norconex.collector.core.crawler.CrawlerConfigLoader.loadCrawlerConfigs(CrawlerConfigLoader.java:80)
    at com.norconex.collector.core.AbstractCollectorConfig.loadFromXML(AbstractCollectorConfig.java:304)
    at com.norconex.collector.core.CollectorConfigLoader.loadCollectorConfig(CollectorConfigLoader.java:78)
    at com.norconex.collector.core.AbstractCollectorLauncher.loadCommandLineConfig(AbstractCollectorLauncher.java:141)
    at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:92)
    at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:74)
--> No more errors. My norconex-commons-lang was 1.13; using 1.14 now and it's okay.
The REJECTED_PREMATURE is not related to the issue you were facing. It is the result of using a "Recrawlable Resolver" and not enough elapsed time has passed between your two crawls.
For the above, is there any place to set the elapsed time between crawls?
If you have not set it up yourself already, it probably takes the instructions from the site's sitemap.xml. You can overwrite those and/or set your own using the GenericRecrawlableResolver.
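A sketch of what a GenericRecrawlableResolver entry could look like in the <crawler> section, assuming Norconex HTTP Collector 2.x. The daily frequency and the catch-all pattern are illustrative assumptions, not values from this thread:

```xml
<!-- Controls how much time must elapse before a URL is recrawled;
     documents checked too soon are REJECTED_PREMATURE. -->
<recrawlableResolver
    class="com.norconex.collector.http.recrawl.impl.GenericRecrawlableResolver"
    sitemapSupport="last">
  <!-- Recrawl matching URLs at most once a day (illustrative). -->
  <minFrequency applyTo="reference" value="daily">.*</minFrequency>
</recrawlableResolver>
```

The sitemapSupport attribute controls whether the sitemap's lastmod/changefreq directives take precedence over your own minFrequency rules; check the class documentation for the exact semantics of "first", "last", and "never".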
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi,
I'm using Norconex to crawl a sitemap into Solr. But when there is a Read Timeout, Connection Reset, or a few other scenarios, the crawler still proceeds and ends up removing all my data from Solr.
Is there any way to stop the crawler when such a scenario happens?