Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Faulty links cause Norconex to throw pages away #122

Closed Betongsuggan closed 9 years ago

Betongsuggan commented 9 years ago

I encountered a page containing the link <a href="http://Tel:011- 15 14 54" ...>. The URL is obviously malformed. However, when Norconex encounters this link, it discards the page that contains it and throws the following stack trace:

(Illegal character in authority at index 7: http://Tel:011-15 14 54/robots.txt)
java.lang.IllegalArgumentException: Illegal character in authority at index 7: http://Tel:011-15 14 54/robots.txt
    at java.net.URI.create(URI.java:859)
    at org.apache.http.client.methods.HttpGet.<init>(HttpGet.java:69)
    at com.norconex.collector.http.robot.impl.StandardRobotsTxtProvider.getRobotsTxt(StandardRobotsTxtProvider.java:75)
    at com.norconex.collector.http.pipeline.queue.HttpQueuePipelineContext.<init>(HttpQueuePipelineContext.java:41)
    at com.norconex.collector.http.pipeline.importer.LinkExtractorStage.executeStage(LinkExtractorStage.java:91)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:213)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:473)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:373)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessURLsRunnable.run(AbstractCrawler.java:631)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: java.net.URISyntaxException: Illegal character in authority at index 7: http://Tel:011-15 14 54/robots.txt
    at java.net.URI$Parser.fail(URI.java:2829)
    at java.net.URI$Parser.parseAuthority(URI.java:3167)
    at java.net.URI$Parser.parseHierarchical(URI.java:3078)
    at java.net.URI$Parser.parse(URI.java:3034)
    at java.net.URI.<init>(URI.java:595)
    at java.net.URI.create(URI.java:857)
    ... 14 more
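For readers who want to reproduce the root cause outside the crawler: the trace shows java.net.URI.create() rejecting the robots.txt URL because the authority part contains a space. The sketch below is only an illustration of that behaviour and of one defensive option (checking each extracted link before it is queued). The class and method names (LinkFilter, isParsable, keepParsable) are hypothetical and are not part of Norconex; this is not the fix that shipped in 2.2.0.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Norconex class: filters out links that
// java.net.URI cannot parse, so one bad link does not fail the whole page.
final class LinkFilter {

    private LinkFilter() {
    }

    // Returns true if java.net.URI accepts the link, false otherwise.
    static boolean isParsable(String link) {
        if (link == null) {
            return false;
        }
        try {
            // new URI(...) throws the checked URISyntaxException, whereas
            // URI.create(...) wraps it in an IllegalArgumentException
            // (the exception seen at the top of the stack trace).
            new URI(link);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // Keeps only the links that can be parsed; the rest are skipped.
    static List<String> keepParsable(List<String> links) {
        List<String> kept = new ArrayList<>();
        for (String link : links) {
            if (isParsable(link)) {
                kept.add(link);
            }
        }
        return kept;
    }
}
```

With this kind of check, a link such as http://Tel:011- 15 14 54 would be dropped on its own while the other links on the page are still processed. Whether to skip, log, or try to repair such links is a design choice left open here.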

essiembre commented 9 years ago

Which version are you using? The 2.2.0 snapshot release should fix that. Can you confirm if using the latest snapshot resolves this for you? You can get it here.

essiembre commented 9 years ago

This issue is likely a duplicate of #119.

essiembre commented 9 years ago

Closing since this is a duplicate and a fix has been provided in the latest stable release.