Betongsuggan closed this issue 9 years ago
Which version are you using? The 2.2.0 snapshot release should fix that. Can you confirm if using the latest snapshot resolves this for you? You can get it here.
This issue is likely a duplicate of #119.
Closing as a duplicate; a fix has been provided in the latest stable release.
I encountered a page containing the link <a href="http://Tel:011- 15 14 54" ...>. It is obviously a malformed URL. However, when it encounters this URL, Norconex discards the entire page that contains it, throwing the following stack trace:
(Illegal character in authority at index 7: http://Tel:011-15 14 54/robots.txt)
java.lang.IllegalArgumentException: Illegal character in authority at index 7: http://Tel:011-15 14 54/robots.txt
at java.net.URI.create(URI.java:859)
at org.apache.http.client.methods.HttpGet.<init>(HttpGet.java:69)
at com.norconex.collector.http.robot.impl.StandardRobotsTxtProvider.getRobotsTxt(StandardRobotsTxtProvider.java:75)
at com.norconex.collector.http.pipeline.queue.HttpQueuePipelineContext.<init>(HttpQueuePipelineContext.java:41)
at com.norconex.collector.http.pipeline.importer.LinkExtractorStage.executeStage(LinkExtractorStage.java:91)
at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:213)
at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:473)
at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:373)
at com.norconex.collector.core.crawler.AbstractCrawler$ProcessURLsRunnable.run(AbstractCrawler.java:631)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Caused by: java.net.URISyntaxException: Illegal character in authority at index 7: http://Tel:011-15 14 54/robots.txt
at java.net.URI$Parser.fail(URI.java:2829)
at java.net.URI$Parser.parseAuthority(URI.java:3167)
at java.net.URI$Parser.parseHierarchical(URI.java:3078)
at java.net.URI$Parser.parse(URI.java:3034)
at java.net.URI.<init>(URI.java:595)
at java.net.URI.create(URI.java:857)
... 14 more
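For anyone reproducing this outside the crawler: the trace shows java.net.URI.create rethrowing a URISyntaxException as an unchecked IllegalArgumentException. Below is a minimal sketch of how a caller could validate a link before building a robots.txt URL from it, so that one bad link is skipped instead of aborting the page. This is not the fix that went into Norconex; the class name SafeRobotsUrl and the method robotsTxtUriOrNull are hypothetical names made up for this illustration, and only standard java.net.URI calls are used.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration only; not part of Norconex.
class SafeRobotsUrl {

    /**
     * Returns the robots.txt URI for the site the link points to,
     * or null if the link cannot be parsed or has no host.
     */
    static URI robotsTxtUriOrNull(String link) {
        try {
            // new URI(...) throws the checked URISyntaxException,
            // unlike URI.create(...), which wraps it in an
            // unchecked IllegalArgumentException.
            URI uri = new URI(link);
            if (uri.getScheme() == null || uri.getHost() == null) {
                return null; // e.g. opaque or relative links
            }
            // Rebuild the URI with only scheme, host, and port,
            // and point the path at /robots.txt.
            return new URI(uri.getScheme(), null, uri.getHost(),
                    uri.getPort(), "/robots.txt", null, null);
        } catch (URISyntaxException e) {
            return null; // malformed link: let the caller skip it
        }
    }

    public static void main(String[] args) {
        // The malformed link from this issue.
        System.out.println(robotsTxtUriOrNull("http://Tel:011- 15 14 54"));
        // A well-formed link for comparison.
        System.out.println(robotsTxtUriOrNull("http://example.com/page.html"));
    }
}
```

With a check like this, a caller can log and drop the offending link while still processing the rest of the page.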