Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

cannot crawl via proxy: BasicHttpRequest cannot be cast to HttpUriRequest #167

Closed: jetnet closed this issue 8 years ago

jetnet commented 8 years ago

hi Pascal, first of all, I'd like to thank you and your team for developing a new free crawler! For many years we've been trying to find an alternative to the Autonomy http connector/fetch, and I must say, the Norconex http-collector is a very promising crawler! We'll keep an eye on it :)

So, the very first issue I found: downloading via a proxy does not work (Version: norconex-collector-http-2.3.0-SNAPSHOT). My proxy settings:

<proxyHost>proxy.intranet.com</proxyHost>
<proxyPort>8080</proxyPort>
<proxyScheme>http</proxyScheme>
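
For completeness, these settings sit inside the httpClientFactory section of my config; this is a sketch from memory, so the surrounding tags and class attribute may differ slightly from my actual file:

<httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
  <proxyHost>proxy.intranet.com</proxyHost>
  <proxyPort>8080</proxyPort>
  <proxyScheme>http</proxyScheme>
</httpClientFactory>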

and the error:

www.site.de: 2015-10-15 11:05:41 DEBUG - ACCEPTED document reference. Reference=https://www.site.de/index.htm Filter=com.norconex.collector.core.filter.impl.ExtensionReferenceFilter@262b2c86[onMatch=EXCLUDE,extensions=jpg,gif,png,ico,css,js,caseSensitive=false]
www.site.de: 2015-10-15 11:05:41 DEBUG - Queued for processing: https://www.site.de/index.htm
www.site.de: 2015-10-15 11:05:41 INFO - 1 start URLs identified.
www.site.de: 2015-10-15 11:05:41 INFO -           CRAWLER_STARTED (Subject: com.norconex.collector.http.crawler.HttpCrawler@30c93896)
www.site.de: 2015-10-15 11:05:41 INFO - www.site.de: Crawling references...
www.site.de: 2015-10-15 11:05:41 DEBUG - www.site.de: Crawler thread #1 started.
www.site.de: 2015-10-15 11:05:41 DEBUG - www.site.de: Processing reference: https://www.site.de/index.htm
www.site.de: 2015-10-15 11:05:41 DEBUG - Fetching document: https://www.site.de/index.htm
www.site.de: 2015-10-15 11:05:41 DEBUG - Encoded URI: https://www.site.de/index.htm
www.site.de: 2015-10-15 11:05:41 ERROR - Cannot fetch document: https://www.site.de/index.htm (org.apache.http.message.BasicHttpRequest cannot be cast to org.apache.http.client.methods.HttpUriRequest)
java.lang.ClassCastException: org.apache.http.message.BasicHttpRequest cannot be cast to org.apache.http.client.methods.HttpUriRequest
    at com.norconex.collector.http.crawler.HttpCrawlerRedirectStrategy.isRedirected(HttpCrawlerRedirectStrategy.java:65)
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:113)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
    at com.norconex.collector.http.fetch.impl.GenericDocumentFetcher.fetchDocument(GenericDocumentFetcher.java:110)
    at com.norconex.collector.http.pipeline.importer.DocumentFetcherStage.executeStage(DocumentFetcherStage.java:48)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:284)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:475)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:376)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:703)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
www.site.de: 2015-10-15 11:05:41 INFO -            REJECTED_ERROR: https://www.site.de/index.htm (Subject: com.norconex.collector.core.CollectorException: java.lang.ClassCastException: org.apache.http.message.BasicHttpRequest cannot be cast to org.apache.http.client.methods.HttpUriRequest)
www.site.de: 2015-10-15 11:05:41 ERROR - www.site.de: Could not process document: https://www.site.de/index.htm (java.lang.ClassCastException: org.apache.http.message.BasicHttpRequest cannot be cast to org.apache.http.client.methods.HttpUriRequest)
com.norconex.collector.core.CollectorException: java.lang.ClassCastException: org.apache.http.message.BasicHttpRequest cannot be cast to org.apache.http.client.methods.HttpUriRequest
    at com.norconex.collector.http.fetch.impl.GenericDocumentFetcher.fetchDocument(GenericDocumentFetcher.java:170)
    at com.norconex.collector.http.pipeline.importer.DocumentFetcherStage.executeStage(DocumentFetcherStage.java:48)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:284)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:475)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:376)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:703)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: org.apache.http.message.BasicHttpRequest cannot be cast to org.apache.http.client.methods.HttpUriRequest
    at com.norconex.collector.http.crawler.HttpCrawlerRedirectStrategy.isRedirected(HttpCrawlerRedirectStrategy.java:65)
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:113)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
    at com.norconex.collector.http.fetch.impl.GenericDocumentFetcher.fetchDocument(GenericDocumentFetcher.java:110)
    ... 11 more

The robots.txt cannot be downloaded/checked either; it fails with the same error.

Do you have an idea what could be wrong here? Is it an issue with our proxy server? The Autonomy http connector does work with the same proxy :) Thank you!

essiembre commented 8 years ago

Thank you for your interest in our crawler. It has already been used in different Autonomy projects with good success (DIH and CFS).

Your problem does not appear to be a proxy problem, but rather a coding issue, probably introduced in 2.3.0-SNAPSHOT. Stay tuned for a fix.
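
For the record, the failing cast is in HttpCrawlerRedirectStrategy#isRedirected, which assumed every request it receives is an HttpUriRequest; when requests are routed through a proxy, HttpClient can hand it a BasicHttpRequest instead, as your stack trace shows. Here is a minimal sketch of the defensive idea (the class name and bookkeeping are illustrative, not the exact patch):

import org.apache.http.HttpRequest;
import org.apache.http.HttpResponse;
import org.apache.http.ProtocolException;
import org.apache.http.client.methods.HttpUriRequest;
import org.apache.http.impl.client.DefaultRedirectStrategy;
import org.apache.http.protocol.HttpContext;

// Hypothetical defensive variant (not the actual fix): with a proxy in play,
// HttpClient may pass a BasicHttpRequest here, so the concrete request type
// must not be assumed.
public class SafeRedirectStrategy extends DefaultRedirectStrategy {

    @Override
    public boolean isRedirected(HttpRequest request, HttpResponse response,
            HttpContext context) throws ProtocolException {
        boolean redirected = super.isRedirected(request, response, context);
        if (redirected) {
            // Resolve the current URI without an unchecked cast.
            String uri;
            if (request instanceof HttpUriRequest) {
                uri = ((HttpUriRequest) request).getURI().toString();
            } else {
                // e.g. BasicHttpRequest: fall back to the raw request line.
                uri = request.getRequestLine().getUri();
            }
            // ... record "uri" as the redirect source (crawler-specific).
        }
        return redirected;
    }
}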

essiembre commented 8 years ago

I could not replicate your environment, but I think I managed to fix the issue nonetheless. Please give this new snapshot a try.

jetnet commented 8 years ago

hi Pascal,

I can confirm the issue is gone (tested with norconex-collector-http-2.3.0-20151017.032546-23.zip). Thank you very much for such a quick response/fix! Some time ago, we had an issue with the Autonomy fetch (it could not work with a proxy either), and it took 3 or 4 months to get a proper fix from Autonomy (with escalations, on-site work, and so on)! I can feel the difference already! :)