Here is one way to do it without getting into coding (add to your <crawler>):
<httpClientFactory>
<maxRedirects>0</maxRedirects>
</httpClientFactory>
<documentFetcher>
<validStatusCodes>200,302</validStatusCodes>
</documentFetcher>
Disabling redirects and treating 302 as a valid status code does it for the URL you are interested in. It may have a negative impact on "real" redirects, though. If so, you may have to look for an alternate solution, or define two crawlers: one with redirects disabled and one with them enabled (each with different start URLs and reference filters, as sketched below).
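A minimal sketch of the two-crawler layout (the crawler ids, start URLs, and filter comments are placeholders, not taken from your setup):
<crawlers>
  <crawler id="redirects-disabled">
    <startURLs>
      <url>http://example.com/page-that-302s</url>
    </startURLs>
    <httpClientFactory>
      <maxRedirects>0</maxRedirects>
    </httpClientFactory>
    <documentFetcher>
      <validStatusCodes>200,302</validStatusCodes>
    </documentFetcher>
    <!-- reference filters matching only the URLs that 302 -->
  </crawler>
  <crawler id="redirects-enabled">
    <startURLs>
      <url>http://example.com/other-pages</url>
    </startURLs>
    <!-- default redirect handling; filters excluding the URLs above -->
  </crawler>
</crawlers>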
Thanks, it seems to work!
Hi,
Sorry to reopen this issue, but I have a similar problem with another website (https://elibrary.ru/) and I can't manage to find a solution, even with the maxRedirects and validStatusCodes parameters.
You'll find attached a simplified version of my crawler for a specific page.
The problem seems to be that when you connect to a page like https://elibrary.ru/item.asp?id=29124289, a transparent redirection occurs to a page that creates a session (https://elibrary.ru/start_session.asp?rpage=xxx), returning a 302 status. But when I accept the 302 status with validStatusCodes, as you showed me earlier in this issue, I get an empty page (or an "object moved" page).
When I use the basic crawler (with the section of the config marked HERE deactivated), I get:
INFO [CrawlerEventManager] CRAWLER_STARTED
INFO [AbstractCrawler] eLibraryRu: Crawling references...
INFO [CrawlerEventManager] REJECTED_REDIRECTED: https://elibrary.ru/item.asp?id=29124289 (HttpFetchResponse [crawlState=REDIRECT, statusCode=302, reasonPhrase=Object moved (https://elibrary.ru/start_session.asp?rpage=https%3A%2F%2Felibrary%2Eru%2Fitem%2Easp%3Fid%3D29124289)])
INFO [CrawlerEventManager] REJECTED_FILTER: https://elibrary.ru/start_session.asp?rpage=https%3A%2F%2Felibrary%2Eru%2Fitem%2Easp%3Fid%3D29124289 (No "include" reference filters matched.)
INFO [AbstractCrawler] eLibraryRu: Reprocessing any cached/orphan references...
INFO [AbstractCrawler] eLibraryRu: Crawler finishing: committing documents.
When I activate this section, I get:
INFO [CrawlerEventManager] CRAWLER_STARTED
INFO [AbstractCrawler] eLibraryRu: Crawling references...
INFO [CrawlerEventManager] DOCUMENT_FETCHED: https://elibrary.ru/item.asp?id=29124289
INFO [CrawlerEventManager] CREATED_ROBOTS_META: https://elibrary.ru/item.asp?id=29124289
INFO [CrawlerEventManager] URLS_EXTRACTED: https://elibrary.ru/item.asp?id=29124289
INFO [CrawlerEventManager] DOCUMENT_IMPORTED: https://elibrary.ru/item.asp?id=29124289
INFO [CrawlerEventManager] DOCUMENT_COMMITTED_ADD: https://elibrary.ru/item.asp?id=29124289
INFO [CrawlerEventManager] DOCUMENT_FETCHED: https://elibrary.ru/start_session.asp?rpage=https%3A%2F%2Felibrary.ru%2Fitem.asp%3Fid%3D29124289
INFO [CrawlerEventManager] CREATED_ROBOTS_META: https://elibrary.ru/start_session.asp?rpage=https%3A%2F%2Felibrary.ru%2Fitem.asp%3Fid%3D29124289
INFO [CrawlerEventManager] URLS_EXTRACTED: https://elibrary.ru/start_session.asp?rpage=https%3A%2F%2Felibrary.ru%2Fitem.asp%3Fid%3D29124289
INFO [CrawlerEventManager] REJECTED_IMPORT: https://elibrary.ru/start_session.asp?rpage=https%3A%2F%2Felibrary.ru%2Fitem.asp%3Fid%3D29124289 (com.norconex.importer.response.ImporterResponse@41ad9d5)
INFO [AbstractCrawler] eLibraryRu: 100% completed (2 processed/2 total)
INFO [AbstractCrawler] eLibraryRu: Reprocessing any cached/orphan references...
INFO [AbstractCrawler] eLibraryRu: Crawler finishing: committing documents.
INFO [AbstractCrawler] eLibraryRu: 2 reference(s) processed.
INFO [CrawlerEventManager] CRAWLER_FINISHED
I've tried playing with the maxRedirects and validStatusCodes parameters and accepting start_session pages in the preParseHandlers filters, but nothing works :{. I've also used my browser's (Firefox) Developer Tools and wget to fetch the various pages and see what happens behind the scenes, but with no luck.
If you have any ideas...
The original URL redirects to the "start_session" one, which in turn redirects back to your original one (which at that point was already processed).
Assuming the "start_session" is not invoked with each URL, I recommend you modify your start URL to be: https://elibrary.ru/start_session.asp?rpage=https%3A%2F%2Felibrary%2Eru%2Fitem%2Easp%3Fid%3D29124289
Then it should work fine.
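In config terms, that is just swapping the <url> entry under <startURLs> (a sketch, assuming the rest of your crawler stays unchanged):
<startURLs>
  <url>https://elibrary.ru/start_session.asp?rpage=https%3A%2F%2Felibrary%2Eru%2Fitem%2Easp%3Fid%3D29124289</url>
</startURLs>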
Hi again,
Sorry for the delay in answering.
Unfortunately, depending on how I set maxRedirects (1 or 0, respectively), I get either:
INFO [CrawlerEventManager] REJECTED_FILTER: https://elibrary.ru/page_error.asp (No "include" reference filters matched.)
(which is what you get if you try to access "https://elibrary.ru/start_session.asp?rpage=https%3A%2F%2Felibrary%2Eru%2Fitem%2Easp%3Fid%3D29124289" in your browser)
or
INFO [CrawlerEventManager] REJECTED_ERROR: https://elibrary.ru/start_session.asp?rpage=https%3A%2F%2Felibrary.ru%2Fitem.asp%3Fid%3D29124289 (java.lang.IllegalArgumentException: String must not be empty)
ERROR [AbstractCrawler] eLibraryRu: Could not process document: https://elibrary.ru/start_session.asp?rpage=https%3A%2F%2Felibrary.ru%2Fitem.asp%3Fid%3D29124289 (String must not be empty)
java.lang.IllegalArgumentException: String must not be empty
at org.jsoup.helper.Validate.notEmpty(Validate.java:92)
at org.jsoup.select.Selector.<init>(Selector.java:83)
at org.jsoup.select.Selector.select(Selector.java:108)
at org.jsoup.nodes.Element.select(Element.java:322)
at com.norconex.importer.handler.tagger.impl.DOMTagger.domExtractDoc(DOMTagger.java:321)
at com.norconex.importer.handler.tagger.impl.DOMTagger.domExtractDocList(DOMTagger.java:305)
at com.norconex.importer.handler.tagger.impl.DOMTagger.tagApplicableDocument(DOMTagger.java:293)
at com.norconex.importer.handler.tagger.AbstractDocumentTagger.tagDocument(AbstractDocumentTagger.java:53)
at com.norconex.importer.Importer.tagDocument(Importer.java:514)
at com.norconex.importer.Importer.executeHandlers(Importer.java:345)
at com.norconex.importer.Importer.importDocument(Importer.java:304)
at com.norconex.importer.Importer.doImportDocument(Importer.java:266)
at com.norconex.importer.Importer.importDocument(Importer.java:190)
at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:37)
at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:358)
at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:521)
at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:407)
at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:789)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I'm out of ideas. I would welcome any hint on how I could crawl this site...
The latest snapshot now captures the redirect trail in a new metadata field. With that information in hand, it may be easier to figure out a circular redirection pattern tied to authentication. I will try to research a solution when I have a chance.
In the meantime, here is an idea to try: add a URL fragment (hash sign, like #blah) to your start URL, then change the default URL normalization rules with GenericURLNormalizer so that fragments are no longer removed (they are by default). That way, maybe your authentication redirect will drop the fragment when it sends the redirect. Then they would be two different URLs and the crawler would not think the page was already processed.
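A sketch of that setup (the normalization list below is illustrative; keep whatever defaults your version uses and simply leave out removeFragment so fragments survive):
<startURLs>
  <url>https://elibrary.ru/item.asp?id=29124289#blah</url>
</startURLs>
<urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
  <normalizations>
    lowerCaseSchemeHost, upperCaseEscapeSequence,
    decodeUnreservedCharacters, removeDefaultPort,
    encodeNonURICharacters
  </normalizations>
</urlNormalizer>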
Hello again,
Thanks for the prompt reply. Yes, the idea is very interesting; I hadn't thought of that one!
I've tried it but still get the "String must not be empty" error. I'll try various options to see if I can get it to work.
Why did you close this ticket? Did you get it working?
Oooops sorry, no it was a mistake.
Any update on whether the latest snapshot did it for you?
Sorry, nothing certain for the moment; I had to switch to another task :{ I did some unsuccessful tests, but I'm not sure the failures didn't come from bad settings on my side...
Hi,
I'm trying to crawl the following page: http://pubs.acs.org/doi/abs/10.1021/acschemneuro.7b00162
This page first redirects to: http://pubs.acs.org/doi/abs/10.1021/acschemneuro.7b00162?cookieSet=1
which in turn redirects back to: http://pubs.acs.org/doi/abs/10.1021/acschemneuro.7b00162
As can be seen in the log extract below, the problem is that collector-http refuses to crawl the initial page a second time. In this case Norconex's collector-http is right and the website shows a "wrong" behaviour.
However, if I want to crawl the site, I have no choice but to crawl some URLs twice. Is there a way to force Norconex to do so?
Thanks in advance