Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

(Deliberate) misuse of redirects by website leads to no crawling #361

Closed: liar666 closed this issue 4 years ago

liar666 commented 7 years ago

Hi,

I'm trying to crawl the following page: http://pubs.acs.org/doi/abs/10.1021/acschemneuro.7b00162 This page first redirects to: http://pubs.acs.org/doi/abs/10.1021/acschemneuro.7b00162?cookieSet=1 which in turn redirects back to: http://pubs.acs.org/doi/abs/10.1021/acschemneuro.7b00162

As can be seen in the log extract below, the problem is that collector-http refuses to crawl the initial page a second time. In this case Norconex's collector-http is behaving correctly and it is the website that exhibits the "wrong" behaviour.

ACS: 2017-06-28 17:27:27 DEBUG - Fetching document: http://pubs.acs.org/doi/abs/10.1021/acschemneuro.7b00162
ACS: 2017-06-28 17:27:27 DEBUG - Encoded URI: http://pubs.acs.org/doi/abs/10.1021/acschemneuro.7b00162
ACS: 2017-06-28 17:27:28 DEBUG - URL redirect: http://pubs.acs.org/doi/abs/10.1021/acschemneuro.7b00162 -> http://pubs.acs.org/doi/abs/10.1021/acschemneuro.7b00162?cookieSet=1
ACS: 2017-06-28 17:27:28 DEBUG - Unsupported HTTP Response: HTTP/1.1 302 Found
ACS: 2017-06-28 17:27:28 INFO -       REJECTED_REDIRECTED: http://pubs.acs.org/doi/abs/10.1021/acschemneuro.7b00162
ACS: 2017-06-28 17:27:28 DEBUG - Thread pool-1-thread-1 sleeping for 7.611 seconds.
ACS: 2017-06-28 17:27:35 DEBUG - Fetching document: http://pubs.acs.org/doi/abs/10.1021/acschemneuro.7b00162?cookieSet=1
ACS: 2017-06-28 17:27:35 DEBUG - Encoded URI: http://pubs.acs.org/doi/abs/10.1021/acschemneuro.7b00162?cookieSet=1
ACS: 2017-06-28 17:27:35 DEBUG - URL redirect: http://pubs.acs.org/doi/abs/10.1021/acschemneuro.7b00162?cookieSet=1 -> http://pubs.acs.org/doi/abs/10.1021/acschemneuro.7b00162
ACS: 2017-06-28 17:27:35 DEBUG - Unsupported HTTP Response: HTTP/1.1 302 Found
ACS: 2017-06-28 17:27:35 DEBUG - Redirect target URL is already processed: http://pubs.acs.org/doi/abs/10.1021/acschemneuro.7b00162 (from: http://pubs.acs.org/doi/abs/10.1021/acschemneuro.7b00162?cookieSet=1).
ACS: 2017-06-28 17:27:35 INFO - ACS: 100% completed (2 processed/2 total)

However, I have no choice, if I want to crawl this site, but to crawl some URLs twice. Is there a way to force Norconex to do so?

Thanks in advance

essiembre commented 7 years ago

Here is one way that will do it without getting into coding (add to your <crawler>):

      <httpClientFactory>
        <maxRedirects>0</maxRedirects>
      </httpClientFactory>
      <documentFetcher>
        <validStatusCodes>200,302</validStatusCodes>
      </documentFetcher>     

Disabling redirects and treating 302 as a valid status code does it for the URL you are interested in. It may have a negative impact on "real" redirects though. If so, you may have to look for an alternate solution, or define two crawlers: one with redirects disabled and one with them enabled (each with different start URLs and reference filters), as sketched below.
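
For the two-crawler approach, here is a minimal sketch, assuming only the ACS "doi/abs" pages need the redirect workaround (the crawler ids, second start URL, and filter pattern are placeholders to adapt):

      <crawlers>
        <!-- Crawler for pages caught in the redirect loop: redirects off,
             302 accepted as a valid response. -->
        <crawler id="acs-no-redirects">
          <startURLs>
            <url>http://pubs.acs.org/doi/abs/10.1021/acschemneuro.7b00162</url>
          </startURLs>
          <httpClientFactory>
            <maxRedirects>0</maxRedirects>
          </httpClientFactory>
          <documentFetcher>
            <validStatusCodes>200,302</validStatusCodes>
          </documentFetcher>
        </crawler>
        <!-- Crawler for everything else: default redirect handling, with
             the looping pages excluded so they are not crawled here too. -->
        <crawler id="acs-normal">
          <startURLs>
            <url>http://pubs.acs.org/</url>
          </startURLs>
          <referenceFilters>
            <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
                onMatch="exclude">http://pubs\.acs\.org/doi/abs/.*</filter>
          </referenceFilters>
        </crawler>
      </crawlers>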

liar666 commented 7 years ago

Thanks, it seems to work!

liar666 commented 7 years ago

Hi,

Sorry to reopen this issue, but I have a similar problem with another website ( https://elibrary.ru/ ) and I can't manage to find a solution, even with the maxRedirects and validStatusCodes parameters.

You'll find attached a simplified version of my crawler for a specific page.

The problem seems to be that when you connect to a page like https://elibrary.ru/item.asp?id=29124289 , a transparent redirection occurs to a page that creates a session ( https://elibrary.ru/start_session.asp?rpage=xxx ), returning a 302 status. But when I accept the 302 status with validStatusCodes, as you showed me earlier in this issue, I get an empty page (or an "object moved" page).

When I use the basic crawler (with the section of the config marked HERE deactivated), I get:

INFO  [CrawlerEventManager]           CRAWLER_STARTED
INFO  [AbstractCrawler] eLibraryRu: Crawling references...
INFO  [CrawlerEventManager]       REJECTED_REDIRECTED: https://elibrary.ru/item.asp?id=29124289 (HttpFetchResponse [crawlState=REDIRECT, statusCode=302, reasonPhrase=Object moved (https://elibrary.ru/start_session.asp?rpage=https%3A%2F%2Felibrary%2Eru%2Fitem%2Easp%3Fid%3D29124289)])
INFO  [CrawlerEventManager]           REJECTED_FILTER: https://elibrary.ru/start_session.asp?rpage=https%3A%2F%2Felibrary%2Eru%2Fitem%2Easp%3Fid%3D29124289 (No "include" reference filters matched.)
INFO  [AbstractCrawler] eLibraryRu: Reprocessing any cached/orphan references...
INFO  [AbstractCrawler] eLibraryRu: Crawler finishing: committing documents.

When I activate this section, I get:

INFO  [CrawlerEventManager]           CRAWLER_STARTED
INFO  [AbstractCrawler] eLibraryRu: Crawling references...
INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: https://elibrary.ru/item.asp?id=29124289
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: https://elibrary.ru/item.asp?id=29124289
INFO  [CrawlerEventManager]            URLS_EXTRACTED: https://elibrary.ru/item.asp?id=29124289
INFO  [CrawlerEventManager]         DOCUMENT_IMPORTED: https://elibrary.ru/item.asp?id=29124289
INFO  [CrawlerEventManager]    DOCUMENT_COMMITTED_ADD: https://elibrary.ru/item.asp?id=29124289
INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: https://elibrary.ru/start_session.asp?rpage=https%3A%2F%2Felibrary.ru%2Fitem.asp%3Fid%3D29124289
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: https://elibrary.ru/start_session.asp?rpage=https%3A%2F%2Felibrary.ru%2Fitem.asp%3Fid%3D29124289
INFO  [CrawlerEventManager]            URLS_EXTRACTED: https://elibrary.ru/start_session.asp?rpage=https%3A%2F%2Felibrary.ru%2Fitem.asp%3Fid%3D29124289
INFO  [CrawlerEventManager]           REJECTED_IMPORT: https://elibrary.ru/start_session.asp?rpage=https%3A%2F%2Felibrary.ru%2Fitem.asp%3Fid%3D29124289 (com.norconex.importer.response.ImporterResponse@41ad9d5)
INFO  [AbstractCrawler] eLibraryRu: 100% completed (2 processed/2 total)
INFO  [AbstractCrawler] eLibraryRu: Reprocessing any cached/orphan references...
INFO  [AbstractCrawler] eLibraryRu: Crawler finishing: committing documents.
INFO  [AbstractCrawler] eLibraryRu: 2 reference(s) processed.
INFO  [CrawlerEventManager]          CRAWLER_FINISHED

Attachment: eLibraryRu.norconex.xml.txt

I've tried playing with the maxRedirects and validStatusCodes parameters and accepting start_session pages in the preParseHandlers filters, but nothing works :{. I've also used my browser's (Firefox) Developer Tools and wget to fetch the various pages and see what happens behind the scenes, but with no luck.

If you have any idea...

essiembre commented 7 years ago

The original URL redirects to the "start_session" one, which again redirects to your original one (which at this point was already processed).

Assuming the "start_session" is not invoked with each URL, I recommend you modify your start URL to be: https://elibrary.ru/start_session.asp?rpage=https%3A%2F%2Felibrary%2Eru%2Fitem%2Easp%3Fid%3D29124289

Then it should work fine.
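
In configuration terms, that simply means replacing the start URL (a sketch only; keep the rest of your crawler settings as they are):

      <startURLs>
        <url>https://elibrary.ru/start_session.asp?rpage=https%3A%2F%2Felibrary%2Eru%2Fitem%2Easp%3Fid%3D29124289</url>
      </startURLs>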

liar666 commented 7 years ago

Hi again,

Sorry for the delay in answering.

Unfortunately, depending on how I set maxRedirects (1 or 0, respectively), I get either:

INFO  [CrawlerEventManager]           REJECTED_FILTER: https://elibrary.ru/page_error.asp (No "include" reference filters matched.)

(which is what you get if you try to access "https://elibrary.ru/start_session.asp?rpage=https%3A%2F%2Felibrary%2Eru%2Fitem%2Easp%3Fid%3D29124289" in your browser...)

or

INFO  [CrawlerEventManager]            REJECTED_ERROR: https://elibrary.ru/start_session.asp?rpage=https%3A%2F%2Felibrary.ru%2Fitem.asp%3Fid%3D29124289 (java.lang.IllegalArgumentException: String must not be empty)
ERROR [AbstractCrawler] eLibraryRu: Could not process document: https://elibrary.ru/start_session.asp?rpage=https%3A%2F%2Felibrary.ru%2Fitem.asp%3Fid%3D29124289 (String must not be empty)
java.lang.IllegalArgumentException: String must not be empty
        at org.jsoup.helper.Validate.notEmpty(Validate.java:92)
        at org.jsoup.select.Selector.<init>(Selector.java:83)
        at org.jsoup.select.Selector.select(Selector.java:108)
        at org.jsoup.nodes.Element.select(Element.java:322)
        at com.norconex.importer.handler.tagger.impl.DOMTagger.domExtractDoc(DOMTagger.java:321)
        at com.norconex.importer.handler.tagger.impl.DOMTagger.domExtractDocList(DOMTagger.java:305)
        at com.norconex.importer.handler.tagger.impl.DOMTagger.tagApplicableDocument(DOMTagger.java:293)
        at com.norconex.importer.handler.tagger.AbstractDocumentTagger.tagDocument(AbstractDocumentTagger.java:53)
        at com.norconex.importer.Importer.tagDocument(Importer.java:514)
        at com.norconex.importer.Importer.executeHandlers(Importer.java:345)
        at com.norconex.importer.Importer.importDocument(Importer.java:304)
        at com.norconex.importer.Importer.doImportDocument(Importer.java:266)
        at com.norconex.importer.Importer.importDocument(Importer.java:190)
        at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:37)
        at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
        at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
        at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:358)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:521)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:407)
        at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:789)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

I'm out of ideas. I would welcome any hint on how I could crawl this site...
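
A side note on the exception itself: the stack trace shows jsoup's Selector rejecting an empty selector string, which suggests a DOMTagger entry in the importer configuration ended up with an empty (or unresolved) selector attribute. A well-formed entry would look like the sketch below, where the selector and field name are placeholders, not values from the attached config:

      <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
        <dom selector="div.articleTitle" toField="title" />
      </tagger>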

essiembre commented 7 years ago

The latest snapshot now captures the redirect trail in a new metadata field. With that information in hand, it may be easier to figure out a circular redirection pattern tied to authentication. I will try to research a solution when I have a chance.
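
To inspect that trail, along with the rest of the document metadata, one option is the Importer's DebugTagger, which logs document fields during a crawl; a minimal sketch, assuming it is placed among your pre-parse handlers:

      <preParseHandlers>
        <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"
            logLevel="INFO" />
      </preParseHandlers>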

In the meantime, here is an idea to try: add a URL fragment (hash sign, like #blah) to your start URL. Then change the default URL normalization rules with GenericURLNormalizer so that fragments are no longer removed (they are by default). That way, maybe your authentication redirect will drop the fragment when it sends the redirect. Then they would be two different URLs and the crawler would not think the target was already processed. See the sketch below.
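
A sketch of that normalizer change (the list below reproduces the usual defaults minus removeFragment; double-check the GenericURLNormalizer documentation for your version):

      <urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
        <normalizations>
          lowerCaseSchemeHost, upperCaseEscapeSequence,
          decodeUnreservedCharacters, removeDefaultPort,
          encodeNonURICharacters
        </normalizations>
      </urlNormalizer>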

liar666 commented 7 years ago

Hello again,

Thanks for the prompt reply. Yes, the idea is very interesting; I hadn't thought of that one!

I've tried it but still get the "String must not be empty" error. I'll try various options to see if I can get it to work.

essiembre commented 7 years ago

Why did you close this ticket? Did you get it working?

liar666 commented 7 years ago

Oooops sorry, no it was a mistake.

essiembre commented 6 years ago

Any update on whether the latest snapshot did it for you?

liar666 commented 6 years ago

Sorry, nothing definite for the moment; I had to switch to another task :{ I did some unsuccessful tests, but I'm not sure the failures didn't come from bad settings on my side...