Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

SiteMinder Authentication #218

Open shubhamsamy opened 8 years ago

shubhamsamy commented 8 years ago

Hi, I am not able to crawl a redirected URL. I need to crawl a URL referenced in a page, which redirects to another site. I have attached a snippet from the log showing that the crawl state is 'REDIRECT' and the status code is 302, as shown below: crawlState=REDIRECT, statusCode=302, reasonPhrase=Found

Please have a look and let us know what could be the reason for this.

Regards, Sam redirect log.txt

jetnet commented 8 years ago

try this:

    <metadataFetcher class="${metaFetcher}" >
      <validStatusCodes>200,301,302</validStatusCodes>
    </metadataFetcher>

shubhamsamy commented 8 years ago

Hi, I tried adding the valid status codes but I am still getting the same error. I am attaching the configuration. I have removed site names etc., as these are intranet sites. Please have a look and let me know if there is anything missing in my configuration. Thanks & Regards, Sam crawler.txt

essiembre commented 8 years ago

Without a way to reproduce this, it is hard for me to comment, but looking at your log snippet, nothing indicates the redirect is not being crawled. Understanding these two lines from your log may help:

CrawlerIbnstance1: 2016-01-12 11:46:55 INFO -       REJECTED_REDIRECTED: http://abc.xyz.com/Download?docid=123&Status=FREE (Subject: HttpFetchResponse [crawlState=REDIRECT, statusCode=302, reasonPhrase=Found (http://def.mno.com/get?docid=123&Lang=EN&Rev=T&Format=PDFV1R4)])
CrawlerIbnstance1: 2016-01-12 11:46:55 DEBUG - Queued for processing: http://def.mno.com/get?docid=123&Lang=EN&Rev=T&Format=PDFV1R4

REJECTED_REDIRECTED means the original URL is being dropped in favor of the target URL. That target URL will be crawled unless it is rejected by some other rule you have defined in your config. The line saying "Queued for processing..." tells you the target URL will be processed.

Further in your logs you should find indications of whether it was indeed processed, and the reasons why not if that is the case.

Does this help?
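
To make the reading of those two log lines concrete, here is a minimal sketch in plain JavaScript (Node) that pairs each REJECTED_REDIRECTED source URL with its redirect target and checks whether that target shows up in a "Queued for processing" line. The function name `traceRedirects` and the regular expressions are my own assumptions, derived only from the two log lines quoted above; other log layouts may need different patterns.

```javascript
// Hypothetical helper, not part of Norconex: the patterns below are inferred
// from the two sample log lines in this thread only.
var redirectRe = /REJECTED_REDIRECTED: (\S+) \(Subject: .*reasonPhrase=[^(]*\((\S+)\)\]\)/;
var queuedRe = /Queued for processing: (\S+)/;

function traceRedirects(logText) {
  var queued = {};     // target URLs seen in "Queued for processing" lines
  var redirects = [];  // { source, target } pairs seen in REJECTED_REDIRECTED lines

  logText.split(/\r?\n/).forEach(function (line) {
    var m = redirectRe.exec(line);
    if (m) {
      redirects.push({ source: m[1], target: m[2] });
      return;
    }
    m = queuedRe.exec(line);
    if (m) {
      queued[m[1]] = true;
    }
  });

  // Report, for every redirect, whether its target was queued anywhere in the log
  return redirects.map(function (r) {
    return {
      source: r.source,
      target: r.target,
      targetQueued: queued[r.target] === true
    };
  });
}

// Usage with the two lines quoted above
var sampleLog = [
  'CrawlerIbnstance1: 2016-01-12 11:46:55 INFO -       REJECTED_REDIRECTED: http://abc.xyz.com/Download?docid=123&Status=FREE (Subject: HttpFetchResponse [crawlState=REDIRECT, statusCode=302, reasonPhrase=Found (http://def.mno.com/get?docid=123&Lang=EN&Rev=T&Format=PDFV1R4)])',
  'CrawlerIbnstance1: 2016-01-12 11:46:55 DEBUG - Queued for processing: http://def.mno.com/get?docid=123&Lang=EN&Rev=T&Format=PDFV1R4'
].join('\n');
console.log(JSON.stringify(traceRedirects(sampleLog), null, 2));
```

A redirect whose target was queued but never appears again in the log is the one to investigate further, for example for an authentication wall on the target site.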

shubhamsamy commented 8 years ago

Hi Pascal, Thanks for your input. The page is queued but never gets crawled, as it redirects to a site that uses SiteMinder authentication. Please let us know if there is a plan to add support for SiteMinder authentication. Thanks & Regards, Sam

essiembre commented 8 years ago

Do you by any chance have or know of a public login form with a test/demo account we can use to start on this?

shubhamsamy commented 8 years ago

Hi Pascal, I am sorry, but these are intranet sites and are not available outside. Regards, Praveen

essiembre commented 8 years ago

I am marking this as a feature request.

It will likely remain open until we can get our hands on a public SiteMinder site we can use for testing/implementing this.

You can always contact Norconex to have someone work on your intranet to put this in place.

akshaybijawe commented 7 years ago

Hi Pascal, do you have any update regarding SiteMinder Authentication? Thanks.

essiembre commented 7 years ago

Hello Akshay, No update. Do you have a SiteMinder site with temporary access so we can give it a try?

Krishna210414 commented 6 years ago

Hi, any update on this? Will it be possible to crawl SiteMinder-authenticated URLs? I am facing the same issue; any help would be greatly appreciated.

essiembre commented 6 years ago

@Krishna210414, the issue is the same: we need a sample SiteMinder site we can use as a test. Do you have one you can share?

Krishna210414 commented 6 years ago

Nope. I don't have one to share on the forum, but I want to know: is there a specific setting with which I will be able to crawl the redirected URL?

Krishna210414 commented 6 years ago

It would be helpful if you could provide a way to pass the target URL as a parameter to the authentication URL.

wolverline commented 6 years ago

@shubhamsamy In my experience, the httpClientFactory login has limited capabilities. That is understandable, as there are many different auth methods, including federated login. I tried it with sites built on Drupal, which uses regular form auth. If it doesn't work with your intranet, it is probably because the site hops across several pages to authenticate. In that case, using PhantomJS seems to be the best bet with Norconex for now. I was able to crawl through both FORM and SAML auth. After all, PhantomJS is a headless browser; it does take quite a bit of hacking, though, especially for SAML auth.

Krishna210414 commented 6 years ago

Thanks for the reply. Can you share your logic?

wolverline commented 6 years ago

@Krishna210414 I am not sure if you have the same issue as @shubhamsamy does. If you're dealing with httpClientFactory, have you tried the following config?

<metadataFetcher class="$metaFetcher">
  <validStatusCodes>200,302,403</validStatusCodes>
</metadataFetcher>

Krishna210414 commented 6 years ago

I tried that; it didn't work.

wolverline commented 6 years ago

If you're trying to do Form Auth, you can configure:

<httpcollector id="My Collector">
  <crawlers>
    <crawler id="$crawler-id">
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>$crawler-url</url>
      </startURLs>

      <metadataFetcher class="$metaFetcher">
        <validStatusCodes>200,302,403</validStatusCodes>
      </metadataFetcher>

      <documentFetcher class="${http}.fetch.impl.PhantomJSDocumentFetcher"
        detectContentType="true" detectCharset="true" screenshotEnabled="true">
        <exePath>${run-path}</exePath>
        <scriptPath>${script-path}</scriptPath>
        <resourceTimeout>5000</resourceTimeout>
        <validStatusCodes>200,302,403</validStatusCodes>
        <notFoundStatusCodes>404</notFoundStatusCodes>
        <referencePattern>^https://.*</referencePattern>
        <renderWaitTime>3000</renderWaitTime>
        <screenshotDimensions>600x400</screenshotDimensions>
        <screenshotZoomFactor>0.25</screenshotZoomFactor>
        <screenshotScaleDimensions>300</screenshotScaleDimensions>
        <screenshotScaleStretch>false</screenshotScaleStretch>
        <screenshotScaleQuality>medium</screenshotScaleQuality>
        <screenshotImageFormat>png</screenshotImageFormat>
        <screenshotStorage>disk</screenshotStorage>
        <screenshotStorageDiskDir structure="url2path">${workdir}/screenshot</screenshotStorageDiskDir>
        <screenshotStorageDiskField>dummy</screenshotStorageDiskField>
      </documentFetcher>
......
    </crawler>
  </crawlers>
</httpcollector>

Then add the following JS file. The code attempts form auth when the (existing) session cookie is not valid. Note that the current documentFetcher version cannot pass custom arguments to the script, so the configuration has to be defined within the JS file. The following JS code is a working example. However, your form auth method may differ from that of the sites I am working on (in my case, they are Drupal sites). Chances are you will have to customize, test, and debug it further.

/**
 * This file is used and is required by the PhantomJSDocumentFetcher.  
 * Modifying this file could break PhantomJSDocumentFetcher behavior.
 */
var webPage = require('webpage');
var page;
var loginPage;
var fs = require('fs');
var system = require('system');

// Phantomjs global config
phantom.cookiesEnabled = true;
phantom.javascriptEnabled = true;
phantom.state = 'no-state';

//#############
// Local config
// ############
var loginAttempt = 0;
var userName = "username";
var userPass = "password";
var workDir  = '/path/to/work/dir';
// Define session cookie file
// in order for PhantomJS to keep a session alive
// make it sure to be writable
var cookie = workDir + '/cookies/cookie.json';
var loginUrl  = 'https://example.com/login'; // site login link where a login form presents
var logoutUrl = 'https://example.com/logout'; // site logout link

if (system.args.length !== 10) {
  system.stderr.writeLine('Invalid number of arguments.');
  phantom.exit(1);
}

var url = system.args[1];           // The URL to fetch
var outfile = system.args[2];       // The temp output file
var timeout = system.args[3];       // How long to wait for the whole page to render
var bindId = system.args[4];        // HttpClient binding id
var protocol = system.args[5];      // Was the original URL "https" or "http"?
var thumbnailFile = system.args[6]; // Optional path to image file
var dimension = system.args[7];     // e.g. 1024x768
var zoomFactor = system.args[8];    // e.g. 0.25 (25%)
var resourceTimeout = system.args[9]; // timeout for a single page resource

var addCookieInfo = function() {
  Array.prototype.forEach.call(JSON.parse(fs.read(cookie)), function(param) {
    phantom.addCookie(param);
  });
};

var removeCookies = function() {
  if (fs.exists(cookie)) {
    fs.remove(cookie);
  }
  if (typeof loginPage === 'object') {
    loginPage.close();
  }
  loginPage = webPage.create();
  loginPage.open(logoutUrl, function(status) {
    if (status === "success") {}
  });
}

function runLogin() {
  if (typeof loginPage === 'object') {
    loginPage.close();
  }
  if (loginAttempt > 2) {
    system.stderr.writeLine('Reached max login attempt.');
    phantom.exit();
  }
  else {
    loginAttempt++;
    loginPage = webPage.create();
    loginPage.open(loginUrl, function(status) {
      if (status === "success") {
        // system.stderr.writeLine('Form auth started.');
        /**
         * #############################################
         * NOTE: Login Form
         * Customize for UserID, Password, and Form fields
         * Or rewrite to pass each objects to this function
         * #############################################
         */
        loginPage.evaluate(function(uname, upass) {
          document.getElementById("username").value = uname;
          document.getElementById("userpass").value = upass;
          document.getElementById("loginform").submit();
          //docForm = document.getElementsByTagName("form");
          //docForm[0].submit();
        }, userName, userPass);

        loginPage.onLoadFinished = function(status) { 
          if (status === 'success') {
            if (!phantom.state || phantom.state == 'no-state') {
              phantom.state = 'no-session';
            }
            if (phantom.state === 'no-session') {
              fs.write(cookie, JSON.stringify(phantom.cookies), "w");
              phantom.state = 'run-state';
              setTimeout(runPage, 500);
            }
          }
        };
      }
    });
  }
}

/**
 * Set variables with Norconex options
 */
function setPage() {
  page.onResourceError = function(resourceError) {
    system.stderr.writeLine(resourceError.url + ': ' + resourceError.errorString);
  };
  if (thumbnailFile && dimension) {
    var pageWidth = 1024;
    var pageHeight = 768;
    if (dimension) {
      var size = dimension.split('x');
      pageWidth = parseInt(size[0], 10) * zoomFactor;
      pageHeight = parseInt(size[1], 10) * zoomFactor;
    }
    page.viewportSize = { width: pageWidth, height: pageHeight };
    page.clipRect = { top: 0, left: 0, width: pageWidth, height: pageHeight };
  }
  if (thumbnailFile && zoomFactor) {
    page.zoomFactor = zoomFactor;
  }

  if (bindId !== "-1") {
    page.customHeaders = {
      "collector.proxy.bindId": bindId,
      "collector.proxy.protocol": protocol
    };
  }
  if (resourceTimeout !== "-1") {
    page.settings.resourceTimeout = resourceTimeout;
  }
}

function runPage() {
  if (typeof page === 'object') {
    page.close();
  } 
  page = webPage.create();
  addCookieInfo();
  setPage();
  page.open(url, function(status) {
    if (status !== 'success') {
      system.stderr.writeLine('Unsuccessful loading of: ' + url + ' (status=' + status + ').');
      system.stderr.writeLine('Content: ' + page.content);
      if (page.content) {
        fs.write(outfile, "error", 'w');
      }
      phantom.exit();
    }
    else {
      if (phantom.state === 'run-state') {
        window.setTimeout(function() {
          if (thumbnailFile) {
            page.render(thumbnailFile);
          }
          if (page.content) {
            fs.write(outfile, page.content, 'w');
          }
          // page.render("test_page.png");
          phantom.exit();
        }, timeout);
      }

    }   
  });

  page.onResourceReceived = function(response) {  
    if (response.stage == 'end'){
      return;
    }
    if (response.url == url) {
      if (response.status == 403) {
        phantom.state = 'no-session';

      }
      else {
        phantom.state = 'run-state';
        response.headers.forEach(function(header){
          system.stdout.writeLine('HEADER:' + header.name + '=' + header.value);
        });
        system.stdout.writeLine('STATUS:' + response.status);
        system.stdout.writeLine('STATUSTEXT:' + response.statusText);
        system.stdout.writeLine('CONTENTTYPE:' + response.contentType);
      }
    }
  };

  page.onLoadFinished = function(status) {
    if (status === 'success') {
      if (phantom.state == 'no-session') {
        removeCookies();
        setTimeout(runLogin, 500);
      }
    }
  };
}

if (!fs.isFile(cookie)) {
  runLogin();
}
else {
  runPage();
}
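
The script above reads nine positional arguments after the script path, which is why it checks for a length of 10. As a minimal sketch of that contract, the helper below maps them to named fields so a wrong invocation fails with a clear message. `parseFetcherArgs` and `ARG_NAMES` are hypothetical names of mine; the argument order is taken only from the comments in the script above, and the sample values are made up.

```javascript
// Argument names in the order the script above reads them from system.args[1..9].
// Index 0 holds the script path itself.
var ARG_NAMES = [
  'url', 'outfile', 'timeout', 'bindId', 'protocol',
  'thumbnailFile', 'dimension', 'zoomFactor', 'resourceTimeout'
];

// Hypothetical helper: turns the positional argument array into a named object.
function parseFetcherArgs(args) {
  var expected = ARG_NAMES.length + 1; // +1 for the script path at index 0
  if (args.length !== expected) {
    throw new Error('Expected ' + expected + ' arguments, got ' + args.length);
  }
  var parsed = {};
  ARG_NAMES.forEach(function (name, i) {
    parsed[name] = args[i + 1];
  });
  return parsed;
}

// Usage with made-up values; inside PhantomJS you would pass system.args instead
var example = parseFetcherArgs([
  'script.js', 'https://example.com/page', '/tmp/out.html', '3000',
  '-1', 'https', '', '600x400', '0.25', '5000'
]);
console.log(example.url + ' -> ' + example.outfile);
```

The `<scriptPath>` element in the config above is what points the fetcher at this file. To debug the login logic outside the collector, the script can also be run by hand with the `phantomjs` executable followed by the script path and the same nine arguments.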

Krishna210414 commented 6 years ago

Thanks for the logic. How do I ensure the JS is invoked, so that I can start adding the logic for redirection?