Open shubhamsamy opened 8 years ago
try this:
<metadataFetcher class="${metaFetcher}" >
<validStatusCodes>200,301,302</validStatusCodes>
</metadataFetcher>
Hi, I have tried adding the valid status codes but I am still getting the same error. I am attaching the configuration; I have removed the site names and similar details since these are intranet sites. Please have a look and let me know if anything is missing in my configuration. Thanks & Regards, Sam crawler.txt
Without a way to reproduce the problem it is hard for me to comment, but looking at your log snippet, nothing indicates the redirect target is not being crawled. Understanding these two lines from your log may help:
CrawlerIbnstance1: 2016-01-12 11:46:55 INFO - REJECTED_REDIRECTED: http://abc.xyz.com/Download?docid=123&Status=FREE (Subject: HttpFetchResponse [crawlState=REDIRECT, statusCode=302, reasonPhrase=Found (http://def.mno.com/get?docid=123&Lang=EN&Rev=T&Format=PDFV1R4)])
CrawlerIbnstance1: 2016-01-12 11:46:55 DEBUG - Queued for processing: http://def.mno.com/get?docid=123&Lang=EN&Rev=T&Format=PDFV1R4
REJECTED_REDIRECTED means the original URL is being dropped in favor of the target URL. That target URL will be crawled unless it is rejected by some other rules you have defined in your config.
The line saying "Queued for processing..." tells you the target URL will be processed.
Further along in your logs you should find indications of whether it was indeed processed, and if not, the reasons why.
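To check what happened to the redirect target, you can simply filter the log for the target URL and look at the event names. A minimal sketch in plain JavaScript, using sample lines modeled on the log snippet above (the DOCUMENT_FETCHED line is hypothetical, for illustration only):

```javascript
// Trace a redirect target's fate by filtering crawler log lines for its URL.
// Sample lines mirror the log snippet above; DOCUMENT_FETCHED is illustrative.
const targetUrl = "http://def.mno.com/get?docid=123&Lang=EN&Rev=T&Format=PDFV1R4";
const logLines = [
  "REJECTED_REDIRECTED: http://abc.xyz.com/Download?docid=123&Status=FREE",
  "Queued for processing: " + targetUrl,
  "DOCUMENT_FETCHED: " + targetUrl
];
// Keep only lines mentioning the target URL, then extract the event name.
const events = logLines
  .filter(function (line) { return line.indexOf(targetUrl) !== -1; })
  .map(function (line) { return line.split(": ")[0]; });
console.log(events); // event names recorded for the target URL
```

On real logs you would read the file line by line instead of using an in-memory array, but the filtering idea is the same.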
Does this help?
Hi Pascal, thanks for your input. The page is queued but never gets crawled, as it redirects to a site that uses SiteMinder authentication. Please let us know if there is a plan to add a feature supporting SiteMinder authentication. Thanks & Regards, Sam
Do you by any chance have or know of a public login form with a test/demo account we can use to start on this?
Hi Pascal, I am sorry, but these are intranet sites and are not accessible from outside. Regards, Praveen
I am marking this as a feature request.
It will likely remain open until we can get our hands on a public SiteMinder site we can use for testing/implementing this.
You can always contact Norconex to have someone work on your intranet to put this in place.
Hi Pascal, do you have any update regarding SiteMinder Authentication? Thanks.
Hello Akshay, No update. Do you have a SiteMinder site with temporary access so we can give it a try?
Hi, any update on this? Will it be possible to crawl SiteMinder-authenticated URLs? I am facing the same issue; any help would be greatly appreciated.
@Krishna210414 , the issue is the same: we need a sample SiteMinder site we can use as a test. You got one you can share?
Nope, I don't have one to share on the forum. But I want to know: is there a specific setting with which I will be able to crawl the redirected URL?
It would be helpful if you could provide a way to pass the target URL as a parameter to the authentication URL.
@shubhamsamy In my experience, httpClientFactory login has limited capabilities. Understandably so, as there are many different auth methods, including federated login. I tried it with sites built on Drupal, which use a regular form auth. If it doesn't work with your intranet, it is probably because the site hops across several pages to get authenticated. In that case, using PhantomJS seems to be the best bet with Norconex for now: I was able to crawl through both FORM and SAML auth, since PhantomJS is, after all, a headless browser. That said, it takes a fair bit of hacking, especially for SAML auth.
Thanks for the reply. Can you share your logic?
@Krishna210414 I am not sure if you have the same issue as @shubhamsamy does. If you're dealing with httpClientFactory, have you tried the following config?
<metadataFetcher class="$metaFetcher">
<validStatusCodes>200,302,403</validStatusCodes>
</metadataFetcher>
I tried that; it didn't work.
If you're trying to do Form Auth, you can configure:
<httpcollector id="My Collector">
<crawlers>
<crawler id="$crawler-id">
<startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
<url>$crawler-url</url>
</startURLs>
<metadataFetcher class="$metaFetcher">
<validStatusCodes>200,302,403</validStatusCodes>
</metadataFetcher>
<documentFetcher class="${http}.fetch.impl.PhantomJSDocumentFetcher"
detectContentType="true" detectCharset="true" screenshotEnabled="true">
<exePath>${run-path}</exePath>
<scriptPath>${script-path}</scriptPath>
<resourceTimeout>5000</resourceTimeout>
<validStatusCodes>200,302,403</validStatusCodes>
<notFoundStatusCodes>404</notFoundStatusCodes>
<referencePattern>^https://.*</referencePattern>
<renderWaitTime>3000</renderWaitTime>
<screenshotDimensions>600x400</screenshotDimensions>
<screenshotZoomFactor>0.25</screenshotZoomFactor>
<screenshotScaleDimensions>300</screenshotScaleDimensions>
<screenshotScaleStretch>false</screenshotScaleStretch>
<screenshotScaleQuality>medium</screenshotScaleQuality>
<screenshotImageFormat>png</screenshotImageFormat>
<screenshotStorage>disk</screenshotStorage>
<screenshotStorageDiskDir structure="url2path">${workdir}/screenshot</screenshotStorageDiskDir>
<screenshotStorageDiskField>dummy</screenshotStorageDiskField>
</documentFetcher>
......
</crawler>
</crawlers>
</httpcollector>
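For reference, here is a hypothetical sketch (plain JavaScript, not actual Norconex code) of how a fetcher like PhantomJSDocumentFetcher might assemble the nine positional arguments that the companion PhantomJS script below expects; the names and ordering mirror the script's own argument comments:

```javascript
// Hypothetical helper mirroring the nine positional arguments the
// phantomjs script below reads from system.args[1..9]. Illustrative only.
function buildPhantomArgs(opts) {
  return [
    opts.url,                             // 1: the URL to fetch
    opts.outfile,                         // 2: temp output file for page content
    String(opts.timeout),                 // 3: render wait time (ms)
    opts.bindId || "-1",                  // 4: HttpClient binding id ("-1" = none)
    opts.protocol,                        // 5: "http" or "https"
    opts.thumbnailFile || "",             // 6: optional screenshot path
    opts.dimension || "",                 // 7: e.g. "600x400"
    String(opts.zoomFactor || ""),        // 8: e.g. "0.25"
    String(opts.resourceTimeout || "-1")  // 9: per-resource timeout (ms)
  ];
}

const args = buildPhantomArgs({
  url: "https://example.com/page",
  outfile: "/tmp/out.html",
  timeout: 3000,
  protocol: "https",
  resourceTimeout: 5000
});
console.log(args.length); // the script rejects anything other than 9 args + script name
```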
/**
* This file is used and is required by the PhantomJSDocumentFetcher.
* Modifying this file could break PhantomJSDocumentFetcher behavior.
*/
var webPage = require('webpage');
var page;
var loginPage;
var fs = require('fs');
var system = require('system');
// Phantomjs global config
phantom.cookiesEnabled = true;
phantom.javascriptEnabled = true;
phantom.state = 'no-state';
//#############
// Local config
// ############
var loginAttempt = 0;
var userName = "username";
var userPass = "password";
var workDir = '/path/to/work/dir';
// Define session cookie file
// in order for PhantomJS to keep a session alive
// make it sure to be writable
var cookie = workDir + '/cookies/cookie.json';
var loginUrl = 'https://example.com/login'; // site login link where a login form presents
var logoutUrl = 'https://example.com/logout'; // site logout link
if (system.args.length !== 10) {
system.stderr.writeLine('Invalid number of arguments.');
phantom.exit(1);
}
var url = system.args[1]; // The URL to fetch
var outfile = system.args[2]; // The temp output file
var timeout = system.args[3]; // How long to wait for the whole page to render
var bindId = system.args[4]; // HttpClient binding id
var protocol = system.args[5]; // Was the original URL "https" or "http"?
var thumbnailFile = system.args[6]; // Optional path to image file
var dimension = system.args[7]; // e.g. 1024x768
var zoomFactor = system.args[8]; // e.g. 0.25 (25%)
var resourceTimeout = system.args[9]; // timeout for a single page resource
var addCookieInfo = function() {
Array.prototype.forEach.call(JSON.parse(fs.read(cookie)), function(param) {
phantom.addCookie(param);
});
};
var removeCookies = function() {
if (fs.exists(cookie)) {
fs.remove(cookie);
}
if (typeof loginPage === 'object') {
loginPage.close();
}
loginPage = webPage.create();
loginPage.open(logoutUrl, function(status) {
// Nothing to do on completion; requesting the logout URL ends the server session.
});
};
function runLogin() {
if (typeof loginPage === 'object') {
loginPage.close();
}
if (loginAttempt > 2) {
system.stderr.writeLine('Reached max login attempt.');
phantom.exit();
}
else {
loginAttempt++;
loginPage = webPage.create();
loginPage.open(loginUrl, function(status) {
if (status === "success") {
// system.stderr.writeLine('Form auth started.');
/**
* #############################################
* NOTE: Login Form
* Customize for UserID, Password, and Form fields
* Or rewrite to pass each objects to this function
* #############################################
*/
loginPage.evaluate(function(uname, upass) {
document.getElementById("username").value = uname;
document.getElementById("userpass").value = upass;
document.getElementById("loginform").submit();
//docForm = document.getElementsByTagName("form");
//docForm[0].submit();
}, userName, userPass);
loginPage.onLoadFinished = function(status) {
if (status === 'success') {
if (!phantom.state || phantom.state == 'no-state') {
phantom.state = 'no-session';
}
if (phantom.state === 'no-session') {
fs.write(cookie, JSON.stringify(phantom.cookies), "w");
phantom.state = 'run-state';
setTimeout(runPage, 500);
}
}
};
}
});
}
}
/**
* Set variables from the Norconex options passed in as arguments
*/
function setPage() {
page.onResourceError = function(resourceError) {
system.stderr.writeLine(resourceError.url + ': ' + resourceError.errorString);
};
if (thumbnailFile && dimension) {
var pageWidth = 1024;
var pageHeight = 768;
if (dimension) {
var size = dimension.split('x');
pageWidth = parseInt(size[0], 10) * zoomFactor;
pageHeight = parseInt(size[1], 10) * zoomFactor;
}
page.viewportSize = { width: pageWidth, height: pageHeight };
page.clipRect = { top: 0, left: 0, width: pageWidth, height: pageHeight };
}
if (thumbnailFile && zoomFactor) {
page.zoomFactor = zoomFactor;
}
if (bindId !== "-1") {
page.customHeaders = {
"collector.proxy.bindId": bindId,
"collector.proxy.protocol": protocol
};
}
if (resourceTimeout !== "-1") {
page.settings.resourceTimeout = resourceTimeout;
}
}
function runPage() {
if (typeof page === 'object') {
page.close();
}
page = webPage.create();
addCookieInfo();
setPage();
page.open(url, function(status) {
if (status !== 'success') {
system.stderr.writeLine('Unsuccessful loading of: ' + url + ' (status=' + status + ').');
system.stderr.writeLine('Content: ' + page.content);
if (page.content) {
fs.write(outfile, "error", 'w');
}
phantom.exit();
}
else {
if (phantom.state === 'run-state') {
window.setTimeout(function() {
if (thumbnailFile) {
page.render(thumbnailFile);
}
if (page.content) {
fs.write(outfile, page.content, 'w');
}
// page.render("test_page.png");
phantom.exit();
}, timeout);
}
}
});
page.onResourceReceived = function(response) {
if (response.stage == 'end'){
return;
}
if (response.url == url) {
if (response.status == 403) {
phantom.state = 'no-session';
}
else {
phantom.state = 'run-state';
response.headers.forEach(function(header){
system.stdout.writeLine('HEADER:' + header.name + '=' + header.value);
});
system.stdout.writeLine('STATUS:' + response.status);
system.stdout.writeLine('STATUSTEXT:' + response.statusText);
system.stdout.writeLine('CONTENTTYPE:' + response.contentType);
}
}
};
page.onLoadFinished = function(status) {
if (status === 'success') {
if (phantom.state == 'no-session') {
removeCookies();
setTimeout(runLogin, 500);
}
}
};
}
if (!fs.isFile(cookie)) {
runLogin();
}
else {
runPage();
}
Thanks for the logic and for showing how the JS gets invoked; now I can start putting in the logic for the redirection.
Hi, I am not able to crawl the redirected URL. I need to crawl a referenced URL on the page, which is being redirected to another site. I have attached a snippet from the log, which shows that the crawl state is REDIRECT and the status code is 302, as shown below: crawlState=REDIRECT, statusCode=302, reasonPhrase=Found
Please have a look and let us know what could be the reason for this.
Regards, Sam redirect log.txt