Basic Authentication Not Logging In

dhildreth commented 7 years ago

I'm attempting to crawl a password-protected wiki that we use for internal documentation, and I'm struggling to get authentication to work. I've tried both form and basic authentication. The wiki supports both, but I can't get the collector to log in with either. I am, however, able to log in using both methods in Postman, and I can log in with basic authentication using curl. I'm hoping you can help me through this one.

Attempt using Form:

Here's the configuration I'm attempting to use:

<httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
  <authMethod>form</authMethod>
  <authUsername>Joe Schmoe</authUsername>
  <authPassword>Passw0rd</authPassword>
  <authUsernameField>user</authUsernameField>
  <authPasswordField>pass</authPasswordField>
  <authURL>https://wiki.mydomain.com/tiki-login.php</authURL>
</httpClientFactory>

And the start URLs, if it matters:

<startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
  <url>https://wiki.mydomain.com/tiki-index.php</url>
</startURLs>

Here's the login form HTML on the page tiki-login.php:

<form name="loginbox" id="loginbox-1" action="tiki-login.php" method="post">
    <fieldset>
        <div class="user">
            <label for="login-user_1">Username:</label>
            <input type="text" name="user" id="login-user_1" size="15">
        </div>
        <div class="pass">
            <label for="login-pass_1">Password:</label>
        </div>
        <div style="text-align: center">
            <input class="button submit" type="submit" name="login" value="Log in">
        </div>
        <input type="hidden" name="stay_in_ssl_mode_present" value="y">
        <input type="hidden" name="stay_in_ssl_mode" value="y">
    </fieldset>
</form>

And here's what I see in the logs:

DEBUG [CollectorConfigLoader] Loading configuration file: cms-config.xml
Nov 02, 2017 2:39:13 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: JBIG2ImageReader not loaded. jbig2 files will be ignored
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

DEBUG [XMLConfigurationUtil] Class to validate: HttpCrawlerConfig
DEBUG [XMLConfigurationUtil] Class to validate: ExtensionReferenceFilter
INFO  [AbstractCrawlerConfig] Reference filter loaded: ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=zip,gif,jpg,jpeg,png,caseSensitive=false]
DEBUG [XMLConfigurationUtil] Class to validate: RegexReferenceFilter
INFO  [AbstractCrawlerConfig] Reference filter loaded: RegexReferenceFilter[onMatch=INCLUDE,caseSensitive=false,regex=.*wiki.mydomain.com.*]
DEBUG [XMLConfigurationUtil] Class to validate: ImporterConfig
DEBUG [XMLConfigurationUtil] Class to validate: FileSystemCommitter
DEBUG [XMLConfigurationUtil] Class to validate: GenericURLNormalizer
DEBUG [XMLConfigurationUtil] A configuration entry was found without class reference where one could have been provided; using default value:GenericDelayResolver[defaultDelay=3000,ignoreRobotsCrawlDelay=false,scope=crawler,schedules=[]]
DEBUG [XMLConfigurationUtil] Class to validate: GenericDelayResolver
DEBUG [XMLConfigurationUtil] Class to validate: GenericHttpClientFactory
DEBUG [XMLConfigurationUtil] A configuration entry was found without class reference where one could have been provided; using default value:StandardRobotsTxtProvider[]
DEBUG [XMLConfigurationUtil] A configuration entry was found without class reference where one could have been provided; using default value:StandardSitemapResolverFactory[sitemapPaths={/sitemap.xml,/sitemap_index.xml},lenient=false,tempDir=<null>]
DEBUG [XMLConfigurationUtil] Class to validate: StandardSitemapResolverFactory
DEBUG [XMLConfigurationUtil] Class to validate: GenericCanonicalLinkDetector
DEBUG [XMLConfigurationUtil] Class to validate: GenericLinkExtractor
INFO  [HttpCrawlerConfig] Link extractor loaded: GenericLinkExtractor[contentTypes={text/html,application/xhtml+xml,vnd.wap.xhtml+xml,x-asp},schemes={http,https,ftp},maxURLLength=2048,ignoreNofollow=false,commentsEnabled=false,tagAttribs=ObservableMap [map={a=[href]}],charset=<null>,extractBetweens=[],noExtractBetweens=[]]
DEBUG [CrawlerConfigLoader] Crawler configuration loaded: Internal CMS Crawler
INFO  [AbstractCollectorConfig] Configuration loaded: id=Internal CMS Collector; logsDir=./cms-output/logs; progressDir=./cms-output/progress
INFO  [JobSuite] JEF work directory is: ./cms-output/progress
INFO  [JobSuite] JEF log manager is : FileLogManager
INFO  [JobSuite] JEF job status store is : FileJobStatusStore
INFO  [AbstractCollector] Suite of 1 crawler jobs created.
INFO  [JobSuite] Initialization...
DEBUG [XMLConfigurationUtil] Class to validate: FileLogManager
DEBUG [XMLConfigurationUtil] Class to validate: FileJobStatusStore
DEBUG [FileJobStatusStore] Status serialization directory: /var/norconex-collector-http-2.8.0-SNAPSHOT/./cms-output/progress
DEBUG [FileJobStatusStore] Reading status file: /var/norconex-collector-http-2.8.0-SNAPSHOT/./cms-output/progress/latest/status/Internal_32_CMS_32_Crawler__Internal_32_CMS_32_Crawler.job
DEBUG [FileJobStatusStore] Internal CMS Crawler last active time: Thu Nov 02 14:32:31 MST 2017
INFO  [JobSuite] Previous execution detected.
INFO  [JobSuite] Backing up previous execution status and log files.
DEBUG [FileJobStatusStore] Status serialization directory: /var/norconex-collector-http-2.8.0-SNAPSHOT/./cms-output/progress
DEBUG [FileLogManager] Log directory: /var/norconex-collector-http-2.8.0-SNAPSHOT/./cms-output/logs
INFO  [JobSuite] Starting execution.
INFO  [AbstractCollector] Version: Norconex HTTP Collector 2.8.0-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Collector Core 1.9.0-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Importer 2.8.0-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex JEF 4.1.0 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Committer Core 2.1.2-SNAPSHOT (Norconex Inc.)
INFO  [JobSuite] Running Internal CMS Crawler: BEGIN (Thu Nov 02 14:39:14 MST 2017)
INFO  [HttpCrawler] Internal CMS Crawler: RobotsTxt support: false
INFO  [HttpCrawler] Internal CMS Crawler: RobotsMeta support: true
INFO  [HttpCrawler] Internal CMS Crawler: Sitemap support: true
INFO  [HttpCrawler] Internal CMS Crawler: Canonical links support: false
INFO  [HttpCrawler] Internal CMS Crawler: User-Agent: norconex-cms-crawler
INFO  [GenericHttpClientFactory] Performing FORM authentication at "https://wiki.mydomain.com/tiki-login.php" (username=Joe Schmoe; password=*****)
INFO  [GenericHttpClientFactory] Authentication status: HTTP/1.1 200 OK
DEBUG [GenericHttpClientFactory] Authentication response:

The response is the HTML of the same login page specified in <authURL>. Continuing with the output:

...
DEBUG [GenericDocumentFetcher] Fetching document: https://wiki.mydomain.com/tiki-index.php
DEBUG [GenericDocumentFetcher] Encoded URI: https://wiki.mydomain.com/tiki-index.php
DEBUG [GenericRedirectURLProvider] URL redirect: https://wiki.mydomain.com/tiki-index.php -> https://wiki.mydomain.com/Customer-Directory-Top
DEBUG [GenericDocumentFetcher] Unsupported HTTP Response: HTTP/1.1 302 Found
INFO  [CrawlerEventManager]       REJECTED_REDIRECTED: https://wiki.mydomain.com/tiki-index.php (HttpFetchResponse [crawlState=REDIRECT, statusCode=302, reasonPhrase=Found (https://wiki.mydomain.com/Customer-Directory-Top)])
DEBUG [ReferenceFiltersStageUtil] ACCEPTED document reference. Reference=https://wiki.mydomain.com/Customer-Directory-Top Filter=ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=zip,gif,jpg,jpeg,png,caseSensitive=false]
DEBUG [QueueReferenceStage] Queued for processing: https://wiki.mydomain.com/Customer-Directory-Top
...

What's interesting about this is that the Customer-Directory-Top page is the default page when you're not logged in. It's where you'll be redirected as a user if you're not logged in.

Attempt using Basic:

Here's the configuration I'm attempting to use for basic authentication:

      <httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
          <authMethod>basic</authMethod>
          <authUsername>Joe Schmoe</authUsername>
          <authPassword>Passw0rd</authPassword>
          <!--<authHostname></authHostname>-->
          <!--<authPort></authPort>-->
          <!--<authRealm></authRealm>-->
      </httpClientFactory>

All other configurations are the same, including the startURL. When I run it, this is the output:

DEBUG [CollectorConfigLoader] Loading configuration file: cms-config.xml
Nov 02, 2017 2:52:04 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: JBIG2ImageReader not loaded. jbig2 files will be ignored
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

DEBUG [XMLConfigurationUtil] Class to validate: HttpCrawlerConfig
DEBUG [XMLConfigurationUtil] Class to validate: ExtensionReferenceFilter
INFO  [AbstractCrawlerConfig] Reference filter loaded: ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=zip,gif,jpg,jpeg,png,caseSensitive=false]
DEBUG [XMLConfigurationUtil] Class to validate: RegexReferenceFilter
INFO  [AbstractCrawlerConfig] Reference filter loaded: RegexReferenceFilter[onMatch=INCLUDE,caseSensitive=false,regex=.*wiki.mydomain.*]
DEBUG [XMLConfigurationUtil] Class to validate: ImporterConfig
DEBUG [XMLConfigurationUtil] Class to validate: FileSystemCommitter
DEBUG [XMLConfigurationUtil] Class to validate: GenericURLNormalizer
DEBUG [XMLConfigurationUtil] A configuration entry was found without class reference where one could have been provided; using default value:GenericDelayResolver[defaultDelay=3000,ignoreRobotsCrawlDelay=false,scope=crawler,schedules=[]]
DEBUG [XMLConfigurationUtil] Class to validate: GenericDelayResolver
DEBUG [XMLConfigurationUtil] Class to validate: GenericHttpClientFactory
DEBUG [XMLConfigurationUtil] A configuration entry was found without class reference where one could have been provided; using default value:StandardRobotsTxtProvider[]
DEBUG [XMLConfigurationUtil] A configuration entry was found without class reference where one could have been provided; using default value:StandardSitemapResolverFactory[sitemapPaths={/sitemap.xml,/sitemap_index.xml},lenient=false,tempDir=<null>]
DEBUG [XMLConfigurationUtil] Class to validate: StandardSitemapResolverFactory
DEBUG [XMLConfigurationUtil] Class to validate: GenericCanonicalLinkDetector
DEBUG [XMLConfigurationUtil] Class to validate: GenericLinkExtractor
INFO  [HttpCrawlerConfig] Link extractor loaded: GenericLinkExtractor[contentTypes={text/html,application/xhtml+xml,vnd.wap.xhtml+xml,x-asp},schemes={http,https,ftp},maxURLLength=2048,ignoreNofollow=false,commentsEnabled=false,tagAttribs=ObservableMap [map={a=[href]}],charset=<null>,extractBetweens=[],noExtractBetweens=[]]
DEBUG [CrawlerConfigLoader] Crawler configuration loaded: Internal CMS Crawler
INFO  [AbstractCollectorConfig] Configuration loaded: id=Internal CMS Collector; logsDir=./cms-output/logs; progressDir=./cms-output/progress
INFO  [JobSuite] JEF work directory is: ./cms-output/progress
INFO  [JobSuite] JEF log manager is : FileLogManager
INFO  [JobSuite] JEF job status store is : FileJobStatusStore
INFO  [AbstractCollector] Suite of 1 crawler jobs created.
INFO  [JobSuite] Initialization...
DEBUG [XMLConfigurationUtil] Class to validate: FileLogManager
DEBUG [XMLConfigurationUtil] Class to validate: FileJobStatusStore
DEBUG [FileJobStatusStore] Status serialization directory: /var/norconex-collector-http-2.8.0-SNAPSHOT/./cms-output/progress
DEBUG [FileJobStatusStore] Reading status file: /var/norconex-collector-http-2.8.0-SNAPSHOT/./cms-output/progress/latest/status/Internal_32_CMS_32_Crawler__Internal_32_CMS_32_Crawler.job
DEBUG [FileJobStatusStore] Internal CMS Crawler last active time: Thu Nov 02 14:39:17 MST 2017
INFO  [JobSuite] Previous execution detected.
INFO  [JobSuite] Backing up previous execution status and log files.
DEBUG [FileJobStatusStore] Status serialization directory: /var/norconex-collector-http-2.8.0-SNAPSHOT/./cms-output/progress
DEBUG [FileLogManager] Log directory: /var/norconex-collector-http-2.8.0-SNAPSHOT/./cms-output/logs
INFO  [JobSuite] Starting execution.
INFO  [AbstractCollector] Version: Norconex HTTP Collector 2.8.0-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Collector Core 1.9.0-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Importer 2.8.0-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex JEF 4.1.0 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Committer Core 2.1.2-SNAPSHOT (Norconex Inc.)
INFO  [JobSuite] Running Internal CMS Crawler: BEGIN (Thu Nov 02 14:52:05 MST 2017)
INFO  [HttpCrawler] Internal CMS Crawler: RobotsTxt support: false
INFO  [HttpCrawler] Internal CMS Crawler: RobotsMeta support: true
INFO  [HttpCrawler] Internal CMS Crawler: Sitemap support: true
INFO  [HttpCrawler] Internal CMS Crawler: Canonical links support: false
INFO  [HttpCrawler] Internal CMS Crawler: User-Agent: norconex-cms-crawler
INFO  [SitemapStore] Internal CMS Crawler: Initializing sitemap store...
DEBUG [SitemapStore] Internal CMS Crawler: Cleaning sitemap store...
INFO  [SitemapStore] Internal CMS Crawler: Done initializing sitemap store.
DEBUG [ReferenceFiltersStageUtil] ACCEPTED document reference. Reference=https://wiki.mydomain.com/tiki-index.php Filter=ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=zip,gif,jpg,jpeg,png,caseSensitive=false]
DEBUG [StandardSitemapResolver] Sitemap locations: [https://wiki.mydomain.com/sitemap_index.xml, https://wiki.mydomain.com/sitemap.xml]
DEBUG [StandardSitemapResolver] Sitemap not found : https://wiki.mydomain.com/sitemap_index.xml
DEBUG [StandardSitemapResolver] Sitemap not found : https://wiki.mydomain.com/sitemap.xml
DEBUG [QueueReferenceStage] Queued for processing: https://wiki.mydomain.com/tiki-index.php
INFO  [HttpCrawler] 1 start URLs identified.
INFO  [CrawlerEventManager]           CRAWLER_STARTED
INFO  [AbstractCrawler] Internal CMS Crawler: Crawling references...
DEBUG [AbstractCrawler] Internal CMS Crawler: Crawler thread #1 started.
DEBUG [AbstractCrawler] Internal CMS Crawler: Crawler thread #2 started.
DEBUG [AbstractCrawler] Internal CMS Crawler: Crawler thread #3 started.
DEBUG [AbstractCrawler] Internal CMS Crawler: Crawler thread #4 started.
DEBUG [AbstractCrawler] Internal CMS Crawler: Processing reference: https://wiki.mydomain.com/tiki-index.php
DEBUG [AbstractCrawler] Internal CMS Crawler: Crawler thread #5 started.
DEBUG [AbstractCrawler] Internal CMS Crawler: Crawler thread #6 started.
DEBUG [AbstractCrawler] Internal CMS Crawler: Crawler thread #7 started.
DEBUG [AbstractCrawler] Internal CMS Crawler: Crawler thread #8 started.
DEBUG [GenericDocumentFetcher] Fetching document: https://wiki.mydomain.com/tiki-index.php
DEBUG [GenericDocumentFetcher] Encoded URI: https://wiki.mydomain.com/tiki-index.php
DEBUG [GenericRedirectURLProvider] URL redirect: https://wiki.mydomain.com/tiki-index.php -> https://wiki.mydomain.com/Customer-Directory-Top
DEBUG [GenericDocumentFetcher] Unsupported HTTP Response: HTTP/1.1 302 Found
INFO  [CrawlerEventManager]       REJECTED_REDIRECTED: https://wiki.mydomain.com/tiki-index.php (HttpFetchResponse [crawlState=REDIRECT, statusCode=302, reasonPhrase=Found (https://wiki.mydomain.com/Customer-Directory-Top)])
DEBUG [ReferenceFiltersStageUtil] ACCEPTED document reference. Reference=https://wiki.mydomain.com/Customer-Directory-Top Filter=ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=zip,gif,jpg,jpeg,png,caseSensitive=false]
DEBUG [QueueReferenceStage] Queued for processing: https://wiki.mydomain.com/Customer-Directory-Top
DEBUG [AbstractCrawler] Internal CMS Crawler: Processing reference: https://wiki.mydomain.com/Customer-Directory-Top
DEBUG [AbstractDelay] Thread pool-1-thread-3 sleeping for 0.332 seconds.
DEBUG [Pipeline] Pipeline execution stopped at stage: com.norconex.collector.http.pipeline.importer.DocumentFetcherStage@43f1762e
DEBUG [FileJobStatusStore] Writing status file: /var/norconex-collector-http-2.8.0-SNAPSHOT/./cms-output/progress/latest/status/Internal_32_CMS_32_Crawler__Internal_32_CMS_32_Crawler.job
DEBUG [FileJobStatusStore] Writing status file: /var/norconex-collector-http-2.8.0-SNAPSHOT/./cms-output/progress/latest/status/Internal_32_CMS_32_Crawler__Internal_32_CMS_32_Crawler.job
DEBUG [AbstractCrawler] Internal CMS Crawler: 00:00:00.230 to process: https://wiki.mydomain.com/tiki-index.php
DEBUG [GenericDocumentFetcher] Fetching document: https://wiki.mydomain.com/Customer-Directory-Top
DEBUG [GenericDocumentFetcher] Encoded URI: https://wiki.mydomain.com/Customer-Directory-Top
INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: https://wiki.mydomain.com/Customer-Directory-Top
...

Again, the interesting point for me is that it tries to get the start URL page, but gets redirected to the same Customer-Directory-Top page, which is where you land if you're not logged in.

I'd like to point out again that I can get both of these methods to work using Postman, and I can get basic to work using curl:

curl -v -u "Joe Schmoe":Passw0rd https://wiki.mydomain.com/tiki-listpages.php

Outputs:

 % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 174.79.58.232...
* Connected to wiki.mydomain.com (174.79.58.232) port 443 (#0)
* found 173 certificates in /etc/ssl/certs/ca-certificates.crt
* found 696 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
*        server certificate verification OK
*        server certificate status verification SKIPPED
*        common name: wiki.mydomain.com (matched)
*        server certificate expiration date OK
*        server certificate activation date OK
*        certificate public key: RSA
*        certificate version: #3
*        subject: CN=wiki.mydomain.com
*        start date: Tue, 12 Sep 2017 16:03:00 GMT
*        expire date: Mon, 11 Dec 2017 16:03:00 GMT
*        issuer: C=US,O=Let's Encrypt,CN=Let's Encrypt Authority X3
*        compression: NULL
* ALPN, server did not agree to a protocol
* Server auth using Basic with user 'Joe Schmoe'
> GET /tiki-listpages.php HTTP/1.1
> Host: wiki.mydomain.com
> Authorization: Basic TmlnaHQgQ3Jhd2xlcjpHMTItdTVNXzEzN1F0
> User-Agent: curl/7.47.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Date: Thu, 02 Nov 2017 21:58:14 GMT
< Server: Apache/2.4.7 (Ubuntu)
< Vary: Authorization,Accept-Encoding
< X-Powered-By: PHP/5.5.9-1ubuntu4.22
< Set-Cookie: PHPSESSID=pelkp8ogeabi1l6n25ahhot5m5; path=/
< Expires: Thu, 19 Nov 1981 08:52:00 GMT
< Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
< Pragma: no-cache
< Set-Cookie: PHPSESSIDCV=aTuo5uvagMykg2Fc15U7eA%3D%3D; expires=Fri, 02-Nov-2018 21:58:14 GMT; Max-Age=31536000; path=/
< Transfer-Encoding: chunked
< Content-Type: text/html; charset=utf-8
< 
{ [7 bytes data]
<!DOCTYPE html>
... [HTML for tiki-listpages.php] ...

The website is publicly accessible, but I'd rather not share here. If you'd like to attempt to reproduce yourself, I'm happy to supply the URL over email or private message somehow. Please let me know.

Thanks in advance for reviewing. I sure hope you have some good ideas as to what might be happening. I'm using 2.8.0-SNAPSHOT.

essiembre commented 7 years ago

Can you set the log level to DEBUG (or even TRACE) for Apache HttpClient? I think this should do it in log4j.properties:

log4j.logger.org.apache.http=TRACE

The objective is to get the details of the HTTP authentication attempts in the logs. You can then attach them here. It could be that the server expects a specific value in the HTTP request headers that browsers are sending but the crawler is not. If that's the case you can add those missing header values.

If that does not help, you can email me your site URL so I can try to reproduce (with temporary credentials would be best, if possible).

dhildreth commented 7 years ago

Once again, your help is very much appreciated. I'm going to try to get some temporary credentials set up for you so you can attempt to reproduce. In the meantime, I enabled apache.http=TRACE logging. It didn't seem to add any additional information, though. Maybe it takes a trained eagle eye to see anything different? Anyway, I'm attaching the logs for both form and basic authentication attempts.

Internal_32_CMS_32_Crawler.basic.log Internal_32_CMS_32_Crawler.form.log

Not to muddy up the water, but there is one interesting piece. Looking at the form authentication HTML output, the username and password are included in the "fullscreen" link as if they were GET URL parameters.

<a title="Fullscreen" href="/tiki-login.php?user=Joe+Schmoe&amp;pass=Passw0rd&amp;fullscreen=y"><img src="img/icons/application_get.png" alt="Fullscreen" width="16" height="16" class="icon" /></a>

I also noticed somewhere along the line (in Chrome dev tools, probably) that there were a couple of headers being sent, so I added them to my config file. It didn't seem to make any difference.

<headers>
  <header name="stay_in_ssl_mode_present">y</header>
  <header name="stay_in_ssl_mode">y</header>
  <header name="login">Log in</header>
</headers>
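
Worth noting: these three names match fields of the login form shown earlier (the two hidden inputs and the submit button), so a browser would normally send them in the POST body rather than as HTTP headers. Below is a hedged sketch of passing them as form parameters instead. It assumes the snapshot in use supports an `authFormParams` element on GenericHttpClientFactory, which should be checked against the Javadoc for your version. This was not tested in this thread, so it may not be enough for the wiki to accept the login.

```xml
<httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
  <authMethod>form</authMethod>
  <authUsername>Joe Schmoe</authUsername>
  <authPassword>Passw0rd</authPassword>
  <authUsernameField>user</authUsernameField>
  <authPasswordField>pass</authPasswordField>
  <authURL>https://wiki.mydomain.com/tiki-login.php</authURL>
  <!-- Assumption: authFormParams is available in this build; verify before use -->
  <authFormParams>
    <param name="login">Log in</param>
    <param name="stay_in_ssl_mode_present">y</param>
    <param name="stay_in_ssl_mode">y</param>
  </authFormParams>
</httpClientFactory>
```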

essiembre commented 7 years ago

While I am not sure why form-based authentication cannot be replicated with the Collector, I found out "preemptive" authentication works when using "basic" authentication. So now the latest snapshot supports a new configuration option on the GenericHttpClientFactory:

<authPreemptive>true</authPreemptive>

Please confirm that does it for you as well.
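
For reference, here is a sketch of the basic configuration from the first post with the new option added. A plausible reason it helps, though not confirmed in this thread: by default Apache HttpClient sends credentials only after the server answers with a 401 challenge, whereas the logs above show the wiki answering anonymous requests with a 302 redirect, so the credentials may never have been sent. With preemptive authentication the Authorization header goes out on the first request, which is also what the curl trace above shows for `-u`.

```xml
<httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
  <authMethod>basic</authMethod>
  <authUsername>Joe Schmoe</authUsername>
  <authPassword>Passw0rd</authPassword>
  <!-- Send credentials on the first request instead of waiting for a 401 challenge -->
  <authPreemptive>true</authPreemptive>
</httpClientFactory>
```

Since preemptive mode sends the credentials without being asked, it is best kept to crawls that stay on the intended HTTPS host, as the startURLs settings in the first post already do.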

dhildreth commented 7 years ago

You're amazing! That worked fine, and I'm okay with basic authentication. :-)

Closing the issue. Thanks again!

Norconex / crawlers

Basic Authentication Not Logging In #420
