Closed dhildreth closed 7 years ago
Can you set the log level to DEBUG (or even TRACE) for Apache HttpClient? I think this should do it in log4j.properties:
log4j.logger.org.apache.http=TRACE
The objective is to get the details of the HTTP authentication attempts in the logs. You can then attach them here. It could be that the server expects a specific value in the HTTP request headers that browsers are sending but the crawler is not. If that's the case you can add those missing header values.
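One way to spot such a gap is to copy the request headers from the browser's dev tools and from the TRACE log, then compare the names. The sketch below is only an illustration: the function name `missing_headers` and all header values are hypothetical, not taken from the actual site or logs.

```python
def missing_headers(browser_headers, crawler_headers):
    """Return browser header names that the crawler request does not send.

    HTTP header names are case-insensitive, so names are compared in lowercase.
    """
    sent = {name.lower() for name in crawler_headers}
    return sorted(name for name in browser_headers if name.lower() not in sent)


# Hypothetical headers, for illustration only.
browser = {
    "Accept": "text/html",
    "Cookie": "PHPSESSID=abc",
    "Referer": "https://wiki.example.com/tiki-login.php",
    "User-Agent": "Mozilla/5.0",
}
crawler = {"user-agent": "example-crawler", "accept": "*/*"}

print(missing_headers(browser, crawler))
```

Any name this reports is a candidate to add to the crawler configuration, one at a time, to see which one the server actually requires.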
If that does not help, you can email me your site URL so I can try to reproduce (temporary credentials would be best, if possible).
Once again, your help is very much appreciated. I'm going to try to get some temporary credentials set up for you so you can attempt to reproduce. In the meantime, I enabled apache.http=TRACE logging. It didn't seem to add any additional information, though. Maybe it takes a trained eagle eye to see anything different? Anyway, I'm attaching the logs for both the form and basic authentication attempts.
Internal_32_CMS_32_Crawler.basic.log Internal_32_CMS_32_Crawler.form.log
Not to muddy up the water, but there is one interesting piece. Looking at the form authentication HTML output, the username and password are included in the "fullscreen" link as if they were GET URL parameters.
<a title="Fullscreen" href="/tiki-login.php?user=Joe+Schmoe&pass=Passw0rd&fullscreen=y"><img src="img/icons/application_get.png" alt="Fullscreen" width="16" height="16" class="icon" /></a>
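To make that observation concrete, the query string of the link can be decoded with Python's standard library. This is just an illustrative sketch: the href is copied from the HTML above, and nothing in it is specific to Tiki or the Collector.

```python
from urllib.parse import parse_qs, urlsplit

# href copied from the "Fullscreen" link in the returned login page.
href = "/tiki-login.php?user=Joe+Schmoe&pass=Passw0rd&fullscreen=y"

# parse_qs decodes "+" as a space and maps each parameter name to a list of values.
params = parse_qs(urlsplit(href).query)

for name, values in params.items():
    print(name, "=", values[0])
```

The decoded `user` and `pass` values are the submitted credentials, echoed back by the server in the page it returned.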
I also noticed somewhere along the line (in Chrome dev tools probably) that there were a couple headers being sent, so I added them to my config file. Didn't seem to make any difference.
<headers>
<header name="stay_in_ssl_mode_present">y</header>
<header name="stay_in_ssl_mode">y</header>
<header name="login">Log in</header>
</headers>
While I am not sure why form-based authentication cannot be replicated with the Collector, I found out that "preemptive" authentication works when using "basic" authentication. So the latest snapshot now supports a new configuration option on the GenericHttpClientFactory:
<authPreemptive>true</authPreemptive>
Please confirm that does it for you as well.
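For context on what "preemptive" means here: with standard (non-preemptive) basic authentication, a client first sends the request without credentials and only adds them after the server answers with a 401 challenge. If the server redirects to a login page instead of sending that challenge, the credentials are never sent. Preemptive authentication puts the Authorization header on the very first request. The sketch below shows only the header logic, in plain Python rather than the Collector's own code; the function names and the User-Agent value are hypothetical.

```python
import base64


def basic_auth_header(username, password):
    """Build an HTTP Basic Authorization header value (RFC 7617)."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def first_request_headers(username, password, preemptive):
    """Headers for the first request to a protected URL.

    With preemptive=True the credentials go out immediately; otherwise
    the client would wait for a 401 challenge before sending them.
    """
    headers = {"User-Agent": "example-crawler"}  # hypothetical value
    if preemptive:
        headers["Authorization"] = basic_auth_header(username, password)
    return headers


# Credentials from the RFC 7617 example, not from the site in this issue.
print(first_request_headers("Aladdin", "open sesame", preemptive=True))
print(first_request_headers("Aladdin", "open sesame", preemptive=False))
```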
You're amazing! That worked fine, and I'm okay with basic authentication. :-)
Closing the issue. Thanks again!
I'm attempting to crawl a password-protected wiki that we use for internal documentation, and I'm struggling to get authentication to work. I've tried form authentication as well as basic. The wiki supports both ways of logging in, but I can't seem to get the collector to behave. I am, however, able to log in using both methods in Postman, and I'm able to log in with basic authentication using curl. I'm hoping you can help me through this one.
Attempt using Form:
Here's the configuration I'm attempting to use:
And the start URLs, if it matters:
Here's the login form HTML on the page tiki-login.php:
And here's what I see in the logs:
The response is the HTML of the same login page specified in <authURL>. Continuing with the output:
What's interesting about this is that the Customer-Directory-Top page is the default page when you're not logged in. It's what you'll be redirected to as a user if you're not logged in.
Attempt using Basic:
Here's the configuration I'm attempting to use for basic authentication:
All other configurations are the same, including the startURL. When I run it, this is the output:
Again, the interesting point for me is that it tries to get the start URL page but gets redirected to the same Customer-Directory-Top page, which is where you land if you're not logged in.
I'd like to point out again that I can get both of these methods to work using Postman. I can get basic to work using curl:
Outputs:
The website is publicly accessible, but I'd rather not share here. If you'd like to attempt to reproduce yourself, I'm happy to supply the URL over email or private message somehow. Please let me know.
Thanks in advance for reviewing. I sure hope you have some good ideas as to what might be happening. I'm using 2.8.0-SNAPSHOT.