Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Documentation for Auth - OKTA SSO #421

Open jacksonp2008 opened 6 years ago

jacksonp2008 commented 6 years ago

Testing this product, so far so good. I will have a number of sites that use OKTA SSO which I will need to crawl. Any pointers on how to do this?

essiembre commented 6 years ago

I am not too familiar with OKTA SSO, but I can see they support a few authentication methods. Maybe one is supported by the GenericHttpClientFactory. If not, you can extend this class to provide your custom authentication mechanism using OAuth 2.0 and carrying your security token around. You can have a look at https://www.norconex.com/how-to-crawl-facebook/ which does something similar.

If the authentication for your sites is standard/generic enough and you can share credentials for a sample site, maybe we'll be able to add built-in support for another authentication scheme.

jacksonp2008 commented 6 years ago

Thanks. OKTA is not social auth, rather an application for enterprises to provide single sign-on via SAML for web applications. https://developer.okta.com/use_cases/integrate_with_okta/sso-with-saml

It may be possible to send username/password, and 2FA. I'll do some research on this on the OKTA side.

Can you please point me to how I would extend GenericHttpClientFactory for this? Generally, how would I send a simple username/password to a site and then crawl it?

essiembre commented 6 years ago

I did not mean to suggest OKTA was a social auth. The link to the blog post was to point you to an example extending the Collector. It includes a class extending GenericDocumentFetcher that shows one way to pass a token with every URL request. On second thought, that may be the best class to override, since it is where the HTTP requests actually happen.

Extending GenericHttpClientFactory could be good if you know there is a way to add "default" SAML authentication (or other supported auth) on Apache HttpClient.

It seems like OKTA provides a Java API which should make your life easier: https://developer.okta.com/code/java/index

If you have a way to provide me with a protected URL with a temporary test account (sent by email), we could make adding SAML support a feature request if you like.

jacksonp2008 commented 6 years ago

Very interested and thank-you! -- I need to do some work on my end.

OKTA does offer a free test account, might help. I am totally open to getting you access to test.

I've got some basic auth sites to figure out first... tried one below, inside <httpcollector... I have:

  <httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
      <authUsername>super secret account</authUsername>
      <authPassword>super secret password</authPassword>
      <authUsernameField>super secret account</authUsernameField>
      <authPasswordField>super secret password</authPasswordField>
      <authMethod>form</authMethod>
  </httpClientFactory>

Creds are good, but it returns:

INFO  [CrawlerEventManager]       REJECTED_BAD_STATUS: https://updates.forescout.com/ (HttpFetchResponse [crawlState=BAD_STATUS, statusCode=401, reasonPhrase=Authorization Required])

I must be missing something. I tried <authMethod> set to form, basic, and digest.

no love, but very close.

essiembre commented 6 years ago

Can you share test credentials for that one too? From a very quick look at the site, it seems to be using "basic" (or maybe "digest"). I do not know if that's related, but there is a new flag that fixes basic auth issues for some people, described here: https://github.com/Norconex/collector-http/issues/420#issuecomment-342708772

Here is the flag:

<authPreemptive>true</authPreemptive>

Give it a try and let me know.

essiembre commented 6 years ago

Marking as a feature-request to support SAML.

jacksonp2008 commented 6 years ago

Just getting back to this, thank you Pascal. I did add <authPreemptive> but it didn't fix it. I'm not sure how to debug further. Is there a way to enable verbose logging so I can see what's going on behind the scenes? If I can't figure it out, I may take you up on your kind offer to log in.

jacksonp2008 commented 6 years ago

Just FYI, I set up Postman and did a "basic" auth request with the credentials and it works fine, but with the crawler it fails.

Here is my config:

<httpcollector id="Minimum Config HTTP Collector">
  <httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
      <authUsername>xxxxx</authUsername>
      <authPassword>xxxxxxx</authPassword>
      <authUsernameField>xxxxx</authUsernameField>
      <authPasswordField>xxxxxx</authPasswordField>
      <authPreemptive>true</authPreemptive>
      <trustAllSSLCertificates>true</trustAllSSLCertificates>
      <authMethod>basic</authMethod>
      <authURL>https://updates.forescout.com</authURL>
  </httpClientFactory>

and log:


INFO  [SitemapStore] Anonymous Coward: Initializing sitemap store...
INFO  [SitemapStore] Anonymous Coward: Done initializing sitemap store.
log4j:WARN No appenders could be found for logger (org.apache.http.client.protocol.ResponseProcessCookies).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
INFO  [StandardSitemapResolver] Resolving sitemap: https://updates.forescout.com/sitemap.xml
INFO  [StandardSitemapResolver]          Resolved: https://updates.forescout.com/sitemap.xml
INFO  [HttpCrawler] 1 start URLs identified.
INFO  [CrawlerEventManager]           CRAWLER_STARTED
INFO  [AbstractCrawler] Anonymous Coward: Crawling references...
INFO  [CrawlerEventManager]       REJECTED_BAD_STATUS: https://updates.forescout.com/ (HttpFetchResponse [crawlState=BAD_STATUS, statusCode=401, reasonPhrase=Authorization Required])

essiembre commented 6 years ago

To get maximum verbosity, set the following logger to TRACE in log4j.properties:

log4j.logger.org.apache.http=TRACE

If you do not mind sending me temporary credentials via email, I will look into it when I get a chance.

essiembre commented 6 years ago

Having a second look at your log, I see it rejects the authentication instead of attempting it. Odd... maybe as a test you can try to set the following in your document fetcher:

<validStatusCodes>200,401</validStatusCodes>

Let me know if that makes a difference.
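
For readers following along, the test suggested above might look like the fragment below. This is only a sketch under assumptions: the crawler id is hypothetical, and the GenericDocumentFetcher package name is assumed from the 2.x HTTP Collector rather than taken from this thread. Accepting 401 as a valid status is meant as a diagnostic, not a permanent setting.

```xml
<!-- Sketch only: "Example Crawler" is a hypothetical id. -->
<crawler id="Example Crawler">
  <!-- Class name assumed from the 2.x HTTP Collector package layout. -->
  <documentFetcher class="com.norconex.collector.http.fetch.impl.GenericDocumentFetcher">
    <!-- Diagnostic: treat 401 as valid to see whether authentication is attempted at all. -->
    <validStatusCodes>200,401</validStatusCodes>
  </documentFetcher>
</crawler>
```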

jacksonp2008 commented 6 years ago

Sorry, that didn't do it, although I do have a clue.

If I use curl:

curl -u user:pas$word https://updates.forescout.com

it fails with the 401 even though the creds are good.

Add \ in front of $ and it works:

curl -u user:pas\$word https://updates.forescout.com

Is it possible that we are seeing an issue with special characters in the password?

I did try several ways, none worked out:

'pas$word' "pas$word" pas\$word

thanks

essiembre commented 6 years ago

Interesting... a possibility for sure. Are you storing your password in the Collector XML config? If so, did you try escaping the $ with a backslash in the config? It may be interpreted as a variable. If that's the case and it works with escaping, you have other options too. You can define the password in a variables/properties file and reference it as a variable in the config, or you can encrypt the password (see online documentation).

If that does not make a difference, it may need to be escaped by the Collector somehow before sending it to your server and will require more investigation.

As a test, does it work if you change the password to one without $ in it? Just to 100% confirm the issue is with the $.

essiembre commented 6 years ago

Good news: it is working as expected and I was able to make it work without anything special. It turns out you had put <httpClientFactory> directly under <httpcollector>, whereas it belongs under your <crawler> section (as per the documentation). Moving it there did it.
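
A minimal sketch of that placement, reusing the element names from the config posted earlier in this thread. The crawler id and the <startURLs> block are illustrative assumptions, not the poster's actual config. The <authUsernameField> and <authPasswordField> elements are left out on the assumption that they only apply to form-based authentication.

```xml
<httpcollector id="Minimum Config HTTP Collector">
  <crawlers>
    <!-- "Example Crawler" is a hypothetical id. -->
    <crawler id="Example Crawler">
      <startURLs>
        <url>https://updates.forescout.com/</url>
      </startURLs>
      <!-- Authentication settings belong to the crawler, not to the collector. -->
      <httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
        <authMethod>basic</authMethod>
        <authUsername>xxxxx</authUsername>
        <authPassword>xxxxxxx</authPassword>
        <authPreemptive>true</authPreemptive>
      </httpClientFactory>
    </crawler>
  </crawlers>
</httpcollector>
```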

Please have a try and confirm.

jacksonp2008 commented 6 years ago

Awesome news, it ran here as well, thanks for all your help on this Pascal!

kristiWabion commented 5 years ago

Hello, any update on using SSO Auth with Norconex?

jacksonp2008 commented 5 years ago

Hi Kristi

On this end (customer side) we never got it settled; it is still an issue to be resolved. We are firing this project back up and plan to focus on it more in Q1, starting in January.

Regards,

-Steve
