codelibs / elasticsearch-river-web

Web Crawler for Elasticsearch
Apache License 2.0

How to index secured pages (via Forms authentication) using Elasticsearch service #44

Open srinivasv2 opened 10 years ago

srinivasv2 commented 10 years ago

Hi geeks,

I have a requirement to index secured pages (protected by Forms authentication) using Elasticsearch. I tried the BASIC authentication feature provided by this plugin, but it didn't work for me. Please provide any suggestions.

Thanks, Srinivas V

marevol commented 10 years ago

I think supporting Form authentication would require a different approach. If you cannot bypass the authentication, one option is to put a reverse proxy that handles authentication in front of the site, such as HP IceWall SSO (it is not an OSS product). The reverse proxy logs in to the site with Form authentication automatically and then passes the content to the crawler.

Fapiko commented 10 years ago

Eventually I will toss this in a public repository, but if you're still looking for a solution, I've made a gist with a small Python script I wrote that uses mitmproxy to establish a login session and add the appropriate cookies to all requests going through it. Right now I'm just using it to crawl our internal Confluence server, but eventually I plan to expand it to work with multiple hostnames and rotating session IDs: https://gist.github.com/Fapiko/d3ecfbd58ab156541da9

You'll need to add the mitmproxy CA cert to your Java cacerts keystore if you're crawling anything that is served over SSL.
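
For anyone who wants to see the general shape of the proxy approach described in this thread, here is a minimal sketch. It is not the code from the gist, just an illustration under assumptions: the host name, login URL, and form field names are hypothetical placeholders, and the site is assumed to set its session cookie in response to a plain form POST. The class relies on mitmproxy's addon convention (a `request(flow)` hook and a module-level `addons` list); the login itself uses only the standard library.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def form_login(login_url, fields):
    """POST the login form and return the cookies the site set, as a dict."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    body = urllib.parse.urlencode(fields).encode("utf-8")
    opener.open(login_url, data=body, timeout=30).close()
    return {cookie.name: cookie.value for cookie in jar}


def merge_cookie_header(existing, session_cookies):
    """Merge session cookies into an existing Cookie header value."""
    merged = {}
    for part in (existing or "").split(";"):
        name, sep, value = part.strip().partition("=")
        if sep:
            merged[name] = value
    merged.update(session_cookies)  # session cookies win on a name clash
    return "; ".join(f"{name}={value}" for name, value in merged.items())


class SessionCookieInjector:
    """Logs in once, then adds the session cookies to every request for one host."""

    def __init__(self, host, login_url, fields):
        self.host = host
        self.login_url = login_url
        self.fields = fields
        self.cookies = None  # filled in lazily on the first matching request

    def request(self, flow):
        # Hook called by mitmproxy for each client request passing through.
        if flow.request.host != self.host:
            return
        if self.cookies is None:
            self.cookies = form_login(self.login_url, self.fields)
        flow.request.headers["Cookie"] = merge_cookie_header(
            flow.request.headers.get("Cookie"), self.cookies
        )


# Hypothetical values: replace with the real host, login URL, and field names.
addons = [
    SessionCookieInjector(
        host="wiki.example.internal",
        login_url="https://wiki.example.internal/login",
        fields={"username": "crawler", "password": "secret"},
    )
]
```

Saved as, say, `inject_session.py` (a made-up name), it could be loaded with `mitmdump -s inject_session.py`, with the crawler then configured to send its requests through that proxy. The sketch logs in only once and does not handle session expiry, CSRF tokens in the login form, or more than one host.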