TeamHG-Memex / aquarium

Splash + HAProxy + Docker Compose
MIT License

How to disable HAProxy authentication #6

Open onurakman opened 7 years ago

onurakman commented 7 years ago

How can I disable HAProxy authentication (by manually editing haproxy.cfg)? HAProxy passes its authentication info on to the site, and the site returns 401.

candale commented 7 years ago

What I did was comment out the following in the haproxy.cfg file:

...
# Splash Cluster configuration
frontend http-in
    bind *:8050
#
#    http basic auth
#    acl auth_ok http_auth
#    http-request auth realm Splash if !auth_ok
#    http-request allow if auth_ok
#    http-request deny
#
    # don't apply the same limits for non-render endpoints
    acl staticfiles path_beg /_harviewer/
    acl misc path / /info /_debug /debug

    use_backend splash-cluster # if auth_ok !staticfiles !misc
    use_backend splash-misc # if auth_ok staticfiles
    #  use_backend splash-misc # if auth_ok misc
...

nirvana-msu commented 6 years ago

This is incredibly annoying. These credentials are meant to be private, so that one can limit access to a Splash instance facing the internet. Instead, they get forwarded to every website you crawl. Unless you additionally use a proxy, you're effectively letting everyone know where your Splash instance is and what its credentials are. Why is it even done this way? It clearly looks like a bug to me.

P.S. To test this, you can just crawl https://httpbin.org/headers and check the response body (it simply mirrors the request headers). You'll see your Splash credentials in the Authorization header.
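A rough sketch of that check in Python, in case it helps. It takes the body of a response for https://httpbin.org/headers and reports whether an Authorization header was echoed back. The function name `leaked_authorization` is made up for illustration, and the code assumes the body contains one JSON object with a top-level "headers" key, possibly wrapped in HTML when it comes back through a Splash render endpoint.

```python
import json
from typing import Optional


def leaked_authorization(body: str) -> Optional[str]:
    """Return the echoed Authorization value, or None if none is found.

    Assumption: `body` holds a single JSON object (optionally wrapped in
    HTML by the renderer) of the form {"headers": {...}}.
    """
    # Cut out the text between the first "{" and the last "}".
    start, end = body.find("{"), body.rfind("}")
    if start == -1 or end < start:
        return None
    try:
        payload = json.loads(body[start:end + 1])
    except json.JSONDecodeError:
        return None
    headers = payload.get("headers")
    if not isinstance(headers, dict):
        return None
    # Header names are case-insensitive, so compare in lowercase.
    for name, value in headers.items():
        if name.lower() == "authorization":
            return value
    return None
```

If this returns anything other than None for a page fetched through Splash, the credentials reached the target site.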

Is there a workaround to still use HTTP Basic Auth for Splash without passing these credentials on to the websites you crawl?

UPDATE: OK, I solved this by setting the Authorization header in request.meta['splash']['splash_headers'] instead of directly in the request headers, as HttpAuthMiddleware does. I believe the advice to use HttpAuthMiddleware is very dangerous and should be removed from the documentation / README. The correct way is clearly to set these credentials via splash_headers.
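For anyone landing here, a minimal sketch of what that could look like. It only builds the meta dict, using the standard library, so it makes no claims about the rest of your spider. The helper names `basic_auth_value` and `splash_meta` and the example credentials are hypothetical. The only keys taken from the thread and the scrapy-splash request meta are 'splash', 'args' and 'splash_headers'.

```python
import base64
from typing import Optional


def basic_auth_value(username: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def splash_meta(username: str, password: str, args: Optional[dict] = None) -> dict:
    """Return a request.meta dict that carries the credentials in
    splash_headers, i.e. headers meant for the Splash/HAProxy endpoint,
    rather than in the request headers forwarded to the crawled site."""
    return {
        "splash": {
            "args": dict(args or {}),
            "splash_headers": {
                "Authorization": basic_auth_value(username, password),
            },
        }
    }


# Hypothetical usage inside a Scrapy spider with scrapy-splash enabled:
#     yield scrapy.Request(url, self.parse, meta=splash_meta("user", "userpass"))
```

With this approach HttpAuthMiddleware is not needed for the Splash credentials, so the spider's http_user / http_pass attributes can be left unset.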

chipzzz commented 3 years ago

@nirvana-msu Setting those got me past the 401 code, THANK YOU. Although it looks like it still does not go through the Scraper API proxy :(