ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Allow serving the dashboard with https:// #83

Open rwoodpecker opened 8 years ago

rwoodpecker commented 8 years ago

I know this is a bit of an indulgent question, but I'm looking to monitor the dashboard on a fairly restrictive network and am keen for the grab-site dashboard to use HTTPS so I don't have to use a VPN just for the sake of the dashboard.

I've tried a few different reverse proxies using NGINX and Caddy but I just seem to be served a blank page from grab-site. I figure the (Python?) webserver grab-site uses might not work well behind a traditional web server acting as a reverse proxy, but that's a bit beyond my scope. I just wanted to see if anyone had managed to get this to work? Or is it trivial to insert an SSL certificate to be used directly by the grab-site dashboard?

Thanks!

ivan commented 8 years ago

I use an ssh tunnel for this:

autossh -f -C -L 127.0.0.1:29000:127.0.0.1:29000 user@hostname -N

Remember to use GRAB_SITE_INTERFACE=127.0.0.1 as well when running gs-server to avoid leaking the dashboard to everyone.

If an SSH tunnel does not work for your use case, try looking at the nginx error log (and perhaps increasing verbosity?). Since grab-site 0.11, HTTP/1.0 does not work and you might need to configure nginx to use HTTP/1.1 for talking to the backend webserver (grab-site).

(2017 edit: I now use WireGuard instead of SSH tunnels and I recommend it.)

ivan commented 8 years ago

Hmm, actually, an nginx setup might be tricky because you would also have to reverse-proxy the WebSocket endpoint: http://nginx.org/en/docs/http/websocket.html - which I have not tested at all.

ivan commented 8 years ago

It shouldn't be too difficult to add SSL support to gs-server. It looks like create_server here just needs to get passed an SSL context: https://docs.python.org/3/library/asyncio-eventloop.html#creating-listening-connections

This would not provide as much security as an SSH tunnel, though, because the SSL security model is broken without a restricted set of trusted CAs or certificate pinning.
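
A minimal sketch of the idea above, assuming a plain asyncio server rather than gs-server's actual code. The names make_ssl_context, start_dashboard and handle are hypothetical, and the certificate paths would have to be supplied by the user. It only shows that asyncio accepts an SSL context through the ssl= argument:

```python
import asyncio
import ssl


def make_ssl_context(certfile=None, keyfile=None):
    """Return a server-side SSLContext, or None to keep plain TCP."""
    if certfile is None:
        return None
    # Purpose.CLIENT_AUTH builds a context suitable for the server side.
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    ctx.load_cert_chain(certfile=certfile, keyfile=keyfile)
    return ctx


async def handle(reader, writer):
    # Placeholder handler (assumption); the real server speaks WebSocket here.
    writer.close()


async def start_dashboard(host="127.0.0.1", port=29000, certfile=None, keyfile=None):
    # asyncio.start_server forwards ssl= to loop.create_server.
    return await asyncio.start_server(
        handle, host, port, ssl=make_ssl_context(certfile, keyfile)
    )
```

With no certificate given, the context is None and the listener stays plain TCP, so existing setups would be unaffected.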

rwoodpecker commented 8 years ago

I'm not really too concerned about the connection getting MITMed or anything like that with my own certificate (and I'm sure/hope most users wanting to use SSL would be aware of this limitation).

Direct support in gs-server would be lovely! I'll definitely look into the SSH tunnel though, thanks.

12As commented 8 years ago

I've been looking at adding TLS support to grab-site. I believe that it would require 3 things to work:

  1. Add an SSLContext to gs-server after it is passed the appropriate certificate file.
  2. Add a check to wpull_hooks.py (probably an environment variable) so that the method connect_to_server uses an SSLContext on the actual connection, if detected.
  3. Add a check to the dashboard.html to set the websocket protocol to ws: or wss: depending on whether location.protocol is http: or https:, respectively
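
Steps 2 and 3 above could look roughly like the sketch below. It is an illustration only: GRAB_SITE_SSL is an invented variable name, the helper names are hypothetical, and the step 3 logic would really live in the JavaScript of dashboard.html. Python is used here just to show the decision being made:

```python
import os
import ssl


def websocket_scheme(page_protocol):
    """Step 3: map the page's location.protocol to a WebSocket scheme."""
    return "wss:" if page_protocol == "https:" else "ws:"


def client_ssl_context(environ=os.environ):
    """Step 2: return an SSLContext only when the opt-in variable is set."""
    # GRAB_SITE_SSL is a hypothetical name, not an existing grab-site setting.
    if environ.get("GRAB_SITE_SSL") != "1":
        return None
    return ssl.create_default_context()


print(websocket_scheme("https:"))  # wss:
print(websocket_scheme("http:"))   # ws:
```

Returning None when the variable is unset keeps the connection on plain TCP, which matches the "if detected" wording in step 2.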

ivan commented 8 years ago

That sounds about right.

Users should be able to have gs-server listen on both TCP and SSL since they might want to avoid doing SSL between grab-site instances and gs-server, or because they have old TCP-only crawls running.

dashboard's ?host= might need to be renamed to ?ws= and take either a ws:// or wss:// URI.

12As commented 8 years ago

Python is not low enough level to listen for both SSL and non-SSL on the same port. Additionally, you cannot listen on both 0.0.0.0 and 127.0.0.1 on the same port at the same time.

Solutions seem to be: 1) implementing https://github.com/ludios/grab-site/issues/86 in conjunction with manually managing the Ethernet interface address or using Unix sockets; or 2) using Python 3.5, which should, according to the docs (I haven't actually tested this), allow two programs to listen on the same port when set up properly. Even then it is a gamble as to which program the kernel connects you to, and you would still need to --upstream to the other gs-server.

ivan commented 8 years ago

I never meant that the TCP and SSL listener have to be on the same port, just that it should be possible to have both of them running on different ports.
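
A rough sketch of running both listeners on different ports, again assuming a plain asyncio server and not gs-server's real code. Port 29001 for the SSL listener, and the names listener_specs, start_listeners and handle, are assumptions made for illustration:

```python
import asyncio


def listener_specs(tcp_port, ssl_port, ssl_context):
    """Return (port, ssl) pairs: always plain TCP, plus SSL if a context is given."""
    specs = [(tcp_port, None)]
    if ssl_context is not None:
        if ssl_port == tcp_port:
            raise ValueError("TCP and SSL listeners need different ports")
        specs.append((ssl_port, ssl_context))
    return specs


async def handle(reader, writer):
    # Placeholder handler (assumption); the real server speaks WebSocket here.
    writer.close()


async def start_listeners(host="127.0.0.1", tcp_port=29000, ssl_port=29001,
                          ssl_context=None):
    # One asyncio server per spec; ssl=None means a plain TCP listener.
    return [
        await asyncio.start_server(handle, host, port, ssl=ctx)
        for port, ctx in listener_specs(tcp_port, ssl_port, ssl_context)
    ]
```

Old TCP-only crawls could keep using the first port while new ones connect to the SSL port.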

Ghabry commented 4 years ago

This is not very hard to achieve with nginx and works with HTTPS. The following config makes grab-site available at "your-address/grab-site/"

location /grab-site/ {
    proxy_pass http://localhost:29000/;
    proxy_set_header Host $http_host;

    # These 3 lines are optional
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header Authorization "";

    # Required for the web socket
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
}

Security warning: the config above has no authentication! You should add something like this for security:

auth_basic "Login required";
auth_basic_user_file /etc/htpasswd;

You also have to patch libgrabsite/dashboard.html, because the path for the websocket is wrong: it will be hostname/stream. Maybe this should be upstreamed; I see no disadvantages in doing so.

Change line 1455 from this.host = location.host to this.host = location.toString().replace(/^.*\/\//, "") then the web socket will connect to "your-host/grab-site/stream" and work :)
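
To see what that replacement does, here is the same regular expression applied in Python. This is only an illustration of the string transformation; websocket_host is a hypothetical name, and the real change is the JavaScript one-liner above:

```python
import re


def websocket_host(page_url):
    """Python equivalent of location.toString().replace(/^.*\\/\\//, "")."""
    # Greedy match: strips everything up to and including the last "//".
    return re.sub(r"^.*//", "", page_url, count=1)


print(websocket_host("https://your-address/grab-site/"))
# your-address/grab-site/
```

Because the match is greedy, a page URL that contains a second "//" (for example inside a query string) would be cut at the last one, so this assumes the dashboard is opened at a plain URL.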

acrois commented 3 years ago

I referred to this in #192.

I believe the quickest and most flexible way to do it would be to put nginx in front of the dashboard. It would need very little change to the application; its usage would only need to be documented in an example.

An upstream patch that dynamically figures out the correct host name would be best. However, I just wanted to throw another idea out there and show that it isn't strictly required for support to be achieved.

You can also use something like nginx's sub_filter, which also works with other apps, so you can take this concept and apply it to other apps in a similar way.

That's not to say there aren't any benefits to TLS termination within the server, but I don't think the dashboard would benefit from complicating the implementation with direct support at this time.