WebCuratorTool / webcurator-v2-legacy

The Web Curator Tool is a tool for managing the selective web harvesting process. (moved from SourceForge). https://webcurator.slack.com https://webcuratortool.readthedocs.io
http://dia-nz.github.io/webcurator/
Apache License 2.0

Problems with a http reverse proxy #73

Open morpheus-87 opened 5 years ago

morpheus-87 commented 5 years ago

At the moment there are problems when you have an HTTP reverse proxy such as nginx in front of WCT and want to use SSL.

The browser always complains about mixed content, and it is not even possible to log in to the system.

The problem is that all of the templates set a base path for the relative URLs, e.g.: https://github.com/DIA-NZ/webcurator/blob/master/wct-core/src/main/webapp/logon.jsp#L3

This value cannot be configured, because it is always read from the request.

hannakoppelaar commented 5 years ago

Hi @morpheus-87, I guess the problem is that WCT generates absolute URLs that direct the browser around your proxy once it starts to load resources included in the page?

A solution would be to have WCT generate relative URLs (I don't know right now what the technical ramifications of that change would be).

A workaround for now would be to have nginx rewrite the page, using sub_filter to dynamically rewrite those absolute http URLs to https URLs or relative URLs. Would that be feasible?
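
For illustration, the sub_filter workaround described above might look roughly like the following nginx fragment. This is a minimal sketch, not a tested WCT configuration: the host name wct.example.org, the certificate paths, and the backend address localhost:8080 are all assumptions, and the string to match must be adjusted to whatever absolute URL WCT actually emits in its pages (it may include a port). It also assumes nginx was built with the ngx_http_sub_module.

```nginx
server {
    listen 443 ssl;
    server_name wct.example.org;                  # hypothetical public host name

    ssl_certificate     /etc/nginx/ssl/wct.crt;   # hypothetical certificate paths
    ssl_certificate_key /etc/nginx/ssl/wct.key;

    location / {
        proxy_pass http://localhost:8080;         # assumed address of the WCT Tomcat
        proxy_set_header Host $host;

        # sub_filter cannot rewrite compressed bodies, so ask the backend
        # for an uncompressed response.
        proxy_set_header Accept-Encoding "";

        # Rewrite absolute http URLs in the generated HTML to https.
        # Check the page source first: the emitted URL may differ from this guess.
        sub_filter 'http://wct.example.org' 'https://wct.example.org';
        sub_filter_once off;                      # replace every occurrence, not only the first

        # Rewrite Location headers on redirects issued by the backend as well.
        proxy_redirect http://wct.example.org/ https://wct.example.org/;
    }
}
```

By default sub_filter only processes text/html responses; if absolute URLs also appear in other content types, sub_filter_types can be used to extend it.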

morpheus-87 commented 5 years ago

Is this issue fixed in the latest release, @hannakoppelaar?

hannakoppelaar commented 5 years ago

@morpheus-87 No, this one is still open.

morpheus-87 commented 5 years ago

The workaround is working great, thanks for the hint, @hannakoppelaar.

jmvezic commented 4 years ago

I have a similar problem with WCT, but using the Apache proxy in front of WCT. If I want to use Apache to access WCT without the ":8080" part, the built-in browse tool doesn't work.

Using:

    ProxyPass / http://localhost:8080/
    ProxyPassReverse / http://localhost:8080/

the browse tool says "http://harvested.domain/ cannot be found." The browse tool works only when I explicitly include :8080 in the URL.

obrienben commented 4 years ago

@jmvezic Have you also adjusted the browseHelper.prefix setting in wct-core.properties? It might also need to reflect your changes, since it is used for rewriting the links.

I couldn't see how this could be achieved (other than just running Tomcat on port 80). Any ideas @hannakoppelaar ?

As an aside - I would recommend setting up OpenWayback if you can and using that instead of the built-in browse tool, which is very old and tends to leak to the live web quite a lot.

jmvezic commented 4 years ago

@obrienben Yes, I tried that setting both with and without 8080 specified; it always says "cannot be found". What's weird to me is that Apache should be proxying to 8080 regardless. It's as if the browse tool treats the port as part of the resource ID or something? I'm not sure.

What happens is (with the Apache proxypass 80 -> 8080 set):

As for OpenWayback, that will definitely be the final solution, public-facing. The idea was to use the built-in browse tool just to make sure the site was harvested without major issues, so I was trying to avoid two instances of OpenWayback (one for QA, one for live website viewing).

obrienben commented 4 years ago

@jmvezic I don't think the tool is using port 8080 specifically. I found that if you switch Tomcat and all the WCT properties to port 80, it browses harvests fine.

My advice would be not to rely on the built-in viewer as the only QA tool. In my experience it tends to hide web harvest problems because it is prone to leaking to the live web. At NLNZ we use OWB (with BDB indexing to keep it simple) for QA in WCT, and I know others are using OWB and PyWB.