ERDDAP / erddap

ERDDAP is a scientific data server that gives users a simple, consistent way to download subsets of gridded and tabular scientific datasets in common file formats and make graphs and maps. ERDDAP is a Free and Open Source (Apache and Apache-like) Java Servlet from NOAA NMFS SWFSC Environmental Research Division (ERD).
Creative Commons Zero v1.0 Universal
84 stars 60 forks source link

Relative URLs #96

Open jwithamfugro opened 1 year ago

jwithamfugro commented 1 year ago

Similar request to #27

We have an ERDDAP instance running on a private network machine which cannot be added to DNS. Because we want this machine to have a domain, we've set the baseUrl to the domain we want to use anyway (e.g. "http://erddap.localdomain:8080") and ask each user to modify their LMHOSTS and HOSTS file so their browsers recognise it and can properly navigate ERDDAP.

Obviously, there are issues when someone on the network who doesn't have this localdomain setup tries to navigate ERDDAP from direct IP, as webpage links use the baseUrl. With DNS out of the question, we are considering if it's possible for links on webpages to use relative URLs rather than baseUrl, so that it doesn't matter what name you are using to access ERDDAP. Link sharing isn't that much of a concern, as contradictory as it may sound.

We tried #27's little hack and made the baseUrl "/.", and while that does make webpage links resolve to the domain the user is accessing ERDDAP from, it's an incorrect configuration that causes issues with at least one other part of ERDDAP; namely the URL generation on tabledaps treating the baseUrl as literal and outputting URLs which cannot be used by anyone (eg: "/./erddap/tabledap/datasetname.htmlTable"). To avoid issues like that, we would rather prefer the ability to make webpages respect the domain the user is accessing from regardless of what the baseUrl is.

srstsavage commented 1 year ago

+1, my vote would be a boolean setting to enable ERDDAP to use the values of X-Forwarded-Host, X-Forwarded-Port, and X-Forwarded-Proto to build response URLs, fall back to the Host header if not present, and fall back again to the baseUrl setting if no headers are present.

ChrisPJohn commented 1 year ago

There is code that relies on the baseUrl being set, and using '/.' will likely cause strange behavior, so I agree that's not a good solution. Using relative URLs (or the headers) should be possible, though it will require updating every place that generates URLs individually.

BobSimons commented 1 year ago

The original decision to use absolute rather than relative urls was chance, but I think there are a few places where the absolute urls were very advantageous (sorry, I can't think of where).

If you decide to switch, it should be easy (famous last words), because all uses of baseUrl and baseHttpsUrl are funneled through 1 or 2 methods in EDStatic.

If you switch to relative, one situation to be aware of is all the places where private info is being passed (signing up for subscriptions, logging in, etc). These places rely on a method in EDStatic that pushes people to https.

BobSimons commented 1 year ago

I thought of one of the situations where absolute URLs are an advantage: in a grid of ERDDAPs https://coastwatch.pfeg.noaa.gov/erddap/download/grids.html#grids, this allows the secondary (not officially public) ERDDAPs to push users back to the main, public ERDDAP. I think there are other cases where absolute urls are useful, but can't remember.

So maybe the best solution is to support both absolute and relative urls. This could be decided by the administrator by either supplying (for absolute) or not supplying (for relative) the baseUrl and baseHttpsUrls. Since the construction of the start of the URLs are handled by 1 or 2 methods in EDStatic, a few small changes there should allow for both options. That has the minor secondary benefit of leading to no change for administrators (unless they want relative urls) and no change for users (e.g., if any are screen scraping URLs).

turnbullerin commented 1 year ago

Looking at other web frameworks, this is what they typically do:

  1. Generate relative URLs (relative to site root, so starting with /) by default
  2. Allow methods that generate URLs to request an absolute URL instead (e.g. absolute=TRUE) to override this where it can be useful (notably where the URL is going to be used outside of a web context, like in making an XML metadata document for instance).
  3. Allow ERDDAP installations to configure their default values for absolute URL construction in setup.xml (this will provide defaults for absolute URLs outside of the context of an HTTP request, like when generating the flag URLs in the Daily Report)
  4. Rely on request headers and validated X-Forwarded-* (and maybe the new Forwarded header) to override those default values in setup.xml. Of note, Tomcat provides RemoteIPValve which provides some of this functionality already and it is probably best to use it since it will also help the Tomcat logs have the proper values.

An ERDDAP server that is public facing is then probably best off generating relative urls and using the forwarded headers.

A private ERDDAP server that should redirect links to a public server is probably behind a proxy. The proxy server can then set appropriate X-Forwarded-* headers to ensure the generated links go to the public server (e.g. send X-Forwarded-Host: public-erddap.example.com).

Upside, non-proxied requests to erddap will then work fine as well and generate appropriate relative URLs.

I actually might have some space to do a first pass at a patch for this (it's impacting my own operations in different way). In addition, I'd update how ERDDAP gets the client IP address to use the same header scheme. I'd propose a configuration like

<!-- setup.xml -->

<!-- for backwards compatibility -->
<baseUrl>http://myerddap.example.com:8080</baseUrl>
<baseHttpsUrl>https://myerddap.example.com:8443</baseHttpsUrl>

<!--
  These are the defaults for absolute URLs that are generated where we DON'T have a user request or the user request
  is missing information.

  If the block is omitted, we can pull from baseHttpsUrl to determine the defaults. Https is chosen since, if you have SSL enabled, using it is never bad but not using it can be bad.
-->
<urlSettings>
<host>myerddap.example.com</host>
<protocol>https</protocol>
<prefix>erddap</prefix>  <!-- since you can move the ERDDAP WAR and change the prefix, we should define that here? -->
<port>8080</port>
</urlSettings>
// pseudo-code for URL building

public String make_url(uri_relative_to_prefix, HttpServletRequest request =null, bool make_absolute_url = False) {

  // Defaults are from the urlSettings configuration
  host = urlSettings.host;
  protocol = urlSettings.protocol;
  prefix = urlSettings.prefix;
  port = urlSettings.port;

  // Check and override with the HTTP request values (we can use RemoteIPValve to override all of these)
  if (request !== null) {
    host = request.getServerName();
    protocol = request.getScheme();
    port = request.getServerPort();
    prefix = request.getContextPath();

    // Of note, we can get the remote address like this
    remote_addr = request.getRemoteAddr() ;

  }
  else {
    make_absolute_urls = True;   // we need to make absolute URLs in this case since we have no way to understand relative URLs
  }
  relative_uri = "/" + prefix + "/" + uri_relative_to_prefix;
  if (make_absolute_urls) {
    # Should probably do some sanitization here
    return protocol + "://" + host + ":" + port + relative_uri
  }
  else {
    return relative_uri;
  }
}

RemoteIPValve will support using X-Forwarded-For and X-Forwarded-Proto then to get the Protocol and Remote Address. I think Server Name and Port should be fine if your proxy sets a proper Host header when proxying (to test). Overriding ContextPath might be harder and may need to be manually implemented - some Java frameworks have support at some versions and some don't, but I would propose X-Forwarded-Prefix or X-Forwarded-Path (instead of X-Script-Name) for consistency.

srstsavage commented 1 year ago

Ah, I'd forgotten about our friend RemoteIPValve. We could pretty easily add it into server.xml in the the docker-erddap image here:

https://github.com/axiom-data-science/docker-erddap/blob/main/update-server-xml.sh

Re: the context path, while I like the idea of respecting a X-Forwarded-Path header I think ERDDAP may make assumptions that it's served at /erddap/ in other places and may break if that assumption isn't true, others can correct me.

BobSimons commented 1 year ago

@srstsavage is correct: in several places, ERDDAP code makes the assumption that the server is installed at /erddap/. Among other reasons, this assumption makes it easy to identify an ERDDAP URL (e.g., a data source that is a remote ERDDAP).

BobSimons commented 1 year ago

@turnbullerin, I don't think your use of will work because I think (sorry, it's hard for me to check right now to be certain), that the setup.xml file is read and processed first by turning it into a set of key=value pairs. So using a nested xml construct would require changing all that code (not impossible, but extra work).

And I warn again to everyone who blithely assumes they can easily make significant changes: it may seem easy to make a change (e.g., absolute to relative URLs) and for a "first pass" (as you say) it is, but that ignores all the little places scattered throughout the code that will suddenly become bugs that you won't catch (and which you won't see until various things stop working and people start reporting bugs or just start thinking that ERDDAP is yet another lame, buggy program). And it ignores the fact that there may be situations where the current system has advantages. So again my advice to Chris is: if you decide to make these changes, do them in a way that ensures that the current system can continue to be used with absolutely no changes to the way the current system works, but allow this optional system for people who want to experiment with it (if you choose to support it). But it is up to Chris (or whoever makes the changes which Chris will choose whether or not to accept).

I also want to caution against making changes just because something is now in fashion or "what other frameworks do". Note that netcdf-java development has basically gone sideways (few new features) for the last ~5 years as the code has been refactored and modernized. In this process, all existing code that used the netcdf-java library has been broken (probably a couple of times with different releases) and numerous bugs have been introduced in netcdf-java. It's a mess, and I pity the new developers who are trying to get everything working again. Refactoring should never be undertaken lightly. Bugs are horrible things. They can ruin a software package's reputation. If you refactor, you need to be 100% certain you're not introducing bugs.

I think many people are used to working on small programs (a couple hundred lines), where, if you make a mistake during refactoring, it is easy to find and fix. ERDDAP is now a large complex program. I'm not saying you shouldn't request/make changes like this. I'm just saying that it's a good idea to be far more cautious.

But that's advice from a (re)tired old guy.

srstsavage commented 1 year ago

Ah, right, another reason to keep setup.xml flat is that we'll want to have all the settings assignable via ERDDAP_ environment variables, so maybe <urlHost>, <urlPort>, etc which map to ERDDAP_urlHost, ERDDAP_urlPort, etc or similar would work.

That's a fair and useful warning about the complexity of the codebase and using caution when making changes. I'll offer that this is exactly the kind of thing that a standard, comprehensive jUnit or similar test suite helps with. If the tests run out of the box in any environment and have good coverage, developers can feel more confident that their changes don't introduce unintended bugs because all expected functionality is tested. And obviously this is greatly helpful when upgrading dependencies, Java runtimes, etc. Working up such a suite would obviously be a large undertaking but it seems worthwhile to me if we want to attract a community of contributors.

BobSimons commented 1 year ago

Your comments about tests (especially about being more confident that changes causing bugs) are correct. Note that ERDDAP already has a huge number of tests (although not in jUnit). Those tests have uncovered unexpected problems related to code changes numerous times. I think Chris is evaluating moving the tests to some framework (Maven?). Although I tried to make them all easy to run in any environment (largely via the erddapTest and erddapBigTest directory), there are still a lot of tests dependent on other files or other things (e.g., ERDDAP, Postgres and Cassandra running on the local computer). And many tests occasionally fail because some remote dependency (e.g., a THREDDS or ERDDAP, or a given dataset in them) is temporarily unavailable. And many speed tests fail on a busy computer (e.g., when running lots of tests) but pass on a not busy computer. Basically, it wasn't and still isn't easy to make all of the tests easy for other people to run. It's a very worthy project, but definitely a time-consuming project and one that will be difficult or impossible to complete because ERDDAP interacts with so many types of live external services and changing datasets. It is well worth continuing to work towards making it easy for any developer to run the 80% of the tests that can be easily run.

That said, I think it would be a mistake to go with the common policy of assuming that the tests will catch all problems and that if all of the tests pass then the change can be accepted. ERDDAP is big and complex. There are a lot of subtle details. It's really hard to make the test coverage so complete that it catches all bad consequences of any possible change (e.g., are you going to test every generated URL on every type of web page?). I think it is much better in general to only make changes where you are pretty darn sure you fully understand all of the consequences and then use the tests as a backup system. If that isn't possible, maybe the change is not worth the risk. I know that is not how things are commonly done now but it still seems like the right way to do things. Again, it would be really bad if ERDDAP devolved into yet another buggy, unreliable program.

Again, that's my opinion. Chris is the person who has now accepted the responsibility for deciding which changes are accepted. He's the one who takes responsibility for stewarding ERDDAP and ensuring that ERDDAP is as bug free as possible and he will have to deal with any problems. Please be understanding of that responsibility. So if he decides a given change isn't worth the risk, please respect that decision.

srstsavage commented 1 year ago

All great points! It would be wonderful to be able to use the extensive effort that you put into the existing custom tests in a standard framework. Totally agree that even very comprehensive tests couldn't be blindly trusted, but it could give contributors a good sense that their changes were ready for review by others and prevent wasted time if an unexpected bug was introduced.

In my opinion GeoServer is a great analogue for what we're talking about: the code base is massive and complex, end users expect stability, and they have a very comprehensive test suite which ensures expected behaviors and performance, but there's a small team of project leads who understand the project vision and all parts of the code who can make the final say on all changes before merging.