Translation tooltip images have extra encoding in paths

turnbullerin commented 2 years ago

Hi!

I'm setting up ERDDAP behind an nginx proxy and using the sub module to rewrite the HTML content to have the proper URLs for our host (mostly because I'm planning a federated install at some point in the future and will only have one host name, so I'm doing something like http://publichost/1/erddap --> http://erddap1:8080/erddap, http://publichost/2/erddap --> http://erddap2:8080/erddap).

One issue I've noticed with this is that some links are getting very encoded to the point where it's hard to make good rewrite rules (notably where there's a tooltip and an image) - this seems to be the doing of XML.encodeAsHTMLAttribute() which is converting everything that isn't a digit or letter into an entity. The result is a URL like http://erddap1.org:8080/erddap/index.html is actually http:&x2f;&x2f;erddap1&x2e;8080&x2f;erddap&x2f;index&x2e;png which makes using sub more challenging.

It would be nice for these URLs to be consistent with the rest of the URLs on the page (which are not encoded in this fashion). I think it is safe to include :/. in the characters that encodeAsHTMLAttribute() would not encode at least, and generally only &, quotation marks, and those characters not supported in your character encoding need encoding (so codepoints > 255)

BobSimons commented 2 years ago

I admit that some of the text (e.g., urls) in HTML attributes are excessively encoded.

However, I used to have a method (I think it was called minimallyEncodeAsHTMLAttribute()) which encoded a few more characters than what you suggested to minimally encode text that was stored as HTML attributes. However, that is insufficient for security reasons, notably the danger or cross-site scripting (XSS). See https://wonko.com/post/html-escaping and https://www.owasp.org/index.php/XSS_%28Cross_Site_Scripting%29_Prevention_Cheat_Sheet#Output_Encoding_Rules_Summary [that's now out-of-date, perhaps you can find the newer reference].

You can say, "yeah, but surely you don't need to encode that aggressively". But I think the problem was that ERDDAP was being flagged by security scans because of the possibility of XSS vulnerabilities. The mere presence of certain characters (characters you wouldn't think to be trouble) was flagged.

So the situation is not a simple as one might expect and the safe thing was to aggressively encode the attributes (which has not caused any problems for ~8 years, until now). I suspect the happy medium is somewhere between what minimallyEncodeAsHTMLAttribute did and the current system, but it is exceedingly hard to know what that set of characters that need to be encoded (to avoid trouble with security scans) is.

So, I will try to less aggressively encode those particular urls so the problem goes away for you.

You say "it's hard to make good rewrite rules", but encodeAsHTMLAttributes just has one rule: if the character isn't a letter or a digit, then encode it as &#xhh; (where hh is the hexadecimal version of the character number). That is a legitimate encoding of the HTML attribute. So it seems like a solution for you is simply to decode those HTML attributes via the inverse of that rule. This is a subset of the standard way of encoding HTML attributes. You'll always have to deal with the possibility of some characters being encoded this way. I don't know what tools you have at your disposal in NGINX, but hopefully you can write the inverse of that rule if a built in function doesn't already exist.

I admit I don't quite understand the problem you face. After all, when the browser reads the HTML, one of the first things it does is decode the attributes. It's not like a browser is ever going to redirect anyone to http:&x2f;&x2f;erddap1&x2e;8080&x2f;erddap&x2f;index&x2e;html. User's get redirected to the decoded URL. So the problem is that your system (NGINX?) isn't decoding the attribute, which is something it needs to do in any case. The current encoding may be excessive, but it isn't illegal or incorrect.

BobSimons commented 2 years ago

You said "it's hard to make good rewrite rules", but you didn't say what tools you have at your disposal. Since you say it's hard, I'm guessing it isn't a programming language and uses regular expressions (regex's) instead. I'm guessing you're trying to match the stated URL and replace it with something else, so the real problem is finding the first part of the URL. If I'm guessing correctly, can't you just use a capture group (variant1|variant2|variant3) //but with any number of variants to match the different forms of the first part of the URL?, e.g., (http://my.domain:8080/|http&x3a;&x2f;&x2f;my&x2e;domain&x3a;8080&x2f;) ? I could easily be wrong about your situation. If so, please tell me more of the details of what you need to do and what tools you have (e.g., regex's).

BobSimons commented 2 years ago

I found the updated link for the OWASP Cheatsheets Book: See https://owasp.org/www-pdf-archive/OWASP_Cheatsheets_Book.pdf page 188, section 25.4, says Encoding Type: HTML Attribute Encoding Encoding Mechanism: Except for alphanumeric characters, escape all characters with the HTML Entity &#xHH; format, including spaces. (HH = Hex Value)"

So my current method is following the OWASP advice exactly. I think the security scan report pointed me to this originally, as I wouldn't have found it on my own. So I am disinclined to change the encodeAsHTMLAttribute() method in my code, as it is likely that this will just generate warnings/errors on the security scan report.

It would be great if we could solve the problem you face via a change in your NGINX configuration.

ERDDAP / erddap

Translation tooltip images have extra encoding in paths #81