ERDDAP / erddap

ERDDAP is a scientific data server that gives users a simple, consistent way to download subsets of gridded and tabular scientific datasets in common file formats and make graphs and maps. ERDDAP is a Free and Open Source (Apache and Apache-like) Java Servlet from NOAA NMFS SWFSC Environmental Research Division (ERD).

HTML Entity Name munging in XML listings #103

Open benjwadams opened 1 year ago

benjwadams commented 1 year ago

ERDDAP does some bizarre name munging to HTML entities in XML listings.

For example, in https://gcoos4.tamu.edu/erddap/metadata/iso19115/xml/ there are numerous href values like this: 2004JuvenileSportfishNOAA_DATA_Mean_v0_0_iso19115.xml

Most browsers will transform this, but I have had issues following links in some Python libraries if these HTML entities aren't explicitly decoded beforehand. It's also a pretty odd way to represent simple characters like periods and underscores, where the literal characters would suffice. Is there any reason why these characters shouldn't be used instead of being encoded as HTML entities?
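
A minimal sketch of the workaround on the Python side, assuming the library returns the raw attribute text without decoding character references. The helper name `resolve_href` and the sample file name are hypothetical; `html.unescape` and `urllib.parse.urljoin` are standard-library calls.

```python
import html
from urllib.parse import urljoin


def resolve_href(base_url: str, raw_href: str) -> str:
    """Decode HTML character references in an href, then resolve it against the listing URL."""
    decoded = html.unescape(raw_href)  # "&#x5f;" -> "_", "&#x2e;" -> "."
    return urljoin(base_url, decoded)


# Listing URL from this issue; the href below is a made-up example, not a real dataset.
base = "https://gcoos4.tamu.edu/erddap/metadata/iso19115/xml/"
raw = "example&#x5f;dataset&#x5f;iso19115&#x2e;xml"
print(resolve_href(base, raw))
```

Parsers that already decode attribute values would not need this step; it only matters when the href is pulled out as raw text.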

BobSimons commented 1 year ago

It is the attributes of HTML and XML tags that must be strongly encoded, for security reasons. The code that does this is in com/cohort/util/XML.java in the method called encodeAsHTMLAttribute. The JavaDoc for that method explains:

 * For security reasons, for text that will be used as an HTML or XML attribute, 
 * this replaces non-alphanumeric characters with HTML Entity &#xHHHH; format.
 * See HTML Attribute Encoding at
 * https://owasp.org/www-pdf-archive/OWASP_Cheatsheets_Book.pdf
 * pg 188, section 25.4 
 * "Encoding Type: HTML Attribute Encoding
 * Encoding Mechanism: 
 * Except for alphanumeric characters, escape all characters with the HTML Entity &#xHH;
 * format, including spaces. (HH = Hex Value)".
 * On the need to escape HTML attributes: http://wonko.com/post/html-escaping

Both of the links there are interesting reading.

One might argue that in some circumstances this strict encoding is not necessary. Perhaps. Perhaps not. The problem is that trying to make that determination is very time-consuming (even if we assume the programmer has 100% understanding of the situation) and error-prone. It is vastly simpler and (more important) vastly safer to just routinely encode all attributes in the safe and recommended way.
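
As a rough illustration of the rule quoted in the JavaDoc above, the following Python sketch escapes every character that is not an ASCII letter or digit as a hexadecimal character reference. It is not the code in XML.java: the function name is a Python rendering chosen for this example, and the lowercase, unpadded hex digits are assumptions.

```python
import html


def encode_as_html_attribute(text: str) -> str:
    """Escape everything except ASCII letters and digits as &#xHH; references (illustrative only)."""
    out = []
    for ch in text:
        if ch.isascii() and ch.isalnum():
            out.append(ch)  # alphanumerics pass through unchanged
        else:
            out.append(f"&#x{ord(ch):x};")  # hex casing and width are assumptions
    return "".join(out)


encoded = encode_as_html_attribute("v0_0.xml")
print(encoded)                 # v0&#x5f;0&#x2e;xml
print(html.unescape(encoded))  # v0_0.xml
```

Because the rule does not depend on where the attribute is used, no per-case judgment is needed, and any consumer that decodes character references recovers the original text.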