VAMDC / NodeSoftware

Python/Django-based software for running VAMDC data nodes.
http://www.vamdc.eu
GNU General Public License v3.0

Datafields which happen to hold data like '&.+$' (regex - ampersands) are not escaped and destroy XML output #118

Open johannespostler opened 11 years ago

johannespostler commented 11 years ago

Any data field in the database whose content matches the regex '&.+$' breaks the output of all XSAMS files. This regex also matches all HTML entities, e.g. '&auml;'. When such content is written to the output, the ampersand is not escaped, which breaks most browsers and the validator (unless the text happens to be a valid HTML entity). Browsers expect a semicolon as the sixth character after the ampersand.

Testcase: http://ideadb.uibk.ac.at/view/107/

The url field of this scan contains the following characters (as part of the link): 52fed736-74fc-11e2-9a8e-00000aacb35f&acdnat=1360663964_abbc8fd43c6ff547c477bb7648e5250d

Since this is a rather common pattern for URLs (an ampersand separating query parameters), this is a problem.
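The failure can be reproduced outside the node software. The following is a minimal sketch using only the Python standard library; the element name `URL` and the helper `parses` are made up for illustration and are not taken from the XSAMS generator.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Fragment of the URL from the test case above.
raw_url_fragment = (
    "52fed736-74fc-11e2-9a8e-00000aacb35f"
    "&acdnat=1360663964_abbc8fd43c6ff547c477bb7648e5250d"
)


def parses(xml_text):
    """Return True if xml_text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical element name; only the bare '&' matters here.
unescaped_ok = parses("<URL>" + raw_url_fragment + "</URL>")
escaped_ok = parses("<URL>" + escape(raw_url_fragment) + "</URL>")

print(unescaped_ok, escaped_ok)
```

The bare `&acdnat=` is read as the start of an entity reference that never ends in a semicolon, so the parser rejects the document; with the ampersand written as `&amp;` the same content parses.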

ivh commented 11 years ago

This is indeed a problem, and it is related to #83. However, the NodeSoftware cannot know whether the database content is already escaped, and we certainly do not want to escape twice. Therefore the node itself needs to make sure that it does not deliver anything that breaks validation. This can be done either in the database (make an escaped copy of the column in question) or in models.py, with a small method that applies the escape function to the field.
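A rough sketch of the models.py option is shown here. To keep it runnable without Django, a plain class stands in for the model; the names `Source`, `url` and `escaped_url` are assumptions for illustration, not the actual node schema. It also assumes the stored value is not yet escaped.

```python
from xml.sax.saxutils import escape


class Source:
    """Stand-in for a Django model defined in a node's models.py (hypothetical)."""

    def __init__(self, url):
        # In a real model this would be a field, e.g. a CharField.
        self.url = url

    def escaped_url(self):
        """Return the URL with &, < and > replaced by XML entities.

        Assumes self.url holds raw, unescaped text; escaping content
        that is already escaped would escape it a second time.
        """
        return escape(self.url)


source = Source("http://example.org/view?id=107&acdnat=1360663964")
print(source.escaped_url())
```

The node would then have to point its output at the method instead of the raw field; how that mapping is configured depends on the node and is not shown here.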

johannespostler commented 11 years ago

I agree - we cannot just escape by default. Sometimes even I as a database provider don't know what content a field has - e.g. a comment field for one piece of data. I can't rule out that somebody puts a series of ampersands there...

However, we could check whether the content of the URL in a Source is already encoded. The escaping function used (xml.sax.saxutils.escape) seems to be rather intelligent. My workaround will be to unescape and then escape all content of the URL field. This should leave all content in an escaped state.
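The unescape-then-escape workaround can be sketched as follows, assuming only the three entities handled by default in `xml.sax.saxutils` (`&amp;`, `&lt;`, `&gt;`); the function name `normalize_escaping` is made up for illustration.

```python
from xml.sax.saxutils import escape, unescape


def normalize_escaping(value):
    """Return value in escaped form, whether or not it was escaped before.

    unescape() first turns &amp;, &lt; and &gt; back into raw characters,
    then escape() encodes every &, < and > exactly once.
    """
    return escape(unescape(value))


raw = "id=107&acdnat=1360663964"
already_escaped = "id=107&amp;acdnat=1360663964"

print(normalize_escaping(raw))              # id=107&amp;acdnat=1360663964
print(normalize_escaping(already_escaped))  # id=107&amp;acdnat=1360663964
print(escape(already_escaped))              # id=107&amp;amp;acdnat=1360663964
```

Note that `escape()` on its own does escape a second time, as the last line shows; it is the preceding `unescape()` that makes the round trip safe to repeat. One caveat: a field that is meant to contain the literal text `&amp;` cannot be told apart from an escaped ampersand, and other HTML entities such as `&auml;` end up as `&amp;auml;`.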