IATI / js-validator-api

Pure JavaScript IATI validator implementation
GNU Affero General Public License v3.0

The Validator produces an error for valid IATI XML files that have CR as the line feeds #546

Closed simon-20 closed 8 months ago

simon-20 commented 8 months ago

Brief Description The Validator returns an error for any file that contains a bare CR character as a line ending. This occurs for files that use CR exclusively as the line ending, and also for files with mixed line endings.

XML files that use a single CR character as the line ending are, it would seem, valid, because a single CR was the line-ending convention on old Macs. The XML specification says that XML parsers must normalise line endings (both CRLF and a lone CR) to a single LF character. So this bug would seem to suggest that the XML parser being used is not standards-compliant.
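As a rough illustration of the normalisation rule described in the XML specification (section 2.11, End-of-Line Handling), the sketch below translates CRLF pairs and lone CR characters to a single LF. The function name `normaliseLineEndings` is made up for this example; it is not part of the Validator or of libxmljs2.

```javascript
// Hypothetical helper illustrating XML 1.0 end-of-line handling.
// Not taken from the Validator code base.
function normaliseLineEndings(text) {
  // "\r\n?" matches a CRLF pair, or a CR that is not followed by LF.
  return text.replace(/\r\n?/g, '\n');
}

// Mixed input: one CRLF, one bare CR, one LF.
const mixed = '<a>\r\n<b/>\r</a>\n';
console.log(JSON.stringify(normaliseLineEndings(mixed)));
// prints "<a>\n<b/>\n</a>\n"
```

A conforming parser is expected to behave as if this translation had been applied to the input before parsing.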

The files are initially processed with xmllint, and xmllint seems to handle things correctly.

However, the libxmljs2 library throws an error for files containing single CR line feeds.

The exact error returned depends on where the single CR line ending occurs in the file. As a result, the HTTP error status returned by the Validator when encountering these files also varies. Most commonly it is 400, but sometimes it is 422.

This problem came to light because many/most of UNICEF's files have both CRLF and single CR newline sequences in them.

Severity Critical

Issue Location The problem code is in validatorServices.js, but it is the libxmljs2 library used by this code that is the root cause of the problem.

Steps to Reproduce Get a valid IATI XML file, and alter it so that at least one of the newline sequences is a single CR character. Then post it to the Validator.

You will likely see a 400 error.
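One possible way to prepare such a file is sketched below: it replaces the first line ending in a string with a bare CR. The helper name `replaceFirstLineEndingWithCR` is hypothetical, and the sample string is only a placeholder fragment, not a complete valid IATI file.

```javascript
// Hypothetical helper for building a test input; not part of the Validator.
function replaceFirstLineEndingWithCR(xml) {
  // Without the "g" flag, only the first CRLF or LF is replaced.
  return xml.replace(/\r\n|\n/, '\r');
}

// Placeholder content; substitute the text of a real, valid IATI file.
const sample = '<iati-activities>\n  <iati-activity/>\n</iati-activities>\n';
const altered = replaceFirstLineEndingWithCR(sample);

console.log(altered.includes('\r')); // prints true
```

The altered string could then be written out with `fs.writeFileSync` and posted to the Validator.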

simon-20 commented 8 months ago

This turned out not to be a problem with the Validator, but with the way curl was posting the files. When curl's -d/--data option reads the request body from a file, it strips carriage returns and newlines, which is documented behaviour rather than a bug; the --data-binary flag posts the file unchanged. Without that flag, the file arrives at the Validator altered in ways which usually mean it stops being valid XML, and this is why the Validator was returning 400. (CR line endings are valid in text files, including XML files: https://www.w3.org/TR/REC-xml/#sec-line-ends).
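The curl documentation states that when -d/--data reads its data from a file, carriage returns and newlines are stripped out, whereas --data-binary sends the file as-is. The sketch below imitates that stripping in JavaScript to show how it can turn well-formed XML into something a parser rejects. `stripLineBreaksLikeCurlData` is a made-up name, and the function is an approximation based on the documentation, not curl's actual code.

```javascript
// Rough imitation of what curl's -d/--data does to file contents,
// based on the curl documentation; this is not curl's implementation.
function stripLineBreaksLikeCurlData(body) {
  return body.replace(/[\r\n]/g, '');
}

// An element whose attribute is separated from the tag name by a bare CR.
const original = '<activity\rid="1"/>';
const posted = stripLineBreaksLikeCurlData(original);

// The tag name and attribute name have been fused together,
// so the result is no longer well-formed XML.
console.log(posted); // prints <activityid="1"/>
```

In this example the CR was the only whitespace between the element name and its attribute, so removing it breaks the markup.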