IAU-ADES / ADES-Master

ADES implementation based on a master XML file

Validation of XML Declaration / Preamble #29

Closed stevenstetzler closed 10 months ago

stevenstetzler commented 10 months ago

This PR changes the validation code to detect the presence of <?xml version='1.0' encoding='UTF-8'?> as the first line of a candidate XML file.

After looking into it, I don't see a way to enforce this in the schema currently used for validation. From what I read, a schema validates the structure of the XML contents, while the version and encoding describe how those contents are represented in the file. The XML preamble that declares the version and encoding is therefore out of scope for schema validation.

I then checked whether the absence of a declared version and encoding can be detected when the XML file is parsed. With Python's lxml.etree.parse function, the version and encoding are exposed as the docinfo.xml_version and docinfo.encoding attributes of the returned lxml.etree._ElementTree. However, when the XML preamble is missing from the file, the lxml parser still reports xml_version='1.0' and encoding='UTF-8', so these attributes can't be used to tell whether the preamble is present.
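For comparison only, and not something this PR does: Python's standard-library expat bindings invoke XmlDeclHandler only when a declaration is actually present in the input, so that parser does expose the distinction. A minimal sketch, where declaration_info is a hypothetical helper name and the sample documents are made up:

```python
import xml.parsers.expat


def declaration_info(data: bytes):
    """Return (version, encoding) from the XML declaration, or None if absent.

    Hypothetical helper for illustration; it is not part of the ADES code.
    """
    found = []
    parser = xml.parsers.expat.ParserCreate()
    # expat calls this handler only if the document has an XML declaration.
    parser.XmlDeclHandler = (
        lambda version, encoding, standalone: found.append((version, encoding))
    )
    parser.Parse(data, True)  # True marks this as the final chunk of input
    return found[0] if found else None


with_decl = declaration_info(b"<?xml version='1.0' encoding='UTF-8'?><a/>")
without_decl = declaration_info(b"<a/>")
```

Unlike the regular expression approach, this requires the document to be well-formed, and expat rejects a declaration that is preceded by blank lines.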

I settled for matching the regular expression ^<\?xml.*\?> against the first non-blank line in the XML file: https://github.com/IAU-ADES/ADES-Master/commit/757a0c6712a6ff1fee92292ad9903bf57d212c7e#diff-a4c8786b360a478f5afa1fb5a4f1da23dec09efcfc38833c9437f460851ed499R57-R68 When the candidate file does not have the XML preamble, a message saying so is printed to stdout and also written to an output file.
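A minimal sketch of that kind of check, assuming the file contents are already available as a string. The function name has_xml_declaration and the sample inputs are hypothetical; the linked commit is the authoritative version and may differ in details such as whitespace handling:

```python
import re

# The pattern quoted above: the line starts with "<?xml" and later contains "?>".
XML_DECLARATION = re.compile(r"^<\?xml.*\?>")


def has_xml_declaration(text: str) -> bool:
    """Return True if the first non-blank line of `text` matches the pattern."""
    for line in text.splitlines():
        if line.strip():  # skip empty and whitespace-only lines
            return XML_DECLARATION.match(line) is not None
    return False  # empty or all-blank input has no declaration


ok = has_xml_declaration("<?xml version='1.0' encoding='UTF-8'?>\n<root/>")
missing = has_xml_declaration("<root/>")
```

In this sketch the line is matched as-is, so a declaration preceded by spaces on the same line is reported as missing.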

In addition to this, I cleaned up the validation code to remove duplication between valall.py, valgeneral.py, valsubmit.py, and validate.py, since they were all performing similar actions. During testing I also noticed warnings about needing to quote control characters in the regular expression strings used in packUtils.py and sexVals.py, so I made that change too.
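The warnings are not quoted here, so the following is an assumption: they were presumably Python's invalid escape sequence warnings, raised when a regular string literal contains a backslash sequence such as \d. The usual fix is a raw string. The pattern below is hypothetical and is not taken from packUtils.py or sexVals.py:

```python
import re

# Regular string: each backslash must be doubled to reach the re module intact.
escaped = "^\\d{2}\\s+\\d{2}$"

# Raw string: backslashes are passed through unchanged, with no warning.
raw = r"^\d{2}\s+\d{2}$"

same = escaped == raw                          # both spell the same pattern
matches = re.match(raw, "12 34") is not None   # two digits, whitespace, two digits
```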

federicaspoto commented 10 months ago

Looks good to me @stevenstetzler.