This PR changes the validation code to detect the presence of `<?xml version='1.0' encoding='UTF-8'?>` as the first line of a candidate XML file.
After looking into it, there does not seem to be a way to enforce this in the schema currently used for validation. From what I read, the schema validates the form of the XML contents, while the version and encoding describe how those contents are represented in the file. So the XML preamble that declares the version and encoding is out of scope for schema validation.
I then checked whether the absence of a declared version and encoding can be detected when the XML file is parsed. When using Python's `lxml.etree.parse` function, the version and encoding are stored in the `docinfo.xml_version` and `docinfo.encoding` attributes of the returned `lxml.etree._ElementTree`. However, when the XML preamble is not included in an XML file, the `lxml` parser defaults to `xml_version='1.0'` and `encoding='UTF-8'`, so these attributes can't be used to detect the presence or absence of the XML preamble in the file.
I settled for performing a regular expression match of `^<\?xml.*\?>` against the first non-blank line in the XML file: https://github.com/IAU-ADES/ADES-Master/commit/757a0c6712a6ff1fee92292ad9903bf57d212c7e#diff-a4c8786b360a478f5afa1fb5a4f1da23dec09efcfc38833c9437f460851ed499R57-R68 If the candidate file does not have the XML preamble, a message saying so is printed to `stdout` and also written to an output file.
In addition to this, I did some cleanup of the validation code to remove duplicate code between `valall.py`, `valgeneral.py`, `valsubmit.py`, and `validate.py`, since they were all performing similar actions. I also noticed warnings in testing about needing to quote control characters in the regular expression strings used in `packUtils.py` and `sexVals.py`, so I made that change too.
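
For reviewers who want the idea without opening the diff, here is a minimal sketch of the first-non-blank-line check described above. It is not the code from the commit: the names `XML_DECLARATION`, `has_xml_declaration`, and `check_file` are hypothetical, and matching the line without stripping leading whitespace is an assumption on my part.

```python
import re
from typing import Iterable

# Pattern from the description, written as a raw string so the
# backslashes reach the regex engine unchanged.
XML_DECLARATION = re.compile(r"^<\?xml.*\?>")


def has_xml_declaration(lines: Iterable[str]) -> bool:
    """Return True if the first non-blank line starts with an XML preamble.

    Hypothetical helper for illustration only.
    """
    for line in lines:
        if line.strip():  # skip empty and whitespace-only lines
            # Assumption: the line is matched as-is, so leading
            # whitespace before "<?xml" counts as a failure.
            return XML_DECLARATION.match(line) is not None
    return False  # no non-blank line at all


def check_file(path: str) -> bool:
    """Hypothetical wrapper: run the check on a file on disk."""
    with open(path, encoding="utf-8") as handle:
        return has_xml_declaration(handle)
```

Because the function takes any iterable of lines, it can be exercised with a plain list, for example `has_xml_declaration(["\n", "<?xml version='1.0' encoding='UTF-8'?>\n"])`. Writing the pattern as a raw string also avoids the kind of escape-sequence warnings mentioned above.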