A recent libxml2 release (somewhere between 2.9 and 2.12), as used by lxml ≥ 5, became stricter about validating XSD schemas. The schema is validated as part of checking whether a file is valid XML (xml_reader.py::XMLreader::validate_xml()), which sasdata uses to check whether a file matches one of the cansas formats (cansas_reader.py::Reader::is_cansas()). As a result, is_cansas raises an exception instead of returning a bool for the file format, which has caused the test suite to fail.
The problematic part of cansas1d_invalid_v1_0.xsd [1] is where it tries to give SASentryType (line 70) lots of flexibility to allow for missing elements, but in doing so the schema becomes ambiguous in the eyes of libxml2. The issue is the three groups in the sequence (any, SASdata, any); if I have understood the problem correctly, this is ambiguous because there are multiple ways to divide the sequence, with the any entries also able to gobble up the SASdata elements.
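
To make the suspected ambiguity concrete, here is a toy model in plain Python. It is not how libxml2 checks a schema; it only counts how many ways a list of child element names could be divided into the three groups. It assumes the any groups accept every element name (including SASdata) and that the middle group needs at least one SASdata; both are assumptions about the schema, and count_splits is a hypothetical helper.

```python
def count_splits(children):
    """Count ways to divide children into (any*, SASdata+, any*).

    Toy model only: assumes the wildcard groups accept any element name,
    including SASdata, and that the middle group needs one or more SASdata.
    """
    n = len(children)
    ways = 0
    # i is where the SASdata group starts, j is one past where it ends.
    for i in range(n):
        for j in range(i + 1, n + 1):
            if all(name == "SASdata" for name in children[i:j]):
                ways += 1
    return ways


# Hypothetical SASentry children with two SASdata elements.
entry = ["Title", "Run", "SASdata", "SASdata", "SASsample"]
print(count_splits(entry))
```

With two SASdata children the model finds more than one valid division, because either any group may claim one of them. That is the kind of non-determinism a stricter validator can reject.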
This PR addresses the issue by changing the sequence to a repeatable choice between the known metadata elements, SASdata, and a final arbitrary any. While sequence requires elements to appear in a strict order, the repeatable choice allows them to appear in any order and any number of times. That actually provides slightly more freedom than the current schema for a file to be invalid cansas but still readable by sasdata.
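
The effect of a repeatable choice can be sketched the same way. In this toy model each child is matched on its own, independent of position, so order and repetition do not matter. The KNOWN_METADATA set and the classify helper are illustrative, and the sketch assumes the wildcard branch only picks up names not listed explicitly, which may not match the actual schema.

```python
# Illustrative list of metadata element names; the real schema may differ.
KNOWN_METADATA = {"Title", "Run", "SASsample", "SASinstrument",
                  "SASprocess", "SASnote"}


def classify(name):
    """Return the single choice branch a child element name falls into."""
    if name == "SASdata":
        return "SASdata"
    if name in KNOWN_METADATA:
        return "metadata"
    return "any"  # assumed: the wildcard only takes otherwise-unlisted names


def match_repeatable_choice(children):
    """Match each child independently, as a repeated choice would."""
    return [classify(name) for name in children]


ordered = ["Title", "Run", "SASdata", "SASdata", "SASsample"]
shuffled = ["SASdata", "SASsample", "Title", "SASdata", "Extra", "Run"]
for children in (ordered, shuffled):
    branches = match_repeatable_choice(children)
    print(branches.count("SASdata"), branches)
```

Under these assumptions every child has exactly one possible branch, so there is nothing left for the validator to disambiguate, and the SASdata count is the same whatever the order.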
This seems to work fine, returning the correct number of SASdata elements when reading in the test files, which was the main thing I was concerned this change would get wrong. (However, I'm far from an XSD expert.)
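
For anyone wanting to repeat that check without going through the sasdata reader, a minimal standard-library sketch is below. The inline SAMPLE document and the count_sasdata helper are made up for illustration, and the namespace string is assumed to be the cansas 1.0 one; swap in a real test file and the namespace it declares.

```python
import xml.etree.ElementTree as ET

NS = "cansas1d/1.0"  # assumed namespace; use whatever the file declares

# Made-up document with SASdata elements interleaved with metadata.
SAMPLE = """\
<SASroot xmlns="cansas1d/1.0">
  <SASentry>
    <Title>example</Title>
    <SASdata/>
    <Run>1</Run>
    <SASdata/>
  </SASentry>
</SASroot>
"""


def count_sasdata(xml_text, ns=NS):
    """Return the number of direct SASdata children for each SASentry."""
    root = ET.fromstring(xml_text)
    return [
        len(entry.findall(f"{{{ns}}}SASdata"))
        for entry in root.iter(f"{{{ns}}}SASentry")
    ]


print(count_sasdata(SAMPLE))
```

Comparing these counts with the number of data sets the loader returns is a quick way to spot SASdata elements being swallowed or dropped.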
[1] sasdata/dataloader/readers/schema/cansas1d_invalid_v1_0.xsd

Closes: #64