SasView / sasdata

Package for loading and handling SAS data
BSD 3-Clause "New" or "Revised" License

Fix tests (and loading slightly invalid cansas 1.0 data) with lxml ≥ 5 #81

Closed llimeht closed 2 months ago

llimeht commented 3 months ago

A recent libxml2 version (somewhere between 2.9 and 2.12), as used by lxml ≥ 5, became stricter about validating XSD schemas. The schema is validated as part of checking whether a file is valid XML (xml_reader.py::XMLreader::validate_xml()), which sasdata uses to check whether a file matches one of the cansas formats (cansas_reader.py::Reader::is_cansas()). The result is that is_cansas raises an exception rather than returning a bool for the file format, which causes the test suite to fail.

The problematic part of cansas1d_invalid_v1_0.xsd [1] is where it tries to give SASentryType (line 70) lots of flexibility to allow for missing elements, but in doing so the schema becomes ambiguous in the eyes of libxml2. The issue is the three groups in the sequence (any, SASdata, any); if I have understood the problem correctly, this is ambiguous because there are multiple ways to divide the sequence, with the any entries also able to gobble up the SASdata elements.

[1] sasdata/dataloader/readers/schema/cansas1d_invalid_v1_0.xsd
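To make the ambiguity described above concrete, here is a toy model in plain Python. It is not how libxml2 checks a schema; it only counts how many ways a list of child element names can be divided into the three groups (any, SASdata, any). The function name count_splits and the sample child list are made up for illustration, and the model assumes the middle group needs at least one SASdata.

```python
from typing import Sequence


def count_splits(children: Sequence[str], target: str = "SASdata") -> int:
    """Count the ways to divide children into (any*, target+, any*).

    The leading and trailing groups accept any element, including the
    target itself, which is what makes more than one division possible.
    """
    n = len(children)
    ways = 0
    # children[i:j] is the candidate middle group: it must be non-empty
    # and consist only of the target element.
    for i in range(n):
        for j in range(i + 1, n + 1):
            if all(name == target for name in children[i:j]):
                ways += 1
    return ways


# Hypothetical SASentry children, in document order.
entry = ["Title", "Run", "SASdata", "SASdata", "SASsample"]
print(count_splits(entry))  # 3
```

With two adjacent SASdata elements there are three possible divisions, because either any group is free to absorb one of them. This is likely related to what XSD 1.0 calls the Unique Particle Attribution constraint, though that is my reading rather than something confirmed from the libxml2 error.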

This PR addresses the issue by changing the sequence to a repeatable choice between the known metadata elements, SASdata, and a final arbitrary any. While a sequence requires elements to be in a strict order, the repeatable choice allows them to be in any order and also to appear multiple times. That actually provides slightly more freedom than the current schema for a file to be invalid cansas but still readable by sasdata.
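The difference between a strict sequence and a repeatable choice can be sketched with regular expressions over element names. This is only an analogy for the two content models, not the actual schema: the three element names are a small illustrative subset, the trailing any is left out, and the helper names are hypothetical.

```python
import re
from typing import Sequence


def encode(children: Sequence[str]) -> str:
    """Join child element names into one string, one space after each name."""
    return "".join(name + " " for name in children)


# Sequence-like model: Title, then Run(s), then SASdata, in that order.
SEQUENCE = re.compile(r"Title (?:Run )+(?:SASdata )+")

# Repeatable-choice model: any of the known names, any order, any count.
CHOICE = re.compile(r"(?:(?:Title|Run|SASdata) )*")


def valid_sequence(children: Sequence[str]) -> bool:
    return SEQUENCE.fullmatch(encode(children)) is not None


def valid_choice(children: Sequence[str]) -> bool:
    return CHOICE.fullmatch(encode(children)) is not None


in_order = ["Title", "Run", "SASdata"]
shuffled = ["SASdata", "Title", "Run", "SASdata"]

print(valid_sequence(in_order), valid_choice(in_order))  # True True
print(valid_sequence(shuffled), valid_choice(shuffled))  # False True
```

The choice model accepts everything the sequence model accepts, plus reordered and repeated elements, which is the extra freedom mentioned above. At each step the next element name matches exactly one alternative, so there is only one way to read any given list of children.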

This seems to work fine, returning the correct number of SASdata elements when reading in the test files, which was the main thing I was concerned would be wrong from this change. (However, I'm far from an xsd expert.)
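For anyone wanting to cross-check the SASdata counts independently of the sasdata loader, a rough sketch using only the standard library follows. It does not use sasdata's reader at all; the inline document is a made-up sample rather than one of the test files, and count_sasdata is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from typing import List

# Made-up cansas-like document: two SASentry elements holding two and one
# SASdata children respectively.
SAMPLE = """<SASroot xmlns="cansas1d/1.0" version="1.0">
  <SASentry>
    <Title>example</Title>
    <SASdata><Idata/></SASdata>
    <SASdata/>
  </SASentry>
  <SASentry>
    <SASdata/>
  </SASentry>
</SASroot>"""


def local_name(tag: str) -> str:
    """Strip a leading '{namespace}' from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def count_sasdata(xml_text: str) -> List[int]:
    """Return the number of direct SASdata children for each SASentry."""
    root = ET.fromstring(xml_text)
    counts = []
    for element in root.iter():
        if local_name(element.tag) == "SASentry":
            counts.append(
                sum(1 for child in element if local_name(child.tag) == "SASdata")
            )
    return counts


print(count_sasdata(SAMPLE))  # [2, 1]
```

Comparing these counts with what the loader returns for the same file would show whether the schema change altered how many SASdata elements are picked up.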

Closes: #64