SasView / sasdata

Package for loading and handling SAS data
BSD 3-Clause "New" or "Revised" License

Remove need to pin lxml to below version 5.0 #64

Closed: butlerpd closed 2 months ago

butlerpd commented 8 months ago

As of Feb 12, 2024, lxml is pinned to versions below 5.0 (see PR #63) because unit tests fail on macOS and Ubuntu. The root cause needs to be investigated so that this restriction can be removed; otherwise we stay permanently pinned to what will become an ancient version.

llimeht commented 3 months ago

The culprit appears to be a change in the schema validation function in libxml2 that lxml uses. I had noted earlier that the tests were still passing in Debian even though they were failing in Ubuntu - that is no longer the case.

test/sasdataloader/utest_cansas.py::cansas_reader_xml::test_invalid_cansas fails with libxml2 from unstable (2.12) but passes with libxml2 in testing (2.9).

The xsd [1] for recognising partly broken cansas files is to blame. The multiple xsd:any entries in the definition of SASentryType (line 70) make the content model ambiguous: with three groups in the sequence (any, SASdata, any), there are multiple ways to divide a run of child elements among them.

[1] sasdata/dataloader/readers/schema/cansas1d_invalid_v1_0.xsd
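
To make the ambiguity concrete, here is a toy model in Python. It is not the actual schema: it assumes the wildcard groups match any element (including SASdata) and that the middle group is one or more SASdata elements. The function name and the sample child names are made up for illustration.

```python
from typing import List, Tuple


def enumerate_splits(children: List[str]) -> List[Tuple[int, int]]:
    """List every (i, j) such that children[:i] goes to the leading wildcard,
    children[i:j] is a non-empty run of SASdata, and children[j:] goes to the
    trailing wildcard. More than one result means the split is ambiguous."""
    splits = []
    n = len(children)
    for i in range(n + 1):
        for j in range(i + 1, n + 1):
            # The wildcards accept anything, so only the middle run constrains the split.
            if all(name == "SASdata" for name in children[i:j]):
                splits.append((i, j))
    return splits


# Hypothetical child list with two SASdata elements.
print(enumerate_splits(["Title", "SASdata", "SASdata", "SASnote"]))
# → [(1, 2), (1, 3), (2, 3)]
```

Each extra SASdata could be claimed by either a wildcard or the SASdata group, which is the kind of choice a deterministic content model is not allowed to leave open.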

A test to run outside of the test harness is:

xmllint --noout \
  --schema sasdata/dataloader/readers/schema/cansas1d_invalid_v1_0.xsd \
  test/sasdataloader/data/cansas1d_notitle.xml

A namespace warning is OK; a "content model is not determinist" error or a "Schemas validity error" is not.
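
For scripting the same check, a minimal Python sketch could wrap the command and look only for the two failure messages named above. The helper names are hypothetical, and it assumes xmllint is on the PATH and that matching on those substrings is enough to tell the cases apart.

```python
import subprocess
from typing import List

# Substrings that the comment above treats as failures; anything else
# (for example a namespace warning) is ignored.
FAILURE_MARKERS = (
    "content model is not determinist",
    "Schemas validity error",
)


def failure_markers_in(output: str) -> List[str]:
    """Return the failure markers that appear in xmllint's output."""
    return [marker for marker in FAILURE_MARKERS if marker in output]


def run_xmllint(schema: str, xml_file: str) -> List[str]:
    """Run the xmllint command from above and report which markers it printed."""
    proc = subprocess.run(
        ["xmllint", "--noout", "--schema", schema, xml_file],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit status is expected when validation fails
    )
    return failure_markers_in(proc.stdout + proc.stderr)
```

Calling run_xmllint with the schema and data paths from the command above, from the repository root, returns an empty list when neither failure message is printed.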