Open dalepotter opened 6 years ago
Noting that relevant tests are in test_data.py#TestDatasetWithEncoding
An XML string must not have any leading whitespace, as both these examples do.
https://github.com/IATI/pyIATI/blob/59588100911f4fd17a4012c0c2bc9632cc20efbd/iati/data.py#L87 undertakes some amount of stripping of leading and trailing whitespace, though an explicit encoding may cause complications. There is currently only one test relating to leading whitespace - this doesn't seem fully comprehensive!
>>> dataset_xml_declaration_with_encoding_2 = iati.Dataset("""<?xml version="1.0"?>
... <iati-activities version="xx">
... <iati-activity>
... <iati-identifier></iati-identifier>
... <reporting-org type="xx" ref="xx"><narrative>Organisation name</narrative></reporting-org>
... <title>
... <narrative>Xxxxxxx</narrative>
... </title>
... <description>
... <narrative>Xxxxxxx</narrative>
... </description>
... <participating-org role="xx"></participating-org>
... <activity-status code="xx"/>
... <activity-date type="xx" iso-date="2013-11-27"/>
... <activity-date type="xx" iso-date="2013-11-27">
... <narrative>Xxxxxxx</narrative>
... </activity-date>
... </iati-activity>
... </iati-activities>
... """)
>>> iati.validator.is_xml(dataset_xml_declaration_with_encoding_2)
True
As premised, the problem is that the provided string is not valid XML because it contains leading whitespace. The issue is therefore an explicit encoding in combination with leading whitespace (the automatic removal of which is deemed to be a feature of pyIATI).
I will update the title to better reflect this.
I think the wrong string was tested! With no leading whitespace the same results come back...
>>> dataset_xml_declaration_with_encoding_2 = iati.Dataset("""<?xml version="1.0"?>
... <iati-activities version="xx">
... <iati-activity>
... <iati-identifier></iati-identifier>
... <reporting-org type="xx" ref="xx"><narrative>Organisation name</narrative></reporting-org>
... <title>
... <narrative>Xxxxxxx</narrative>
... </title>
... <description>
... <narrative>Xxxxxxx</narrative>
... </description>
... <participating-org role="xx"></participating-org>
... <activity-status code="xx"/>
... <activity-date type="xx" iso-date="2013-11-27"/>
... <activity-date type="xx" iso-date="2013-11-27">
... <narrative>Xxxxxxx</narrative>
... </activity-date>
... </iati-activity>
... </iati-activities>
... """)
>>> iati.validator.is_xml(dataset_xml_declaration_with_encoding_2)
True
>>> dataset_xml_declaration_with_encoding_3 = iati.Dataset("""<?xml version="1.0" encoding="UTF-8"?>
... <iati-activities version="xx">
... <iati-activity>
... <iati-identifier></iati-identifier>
... <reporting-org type="xx" ref="xx"><narrative>Organisation name</narrative></reporting-org>
... <title>
... <narrative>Xxxxxxx</narrative>
... </title>
... <description>
... <narrative>Xxxxxxx</narrative>
... </description>
... <participating-org role="xx"></participating-org>
... <activity-status code="xx"/>
... <activity-date type="xx" iso-date="2013-11-27"/>
... <activity-date type="xx" iso-date="2013-11-27">
... <narrative>Xxxxxxx</narrative>
... </activity-date>
... </iati-activity>
... </iati-activities>
... """)
>>> iati.validator.is_xml(dataset_xml_declaration_with_encoding_3)
False
The error log tells us more...
>>> err_log = iati.validator.validate_is_xml(dataset_xml_declaration_with_encoding_3)
>>> len(err_log)
1
>>> err_log[0].name
'err-not-xml-not-string'
>>> err_log[0].info
"The value provided is a `<class 'str'>` rather than a `str`."
However, when it is encoded to a bytes object all is well...
>>> dataset_xml_declaration_with_encoding_3 = iati.Dataset("""<?xml version="1.0" encoding="UTF-8"?>
... <iati-activities version="xx">
... <iati-activity>
... <iati-identifier></iati-identifier>
... <reporting-org type="xx" ref="xx"><narrative>Organisation name</narrative></reporting-org>
... <title>
... <narrative>Xxxxxxx</narrative>
... </title>
... <description>
... <narrative>Xxxxxxx</narrative>
... </description>
... <participating-org role="xx"></participating-org>
... <activity-status code="xx"/>
... <activity-date type="xx" iso-date="2013-11-27"/>
... <activity-date type="xx" iso-date="2013-11-27">
... <narrative>Xxxxxxx</narrative>
... </activity-date>
... </iati-activity>
... </iati-activities>
... """.encode())
>>> iati.validator.is_xml(dataset_xml_declaration_with_encoding_3)
True
@hayfield mentioned that all tests for validation use bytes objects - I'd suggest adding some tests where we test strings.
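As a starting point for those tests, here is a minimal sketch of how the input combinations discussed in this thread could be generated: with and without an XML declaration, with and without leading whitespace, and as both str and bytes. The names `BODY`, `DECLARATIONS`, `LEADING`, `TYPES` and `build_cases` are made up for illustration and are not part of pyIATI or its test suite.

```python
import itertools

# Hypothetical minimal document body; real tests would use a full activity file.
BODY = '<iati-activities version="xx"></iati-activities>'

DECLARATIONS = {
    "no_declaration": "",
    "declaration_without_encoding": '<?xml version="1.0"?>\n',
    "declaration_with_encoding": '<?xml version="1.0" encoding="UTF-8"?>\n',
}
LEADING = {"no_leading_whitespace": "", "leading_whitespace": "\n  "}
TYPES = {"str": lambda text: text, "bytes": lambda text: text.encode("utf-8")}


def build_cases():
    """Return a dict mapping a readable case name to a str or bytes test input."""
    cases = {}
    combinations = itertools.product(
        DECLARATIONS.items(), LEADING.items(), TYPES.items()
    )
    for (decl_name, decl), (lead_name, lead), (type_name, convert) in combinations:
        name = "-".join((decl_name, lead_name, type_name))
        cases[name] = convert(lead + decl + BODY)
    return cases
```

The resulting dictionary could be fed to something like `pytest.mark.parametrize`, so that each combination is reported as its own test case and the str/bytes difference shows up explicitly in the results.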
Due to the re-ordering of Dataset-creation operations in #286, the error occurs earlier under that branch. As such, that may be a better place to start from (also because it's a change that looks to explicitly separate how bytes and str objects are treated).
The underlying error raised by lxml is: ValueError('Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.',) - will look to improve the visibility of this message.
Changing from bug to enhancement since lxml does not support this feature, and so this would be some additional pyIATI functionality to convert strs to bytes where required.
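A rough sketch of what such a conversion might look like, using only the standard library. The helper name `to_parser_input` and the regular expression are assumptions for illustration, not existing pyIATI code: the helper strips surrounding whitespace from a str and, only when the XML declaration names an encoding, encodes the text with that encoding so the parser receives bytes.

```python
import re

# Hypothetical pattern: an XML declaration at the very start that names an encoding.
_DECLARED_ENCODING = re.compile(
    r"""^<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def to_parser_input(xml):
    """Return `xml` in a form a parser that rejects str-with-encoding can accept.

    bytes are passed through unchanged. A str has surrounding whitespace
    stripped and is encoded to bytes only if its declaration names an encoding.
    """
    if isinstance(xml, bytes):
        return xml
    text = xml.strip()
    match = _DECLARED_ENCODING.match(text)
    if match is None:
        # No encoding declared, so the str can be handed over as-is.
        return text
    # Raises LookupError if the declared encoding is unknown to Python.
    return text.encode(match.group(1))
```

With this approach a str such as `'\n<?xml version="1.0" encoding="UTF-8"?>\n<a/>'` would reach the parser as bytes beginning with `<?xml`, while a str without an encoding declaration stays a str.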
NOTE: This is only a problem in Python 3 due to the changes to what a str is.
Linked to #24, datasets with an encoding declared do not validate as XML.
This example shows the problem using code from the master branch (v0.3.0):
vs. the same dataset with an encoding="UTF-8" declared:
This latter XML (pastebin link for convenience) does validate as XML using two online XML validation sites: codebeautify and truugo.