IATI / pyIATI

pyIATI - a developer's toolkit for IATI - Deprecated - No longer supported
MIT License

Datasets do not validate as XML when created with a string containing a specified encoding #285

Open dalepotter opened 6 years ago

dalepotter commented 6 years ago

Linked to #24: datasets created from a string with a declared encoding do not validate as XML.

This example shows the problem using code from the master branch (v0.3.0):

$ python
Python 3.6.0 (default, Dec 24 2016, 08:02:28) 
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import iati
>>> dataset_xml_declaration_no_encoding = iati.Dataset("""
... <?xml version="1.0"?>
... <iati-activities version="xx">
...   <iati-activity>
...     <iati-identifier></iati-identifier>
...     <reporting-org type="xx" ref="xx"><narrative>Organisation name</narrative></reporting-org>
...     <title>
...       <narrative>Xxxxxxx</narrative>
...     </title>
...     <description>
...       <narrative>Xxxxxxx</narrative>
...     </description>
...     <participating-org role="xx"></participating-org>
...     <activity-status code="xx"/>
...     <activity-date type="xx" iso-date="2013-11-27"/>
...     <activity-date type="xx" iso-date="2013-11-27">
...       <narrative>Xxxxxxx</narrative>
...     </activity-date>
...   </iati-activity>
... </iati-activities>
... """)
>>> iati.validator.is_xml(dataset_xml_declaration_no_encoding)
True

vs. the same dataset with encoding="UTF-8" declared:

>>> dataset_xml_declaration_with_encoding = iati.Dataset("""
... <?xml version="1.0" encoding="UTF-8"?>
... <iati-activities version="xx">
...   <iati-activity>
...     <iati-identifier></iati-identifier>
...     <reporting-org type="xx" ref="xx"><narrative>Organisation name</narrative></reporting-org>
...     <title>
...       <narrative>Xxxxxxx</narrative>
...     </title>
...     <description>
...       <narrative>Xxxxxxx</narrative>
...     </description>
...     <participating-org role="xx"></participating-org>
...     <activity-status code="xx"/>
...     <activity-date type="xx" iso-date="2013-11-27"/>
...     <activity-date type="xx" iso-date="2013-11-27">
...       <narrative>Xxxxxxx</narrative>
...     </activity-date>
...   </iati-activity>
... </iati-activities>
... """)
>>> iati.validator.is_xml(dataset_xml_declaration_with_encoding)
False

The latter XML (pastebin link for convenience) does validate as XML on two online XML validation sites: codebeautify and truugo.

hayfield commented 6 years ago

Noting that relevant tests are in test_data.py#TestDatasetWithEncoding

hayfield commented 6 years ago

An XML string that starts with an XML declaration must not have any whitespace before it, as both of these examples do.

https://github.com/IATI/pyIATI/blob/59588100911f4fd17a4012c0c2bc9632cc20efbd/iati/data.py#L87 strips some leading and trailing whitespace, though an explicit encoding may cause complications. There is currently only one test relating to leading whitespace, which doesn't seem fully comprehensive!

https://github.com/IATI/pyIATI/blob/59588100911f4fd17a4012c0c2bc9632cc20efbd/iati/tests/test_data.py#L46-L53
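
For illustration, the effect of leading whitespace can be reproduced without pyIATI. This is only a minimal sketch: it uses the standard library's xml.etree.ElementTree rather than lxml, so the exception types differ from what pyIATI would see, and is_well_formed is a hypothetical helper name, not part of pyIATI.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Hypothetical helper: True if the standard-library parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


DOC = '<?xml version="1.0"?>\n<iati-activities version="xx"/>\n'

# A newline before the XML declaration makes the document ill-formed.
with_leading_newline = "\n" + DOC

print(is_well_formed(with_leading_newline))           # rejected by the parser
print(is_well_formed(with_leading_newline.lstrip()))  # accepted once stripped
```

Trailing whitespace after the root element is harmless; only whitespace before the declaration matters here.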

hayfield commented 6 years ago

>>> dataset_xml_declaration_with_encoding_2 = iati.Dataset("""<?xml version="1.0"?>
...  <iati-activities version="xx">
...    <iati-activity>
...      <iati-identifier></iati-identifier>
...      <reporting-org type="xx" ref="xx"><narrative>Organisation name</narrative></reporting-org>
...      <title>
...        <narrative>Xxxxxxx</narrative>
...      </title>
...      <description>
...        <narrative>Xxxxxxx</narrative>
...      </description>
...      <participating-org role="xx"></participating-org>
...      <activity-status code="xx"/>
...      <activity-date type="xx" iso-date="2013-11-27"/>
...      <activity-date type="xx" iso-date="2013-11-27">
...        <narrative>Xxxxxxx</narrative>
...      </activity-date>
...    </iati-activity>
...  </iati-activities>
...  """)
>>> iati.validator.is_xml(dataset_xml_declaration_with_encoding_2)
True

As suspected, the provided string is not valid XML because it contains leading whitespace. The problem is therefore an explicit encoding in combination with leading whitespace (the automatic removal of which is deemed to be a feature of pyIATI).

I will update the title to better reflect this.

dalepotter commented 6 years ago

I think the wrong string was tested! With no leading whitespace the same results come back...

No whitespace and no encoding

>>> dataset_xml_declaration_with_encoding_2 = iati.Dataset("""<?xml version="1.0"?>
...   <iati-activities version="xx">
...     <iati-activity>
...       <iati-identifier></iati-identifier>
...       <reporting-org type="xx" ref="xx"><narrative>Organisation name</narrative></reporting-org>
...       <title>
...         <narrative>Xxxxxxx</narrative>
...       </title>
...       <description>
...         <narrative>Xxxxxxx</narrative>
...       </description>
...       <participating-org role="xx"></participating-org>
...       <activity-status code="xx"/>
...       <activity-date type="xx" iso-date="2013-11-27"/>
...       <activity-date type="xx" iso-date="2013-11-27">
...         <narrative>Xxxxxxx</narrative>
...       </activity-date>
...     </iati-activity>
...   </iati-activities>
...   """)
>>> iati.validator.is_xml(dataset_xml_declaration_with_encoding_2)
True

No whitespace and a UTF-8 encoding

>>> dataset_xml_declaration_with_encoding_3 = iati.Dataset("""<?xml version="1.0" encoding="UTF-8"?>
...   <iati-activities version="xx">
...     <iati-activity>
...       <iati-identifier></iati-identifier>
...       <reporting-org type="xx" ref="xx"><narrative>Organisation name</narrative></reporting-org>
...       <title>
...         <narrative>Xxxxxxx</narrative>
...       </title>
...       <description>
...         <narrative>Xxxxxxx</narrative>
...       </description>
...       <participating-org role="xx"></participating-org>
...       <activity-status code="xx"/>
...       <activity-date type="xx" iso-date="2013-11-27"/>
...       <activity-date type="xx" iso-date="2013-11-27">
...         <narrative>Xxxxxxx</narrative>
...       </activity-date>
...     </iati-activity>
...   </iati-activities>
...   """)
>>> iati.validator.is_xml(dataset_xml_declaration_with_encoding_3)
False

The error log tells us more...

>>> err_log = iati.validator.validate_is_xml(dataset_xml_declaration_with_encoding_3)
>>> len(err_log)
1
>>> err_log[0].name
'err-not-xml-not-string'
>>> err_log[0].info
"The value provided is a `<class 'str'>` rather than a `str`."

But... A workaround?!

However, when it is encoded to a bytes object all is well...

>>> dataset_xml_declaration_with_encoding_3 = iati.Dataset("""<?xml version="1.0" encoding="UTF-8"?>
...   <iati-activities version="xx">
...     <iati-activity>
...       <iati-identifier></iati-identifier>
...       <reporting-org type="xx" ref="xx"><narrative>Organisation name</narrative></reporting-org>
...       <title>
...         <narrative>Xxxxxxx</narrative>
...       </title>
...       <description>
...         <narrative>Xxxxxxx</narrative>
...       </description>
...       <participating-org role="xx"></participating-org>
...       <activity-status code="xx"/>
...       <activity-date type="xx" iso-date="2013-11-27"/>
...       <activity-date type="xx" iso-date="2013-11-27">
...         <narrative>Xxxxxxx</narrative>
...       </activity-date>
...     </iati-activity>
...   </iati-activities>
...   """.encode())
>>> iati.validator.is_xml(dataset_xml_declaration_with_encoding_3)
True

@hayfield mentioned that all tests for validation use bytes objects - I'd suggest adding some tests where we test strings.

hayfield commented 6 years ago

Due to the re-ordering of Dataset-creation operations in #286, the error occurs earlier under that branch. As such, that may be a better place to start from (also because it's a change that looks to explicitly separate how bytes and str objects are treated).

hayfield commented 6 years ago

The underlying error raised by lxml is: ValueError('Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.',) - will look to improve the visibility of this message.
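
One possible way to keep that message visible is to carry the exception's own type and text into whatever gets logged. The sketch below is an assumption about the general shape, not pyIATI's error-log code: describe_parse_failure and reject_declared_encoding are hypothetical names, and the latter is a stand-in that imitates the lxml behaviour quoted above using the standard library, so that the example runs without lxml installed.

```python
import xml.etree.ElementTree as ET


def describe_parse_failure(parse, xml):
    """Hypothetical helper: None if parse(xml) succeeds, else the parser's own message."""
    try:
        parse(xml)
    except (ValueError, SyntaxError) as err:
        # Keep the exception type and text rather than a generic "not XML" message.
        return "{}: {}".format(type(err).__name__, err)
    return None


def reject_declared_encoding(xml):
    """Stand-in for the lxml behaviour described above; this is NOT lxml itself."""
    if isinstance(xml, str):
        head = xml.lstrip().split("?>", 1)[0]
        if head.startswith("<?xml") and "encoding=" in head:
            raise ValueError(
                "Unicode strings with encoding declaration are not supported."
            )
    return ET.fromstring(xml)


DECLARED = '<?xml version="1.0" encoding="UTF-8"?><iati-activities version="xx"/>'

print(describe_parse_failure(reject_declared_encoding, DECLARED))
print(describe_parse_failure(reject_declared_encoding, DECLARED.encode("utf-8")))
```

The first call reports the ValueError text; the second, given bytes, reports no failure.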

hayfield commented 6 years ago

Changing from bug to enhancement since lxml does not support this feature, and so this would be some additional pyIATI functionality to convert strs to bytes where required.
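
A rough sketch of what such a conversion could look like, assuming the declared encoding (when present) should be used to encode the str. xml_to_bytes and its default_encoding parameter are hypothetical names for illustration, not existing pyIATI functions, and the check at the end uses the standard library's parser rather than lxml.

```python
import re
import xml.etree.ElementTree as ET

# Matches encoding="..." (or '...') inside an XML declaration at the very start.
_DECLARED_ENCODING = re.compile(
    r'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def xml_to_bytes(xml, default_encoding="utf-8"):
    """Hypothetical helper: return xml as bytes suitable for a bytes-only parser.

    bytes input is returned unchanged. str input has surrounding whitespace
    stripped and is encoded with the encoding named in its XML declaration,
    or with default_encoding if none is declared. An unknown encoding name
    raises LookupError.
    """
    if isinstance(xml, bytes):
        return xml
    text = xml.strip()
    match = _DECLARED_ENCODING.match(text)
    encoding = match.group(1) if match else default_encoding
    return text.encode(encoding)


source = '\n<?xml version="1.0" encoding="UTF-8"?>\n<narrative>caf\u00e9</narrative>\n'
as_bytes = xml_to_bytes(source)

# The bytes now agree with the declaration, so a parser can decode them.
print(ET.fromstring(as_bytes).text)
```

Encoding with the declared name keeps the bytes and the declaration consistent; encoding everything as UTF-8 regardless would be simpler but could contradict a non-UTF-8 declaration.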

hayfield commented 6 years ago

NOTE: This is only a problem in Python 3, due to the change in what a str is.
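
To illustrate the distinction: in Python 3 a str is already-decoded Unicode text, whereas an encoding declaration describes how bytes should be decoded. A small sketch, independent of pyIATI and lxml, showing that the same text yields different bytes depending on the encoding chosen:

```python
# A Python 3 str is a sequence of Unicode code points; \u00e9 is "e with acute".
text = "<a>\u00e9</a>"

# The same text becomes different byte sequences under different encodings.
as_utf8 = text.encode("utf-8")
as_latin1 = text.encode("iso-8859-1")

print(type(text).__name__, len(text))
print(type(as_utf8).__name__, len(as_utf8))
print(type(as_latin1).__name__, len(as_latin1))
print(as_utf8 == as_latin1)
```

Only once the text has been turned into bytes does the declared encoding have something to describe, which is why passing bytes avoids the problem.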