This example from the CMIP5 Q atom feed causes an error when converted to ESDOC_ENCODING_XML format. It looks like a Unicode issue.
I haven't checked it thoroughly, so it could be a problem with the source XML.
To reproduce, download the file with the wget command in the code comment, then run the code below:
#
# Reproduce unicode bug when reading a METAFOR XML document and writing ESDOC XML
#
import pyesdoc
# wget -O sim_eg.xml http://q.cmip5.ceda.ac.uk/cmip5/simulation/501b71b4-7783-11e0-99f9-00163e9152a5/1/
INFILE = './sim_eg.xml'
OUTFILE = './sim_eg.esdoc.xml'
def test1():
    doc = pyesdoc.read(INFILE, encoding=pyesdoc.METAFOR_CIM_XML_ENCODING)
    pyesdoc.write(doc, path=OUTFILE, encoding=pyesdoc.ESDOC_ENCODING_XML)
if __name__ == '__main__':
    test1()
This gives:
Traceback (most recent call last):
  File "unicode_bug.py", line 15, in <module>
    test1()
  File "unicode_bug.py", line 12, in test1
    pyesdoc.write(doc, path=OUTFILE, encoding=pyesdoc.ESDOC_ENCODING_XML)
  File "/home/spascoe/git/esdoc/esdoc-py-client/src/pyesdoc/io.py", line 69, in write
    f.write(serialization.encode(doc, encoding))
  File "/home/spascoe/git/esdoc/esdoc-py-client/src/pyesdoc/serialization.py", line 114, in encode
    return _serializers[encoding].encode(doc)
  File "/home/spascoe/git/esdoc/esdoc-py-client/src/pyesdoc/utils/serializer_xml.py", line 185, in encode
    return ET.tostring(as_xml, _UNICODE)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1126, in tostring
    ElementTree(element).write(file, encoding, method=method)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 820, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  ...
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1073, in _escape_cdata
    return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 108: ordinal not in range(128)
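A possible reading of the traceback, not verified against the pyesdoc source: in Python 2, calling .encode() on a byte str makes the interpreter decode it as ASCII first, so a UnicodeDecodeError raised from text.encode(...) suggests that some element text reached ElementTree as undecoded bytes. Byte 0xe2 is the lead byte of many three-byte UTF-8 sequences (curly quotes, for example), which would fit UTF-8 text in the source XML. The sketch below reproduces the same ASCII decode failure on a made-up sample and shows that decoding explicitly as UTF-8 before serializing avoids it. RAW, to_text, ascii_decode_error and serialize are hypothetical names for illustration only; none of them is part of pyesdoc.

```python
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET

# Hypothetical sample: UTF-8 bytes containing a right single quotation
# mark (U+2019, encoded as e2 80 99). Not taken from the real document.
RAW = b"model\xe2\x80\x99s description"


def to_text(value, encoding="utf-8"):
    """Return value as unicode text, decoding byte strings explicitly."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def ascii_decode_error(value):
    """Return the message raised when decoding value as ASCII, or None."""
    try:
        value.decode("ascii")
    except UnicodeDecodeError as err:
        return str(err)
    return None


def serialize(value):
    """Build a one-element tree from value and serialize it as UTF-8."""
    elem = ET.Element("description")
    elem.text = to_text(value)  # ElementTree receives unicode text only
    return ET.tostring(elem, encoding="utf-8")


if __name__ == "__main__":
    # The same kind of failure as in the traceback, triggered explicitly.
    print(ascii_decode_error(RAW))
    # Decoding as UTF-8 first lets serialization succeed.
    print(serialize(RAW))
```

If this reading is right, the fix would be to decode text to unicode at the point where the METAFOR XML is read, rather than at serialization time. That is an assumption to check against the actual document contents.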