ES-DOC / pyesdoc

Python client to the ES-DOC eco-system.

Unicode error when writing to ESDOC_ENCODING_XML format #1

Closed stephenpascoe closed 10 years ago

stephenpascoe commented 10 years ago

This example from the CMIP5 Q atom feed causes an error when converted to ESDOC_ENCODING_XML format. It looks like a unicode issue.

I haven't checked it thoroughly so it could be a problem with the source XML.

Download the file using the wget command in the code comment, then run the code below:

#
# Reproduce unicode bug when reading a METAFOR XML document and writing ESDOC XML
#

import pyesdoc

#  wget -O sim_eg.xml http://q.cmip5.ceda.ac.uk/cmip5/simulation/501b71b4-7783-11e0-99f9-00163e9152a5/1/
INFILE = './sim_eg.xml'
OUTFILE = './sim_eg.esdoc.xml'

def test1():
    doc = pyesdoc.read(INFILE, encoding=pyesdoc.METAFOR_CIM_XML_ENCODING)
    pyesdoc.write(doc, path=OUTFILE, encoding=pyesdoc.ESDOC_ENCODING_XML)

if __name__ == '__main__':
    test1()

This gives:

Traceback (most recent call last):
  File "unicode_bug.py", line 15, in <module>
    test1()
  File "unicode_bug.py", line 12, in test1
    pyesdoc.write(doc, path=OUTFILE, encoding=pyesdoc.ESDOC_ENCODING_XML)
  File "/home/spascoe/git/esdoc/esdoc-py-client/src/pyesdoc/io.py", line 69, in write
    f.write(serialization.encode(doc, encoding))
  File "/home/spascoe/git/esdoc/esdoc-py-client/src/pyesdoc/serialization.py", line 114, in encode
    return _serializers[encoding].encode(doc)
  File "/home/spascoe/git/esdoc/esdoc-py-client/src/pyesdoc/utils/serializer_xml.py", line 185, in encode
    return ET.tostring(as_xml, _UNICODE)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1126, in tostring
    ElementTree(element).write(file, encoding, method=method)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 820, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  ...
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1073, in _escape_cdata
    return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 108: ordinal not in range(128)
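
A note on reading this traceback: the failing call is text.encode(...), yet the exception is a UnicodeDecodeError. In Python 2 this happens when text is a byte string (str) rather than a unicode object, because .encode() first decodes the bytes implicitly with the ascii codec. Byte 0xe2 is the lead byte of three-byte UTF-8 sequences, which include typographic punctuation such as U+2019. The sketch below reproduces only that decode step; the sample string and the helper first_non_ascii are hypothetical and not part of pyesdoc, and the character actually present in the source document has not been checked.

```python
# -*- coding: utf-8 -*-
# Minimal sketch of the failure mechanism (runs on Python 2.7 and Python 3).
# Assumption: the METAFOR document contained UTF-8 text that reached
# ElementTree as raw bytes instead of as a unicode object.


def first_non_ascii(raw):
    """Return (position, byte value) of the first non-ASCII byte, or None."""
    for pos, byte in enumerate(bytearray(raw)):
        if byte > 0x7F:
            return pos, byte
    return None


# Hypothetical sample: U+2019 (right single quotation mark) encodes to
# the bytes e2 80 99 in UTF-8.
sample = u"model\u2019s output".encode("utf-8")

# This is the implicit step Python 2 performs before str.encode().
try:
    sample.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(first_non_ascii(sample))
print(error)
```

If this is what happened, decoding such values to unicode (for example with raw.decode("utf-8")) before they reach the serializer would avoid the implicit ascii decode. Whether the referenced fix took that approach is not stated in this thread.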

stephenpascoe commented 10 years ago

Confirmed fixed for me in 909594f3f234b298978dc03a2efb50ef0f7619c6.