xml.etree.ElementTree.ParseError: not well-formed (invalid token)

simonw commented 3 years ago

Got this error today:

(evernote-to-sqlite) /tmp % evernote-to-sqlite enex evernote.db simonwillison\'s\ notebook.enex 
Importing from ENEX  [######------------------------------]   17%
Traceback (most recent call last):
  File "/Users/simon/.local/bin/evernote-to-sqlite", line 8, in <module>
    sys.exit(cli())
  File "/Users/simon/.local/pipx/venvs/evernote-to-sqlite/lib/python3.9/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/Users/simon/.local/pipx/venvs/evernote-to-sqlite/lib/python3.9/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/Users/simon/.local/pipx/venvs/evernote-to-sqlite/lib/python3.9/site-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/simon/.local/pipx/venvs/evernote-to-sqlite/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/simon/.local/pipx/venvs/evernote-to-sqlite/lib/python3.9/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/Users/simon/.local/pipx/venvs/evernote-to-sqlite/lib/python3.9/site-packages/evernote_to_sqlite/cli.py", line 31, in enex
    save_note(db, note)
  File "/Users/simon/.local/pipx/venvs/evernote-to-sqlite/lib/python3.9/site-packages/evernote_to_sqlite/utils.py", line 36, in save_note
    content = ET.tostring(ET.fromstring(content_xml)).decode("utf-8")
  File "/usr/local/Cellar/python@3.9/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/xml/etree/ElementTree.py", line 1347, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 2, column 132

simonw commented 3 years ago

The debugger showed me that it broke on a string that looked like this:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">
<en-note>
  <h1 title="Q3 2018 Reflection & Development">
    <span title=Q3 2018 Reflection & Development">
      Q3 2018 Reflection & Development
    </span>
  </h1>
  ...

Yeah that is not valid XML!
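
For reference, the same ParseError can be reproduced with a much smaller document. This is a minimal sketch, not the tool's code: the `parses` helper and the sample strings are made up for illustration, and it only demonstrates the bare `&` problem, not the broken attribute quoting.

```python
import xml.etree.ElementTree as ET

# A bare "&" is not allowed in XML text or attribute values;
# it has to be written as "&amp;".
bad = '<en-note><h1 title="Reflection & Development">x</h1></en-note>'
good = bad.replace("&", "&amp;")


def parses(xml):
    """Hypothetical helper: True if ElementTree accepts the string."""
    try:
        ET.fromstring(xml)
        return True
    except ET.ParseError:
        return False


print(parses(bad))
print(parses(good))
```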

simonw commented 3 years ago

https://github.com/dogsheep/evernote-to-sqlite/blob/36a466f142e5bad52719851c2fbda0c05cd35b99/evernote_to_sqlite/utils.py#L34-L42

Not sure why I was round-tripping content_xml through ET.fromstring() and ET.tostring() like that - I will try not doing that.

simonw commented 3 years ago

It looks like I was using the round-trip to strip the <?xml version="1.0" encoding="UTF-8" standalone="no"?> and <!DOCTYPE prefixes.
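
If the only goal is to drop those two prefixes, one option that avoids parsing altogether is a regular expression anchored at the start of the string. This is a rough sketch under that assumption; `strip_xml_prefixes` and `_prefix_re` are hypothetical names, and the pattern assumes the DOCTYPE has no internal subset (no `>` inside it).

```python
import re

# Optional XML declaration followed by an optional DOCTYPE, both at the
# very start of the string, with any surrounding whitespace.
_prefix_re = re.compile(
    r"^\s*(?:<\?xml[^>]*\?>\s*)?(?:<!DOCTYPE[^>]*>\s*)?",
    re.IGNORECASE,
)


def strip_xml_prefixes(content_xml):
    """Hypothetical helper: remove the leading declaration and DOCTYPE."""
    return _prefix_re.sub("", content_xml, count=1)


sample = (
    '<?xml version="1.0" encoding="UTF-8" standalone="no"?>\n'
    '<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">\n'
    "<en-note>hi & bye</en-note>"
)
print(strip_xml_prefixes(sample))
```

Because nothing is parsed, this works even when the body is not well-formed XML.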

simonw commented 3 years ago

I tried this ampersand fix: https://regex101.com/r/ojU2H9/1


import re

# https://regex101.com/r/ojU2H9/1
_invalid_ampersand_re = re.compile(r'&(?![a-z0-9]+;)')

def fix_bad_xml(xml):
    # Escape any '&' that does not start a named entity such as '&amp;'
    # (numeric references like '&#38;' are not recognised by this pattern)
    return _invalid_ampersand_re.sub('&amp;', xml)

Even with that I'm still getting total garbage in the <en-note> content - it's just HTML, not even trying to be XML.
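
Since some notes will never parse, one possible approach is to treat the round-trip as best effort and fall back to the raw markup. The sketch below is an assumption about how that could look, not what the tool does: `normalize_content` and `escape_bare_ampersands` are hypothetical helpers, and the regex is a variant of the one above that also leaves numeric references alone.

```python
import re
import xml.etree.ElementTree as ET

# Variant of the ampersand regex that also skips numeric references
# such as &#38; and &#x26; (an extension, not the original pattern).
_bare_ampersand_re = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)"
)


def escape_bare_ampersands(xml):
    return _bare_ampersand_re.sub("&amp;", xml)


def normalize_content(content_xml):
    """Hypothetical helper: return (content, parsed_ok)."""
    # Try the string as-is first, then with bare ampersands escaped.
    for candidate in (content_xml, escape_bare_ampersands(content_xml)):
        try:
            root = ET.fromstring(candidate)
        except ET.ParseError:
            continue
        return ET.tostring(root).decode("utf-8"), True
    # Nothing parsed: keep the raw markup rather than failing the import.
    return content_xml, False
```

Returning a flag alongside the content would let the caller record which notes were stored unparsed.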

dogsheep / evernote-to-sqlite

xml.etree.ElementTree.ParseError: not well-formed (invalid token) #13