Closed simonw closed 3 years ago
The debugger showed me that it broke on a string that looked like this:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">
<en-note>
<h1 title="Q3 2018 Reflection & Development">
<span title=Q3 2018 Reflection & Development">
Q3 2018 Reflection & Development
</span>
</h1>
...
Yeah that is not valid XML!
Not sure why I was round-tripping the content_xml like that - I will try not doing that.
It looks like I was using the round-trip to drop the <?xml version="1.0" encoding="UTF-8" standalone="no"?> and <!DOCTYPE prefixes.
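If the only goal is to drop those two prefixes, a plain string operation avoids parsing the body at all, so invalid markup inside <en-note> cannot break it. This is a minimal sketch, not what the project actually does: the helper name strip_xml_prefixes and its regex are assumptions for illustration, and it assumes the DOCTYPE has no internal subset (no '>' inside it).

```python
import re

# Hypothetical helper (not from the original code): removes a leading XML
# declaration and/or DOCTYPE without parsing the rest of the document.
_prefix_re = re.compile(r'^\s*(?:<\?xml[^>]*\?>\s*)?(?:<!DOCTYPE[^>]*>\s*)?')


def strip_xml_prefixes(content):
    # count=1 because the pattern is anchored at the start of the string
    return _prefix_re.sub('', content, count=1)


sample = (
    '<?xml version="1.0" encoding="UTF-8" standalone="no"?>\n'
    '<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">\n'
    '<en-note>a & b</en-note>'
)
stripped = strip_xml_prefixes(sample)
```

Input with neither prefix passes through unchanged, because both groups in the pattern are optional.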
I tried this ampersand fix: https://regex101.com/r/ojU2H9/1
import re

# https://regex101.com/r/ojU2H9/1
# Matches any '&' that is not the start of a named entity such as '&amp;'
_invalid_ampersand_re = re.compile(r'&(?![a-z0-9]+;)')

def fix_bad_xml(xml):
    # More fixes for things like '&' not as part of an entity
    return _invalid_ampersand_re.sub('&amp;', xml)
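One gap in a lowercase-only pattern like this is that numeric references such as &#38; or &#x26; would be escaped a second time. The sketch below is a broader variant, offered as an assumption rather than the fix that was actually used: the name escape_bare_ampersands and its pattern are hypothetical. It also checks that the sample title from the note parses once the bare ampersand is escaped.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical variant: leave named (&amp;), decimal (&#38;) and hex (&#x26;)
# references alone, and escape every other '&'.
_bare_ampersand_re = re.compile(
    r'&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)'
)


def escape_bare_ampersands(xml):
    return _bare_ampersand_re.sub('&amp;', xml)


sample = '<h1 title="Q3 2018 Reflection & Development">R&amp;D &#38; more</h1>'
fixed = escape_bare_ampersands(sample)

# Raises xml.etree.ElementTree.ParseError if the result is still not well-formed
element = ET.fromstring(fixed)
```

This only addresses stray ampersands. HTML-only named entities such as &nbsp; are left untouched, and an XML parser without the DTD will still reject them as undefined.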
Even with that I'm still getting total garbage in the <en-note> content - it's just HTML, not even trying to be XML.
Got this error today: