libexpat fails on utf-8 data

nutsiepully commented 11 years ago

I downloaded the XML dump data from Wikipedia as part of our experiment and was trying to parse it using the LibExpat module.

It threw a parse error for the XML file (which is a small part of the entire dump).

To ensure that the file was valid, I verified it by parsing it in Ruby (Nokogiri) and Python (ElementTree), and it parsed successfully in both.

The errors I received were:

ERROR: "ErrorException(\"Error parsing document : 0\"), no element found, 0x0000396b, 17, 1924369"
 in xp_parse at /Users/pulkitb/.julia/LibExpat/src/LibExpat.jl:274
(This is for a 15K-line XML fragment taken from the dump.)

ERROR: "ErrorException(\"Error parsing document : 0\"), unclosed token, 0x00000093, 3, 9857"
 in xp_parse at /Users/pulkitb/.julia/LibExpat/src/LibExpat.jl:274
(This is for a single element from the XML file above, namely the element that was triggering the error.)

Since GitHub doesn't allow attaching XML files, I have copied the XML that produces the second error to http://pastebin.com/A1puALyw

From what I could tell, the error is thrown directly by libexpat's parse function, so I'm not sure if it can be fixed here.

vtjnash commented 11 years ago

The following is the minimal failure case:

<text>–</text>

Note that the character is not a - (hyphen) but the Unicode en dash (U+2013), which is encoded in UTF-8 as the three bytes \xe2\x80\x93.
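To see why a single multi-byte character can matter, here is a small Python sketch (Python because ElementTree was already used above as a cross-check; it does not use LibExpat.jl). It shows that the minimal case is 14 characters but 16 bytes in UTF-8, and that Python's own expat binding accepts the full byte string yet rejects it when only the first 14 bytes are passed. That a character count was used where a byte count was needed is only a hypothesis about this bug; the thread does not confirm it. The helper name expat_error is made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat

# The minimal failure case from this thread: an en dash (U+2013) inside <text>.
doc = "<text>\u2013</text>"
raw = doc.encode("utf-8")

n_chars = len(doc)  # number of Unicode characters
n_bytes = len(raw)  # number of UTF-8 bytes (the en dash takes three)


def expat_error(data: bytes):
    """Hypothetical helper: return expat's error message for data, or None if it parses."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(data, True)  # True marks this as the final chunk
    except expat.ExpatError as exc:
        return str(exc)
    return None


# The complete byte string parses without error.
full_error = expat_error(raw)

# ASSUMPTION (not confirmed in the thread): the buffer length was given in
# characters rather than bytes, so the parser saw a truncated document.
truncated_error = expat_error(raw[:n_chars])

# ElementTree, which is built on expat, reads the en dash back intact.
element_text = ET.fromstring(raw).text

print(n_chars, n_bytes)
print(full_error)
print(truncated_error)
```

If the hypothesis holds, the usual remedy is to pass the byte length of the encoded buffer to the parser instead of the character count.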

amitmurthy commented 11 years ago

Closed by commit https://github.com/amitmurthy/LibExpat.jl/commit/90ba9f93cab8232f198625f523c39ab365d92b21

nutsiepully commented 11 years ago

awesome! thanks guys.
