Problem parsing XML files (launch.xml or lexers)

GoogleCodeExporter commented 8 years ago

I had a lot of trouble adding a new lexer XML file and a Launch XML file,
due to the lack of feedback from the GUI. After tracking the problem down in
the sources, I found that the SAX parser (src/syntax/synxml.py) fails with
the message "<unknown>:1:1: not well-formed (invalid token)".

It seems the UTF-8 conversion gives the SAX parser trouble. Removing the line:

   txt = txt.decode('utf-8') # xml is utf-8 by specification

makes parsing work again.

I don't know what's going on: my file was edited using UTF-8 encoding
(vim, under Linux; "echo $LANG" gives "fr_FR.UTF-8"). It is attached to this
issue.

I see two main issues here:

  1. Why can't it parse this XML file?
  2. Maybe more important, the GUI should report problems; it's very
     frustrating when Editra remains silent... :)
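
A minimal sketch of the kind of feedback meant in point 2, assuming the parse call can be wrapped so that a SAX error becomes a message the GUI could display. The function name `parse_with_report` is hypothetical and not part of Editra; only the standard `xml.sax` module is used.

```python
import xml.sax


def parse_with_report(data, handler):
    """Parse XML bytes; return None on success or a readable error string."""
    try:
        xml.sax.parseString(data, handler)
    except xml.sax.SAXParseException as err:
        # The exception carries the position where the parser gave up.
        return "XML error at line %d, column %d: %s" % (
            err.getLineNumber(), err.getColumnNumber(), err.getMessage())
    return None


# A well-formed document and one with an unclosed <handler> element.
good = parse_with_report(b"<launch></launch>", xml.sax.ContentHandler())
bad = parse_with_report(b"<launch><handler></launch>",
                        xml.sax.ContentHandler())
```

The returned string could then be shown in a dialog or written to a log instead of being dropped silently.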

Thanks
Seb

Original issue reported on code.google.com by sebastie...@gmail.com on 1 Feb 2010 at 2:22

Attachments:

launch.xml

GoogleCodeExporter commented 8 years ago

Hi,

This is a very new feature. It was added in the previous version and is not
complete; it is just a basic working implementation right now. Ultimately I
plan to have a GUI for adding new handlers, as the XML is really just the
protocol.

I am not sure at the moment why it's failing to parse the XML file on your
machine; I will have to try it later when I get a chance. But I would look at
it with a hex editor and see if there are some hidden characters at the start
of the file. Possibly there is a BOM that needs to be stripped prior to
feeding it to the parser.
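
The BOM check suggested above can also be done in code rather than by eye. This is only a sketch under the assumption that the file content is available as raw bytes; `strip_utf8_bom` is a hypothetical helper name, not an existing Editra function.

```python
import codecs


def strip_utf8_bom(data):
    """Return data without a leading UTF-8 byte order mark, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# The BOM is the three bytes EF BB BF at the very start of the file.
with_bom = codecs.BOM_UTF8 + b"<launch/>"
without_bom = b"<launch/>"
```

In a hex dump, a UTF-8 BOM would show up as `ef bb bf` before the first `<` of the document.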

Also, your file associates itself with ID_LANG_PYTHON, for which there is
already a handler, so it would be easier to just edit the settings in the
Launch configuration dialog, unless you wanted to associate the handler with
a different file type.

Original comment by CodyPrec...@gmail.com on 1 Feb 2010 at 2:45

Changed state: Accepted

GoogleCodeExporter commented 8 years ago

Hi Cody,

I dumped one of my XML files with 'hexdump', as you suggested. Attached are
jal.edxml and the resulting dump, jal.hexa. I don't see any BOM or other
weird characters.

As a workaround (which doesn't explain what's really going on), instead of
using sax.parseString(string,handler), you can use sax.parse(filename,handler)
in synxml.py. I tried it and it seems to work, so I really wonder why this
UTF-8 decoding is needed...
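
For illustration, here is a small self-contained sketch of the workaround: the file name is handed to `xml.sax.parse`, so the parser reads the raw bytes itself and no manual decoding step is involved. The `NameCollector` handler and the sample document are made up for the example and are not taken from synxml.py.

```python
import os
import tempfile
import xml.sax


class NameCollector(xml.sax.ContentHandler):
    """Hypothetical handler that records element names as they open."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def parse_file(path):
    handler = NameCollector()
    # The parser opens the file and reads it as bytes.
    xml.sax.parse(path, handler)
    return handler.names


# Write a small UTF-8 encoded sample containing a non-ASCII character.
sample = (u'<?xml version="1.0" encoding="utf-8"?>'
          u'<launch><handler name="\u00e9"/></launch>')
fd, path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "wb") as f:
    f.write(sample.encode("utf-8"))

names = parse_file(path)
os.remove(path)
```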

xml.__version__ says "41660". Just in case...

Cheers,
Seb

Original comment by sebastie...@gmail.com on 5 Feb 2010 at 7:18

Attachments:

GoogleCodeExporter commented 8 years ago

Hi,

The decoding is needed because Editra expects all internal text data to be
of Unicode type and not of string type.

Luckily sax.parse does appear to render Unicode tokens, so switching to it
works just fine.

Thanks,

Cody

Original comment by CodyPrec...@gmail.com on 5 Feb 2010 at 9:53

Changed state: Fixed

DeltaEscher / editra

Problem parsing XML files (launch.xml or lexers) #469