PaulWoooong / luke

Automatically exported from code.google.com/p/luke

XMLExporter generating invalid XML, when special characters are present in a TermVector field #36

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Create a Lucene document with an apostrophe ( ' ) in the data of a field 
that stores term vectors (e.g.  "HTML4 doesn't like this &this too")
2. Export to XML:
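      // fsDir, indexPath, tmpfile and docid come from the calling code
      IndexReader reader = IndexReader.open(fsDir, false);
      XMLExporter exporter = new XMLExporter(reader, indexPath);
      File xmlout = new File(tmpfile);
      OutputStream os = new FileOutputStream(xmlout);
      // restrict the export to the range starting at docid
      Ranges ranges = new Ranges();
      int start = docid;
      int end   = start + 1;
      ranges.set(start, end);
      exporter.export(os, false, true, "index", ranges);
      // release the output file and the index reader
      os.close();
      reader.close();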
3. Open the exported file in an XML viewer that follows the HTML4 strict spec (try IE)

What is the expected output? What do you see instead?
The file should open and display as parsed XML. Instead, the viewer reports an invalid XML error.
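
The expected behaviour can also be checked without a browser. The following is a minimal sketch, assuming only the JDK's built-in JAXP parser; the class name `WellFormedCheck`, the element name `t` and the attribute name `text` are hypothetical and are not taken from Luke's export format. It shows that an unescaped `&` inside an attribute value is rejected, while the escaped form parses.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of Luke.
final class WellFormedCheck {

    /** Returns true if the stream parses as well-formed XML, false otherwise. */
    static boolean isWellFormed(InputStream in)
            throws IOException, ParserConfigurationException {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(in);
            return true;
        } catch (SAXException e) {
            return false; // the parser hit a well-formedness error
        }
    }

    /** Convenience overload for checking an in-memory string. */
    static boolean isWellFormed(String xml)
            throws IOException, ParserConfigurationException {
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        return isWellFormed(new ByteArrayInputStream(bytes));
    }

    public static void main(String[] args) throws Exception {
        // Unescaped '&' inside an attribute value: not well-formed.
        System.out.println(isWellFormed("<t text=\"like this &this too\"/>"));
        // Same value with the ampersand escaped: well-formed.
        System.out.println(isWellFormed("<t text=\"like this &amp;this too\"/>"));
    }
}
```

To check a real export, the file written in step 2 could be passed in as a `FileInputStream`. Note that a conforming XML parser accepts `&apos;`, so this check only catches unescaped characters, not the browser-specific `&apos;` problem.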

What version of the product are you using? On what operating system?
Luke 1.0.1 on Windows 7.

Please provide any additional information below.
Andrzej fixed the majority of this problem in Luke 0.9.9 (for special characters 
inside field data), but a small fix is still needed in 
org.getopt.luke.XMLExporter so that element attribute values are escaped as 
well (patch attached). 

This patch also provides a minor correction to Util.xmlEscape()

&apos; is not a valid part of the HTML4 strict spec.  So the XML escapes 
should generate output which is valid and can be rendered by any XML 
interpreter.  Some of the browser-based XML viewers choke on &apos; when it 
is inside element attributes.  &#39; will take care of it.
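
As a rough illustration of the kind of escaping described above (not the attached patch, and not Luke's actual Util.xmlEscape()), the sketch below escapes the five XML special characters and writes the apostrophe as the numeric reference `&#39;` rather than `&apos;`. The class name `XmlEscapeSketch`, the element name `t` and the attribute name `text` are hypothetical.

```java
// Hypothetical helper for illustration; not Luke's Util.xmlEscape().
final class XmlEscapeSketch {

    /**
     * Escapes the XML special characters so the result can be used both as
     * element text and inside a quoted attribute value.
     */
    static String escape(String in) {
        if (in == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder(in.length() + 16);
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                // Numeric reference: &apos; is not defined in HTML4.
                case '\'': sb.append("&#39;");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String raw = "HTML4 doesn't like this &this too";
        // prints: <t text="HTML4 doesn&#39;t like this &amp;this too"/>
        System.out.println("<t text=\"" + escape(raw) + "\"/>");
    }
}
```

Using one routine for both field data and attribute values keeps the two code paths consistent, which is the gap this report describes.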

Original issue reported on code.google.com by Craig.St...@gmail.com on 16 Apr 2011 at 3:56

GoogleCodeExporter commented 9 years ago
This has been fixed in rev. 55 (branch-3x) and rev. 56 (trunk). Thank you!

Original comment by sig...@gmail.com on 27 Apr 2011 at 10:14