Closed by GoogleCodeExporter 9 years ago
I'll take this. I'm creating some unit tests for the XmlImporter.
Original comment by iainsproat
on 26 May 2010 at 5:20
I reckon it's to do with the line breaks in the xml text node.
Original comment by iainsproat
on 26 May 2010 at 6:27
Probably, but I agree it shouldn't do it. Whitespace in XML (generally
speaking) is not structurally significant.
Original comment by stefano.mazzocchi@gmail.com
on 26 May 2010 at 6:42
I've written a unit test which *verifies* that line breaks are not the issue.
I now think the problem is that not all the Europeana elements have the same
structure, i.e. one Europeana element has 10 nested elements, while another has
only 5 nested elements.
r862 provides a unit test for this case.
Original comment by iainsproat
on 26 May 2010 at 7:24
Original comment by iainsproat
on 14 Oct 2010 at 10:16
Returning to this issue, it seems that the following combination of line
breaks and whitespace can break the XML importer: "\n \n" (U+000A U+0020 U+000A
in Unicode).
This seems to be an issue with the XMLStreamReader rather than Refine's code.
The following test fails because the text "Author1,\n \nThe" is split into two
tokens rather than one:
@Test
public void testXmlStreamReaderWithLineBreak() {
    try {
        ByteArrayInputStream inputStream;
        inputStream = new ByteArrayInputStream("<?xml version=\"1.0\"?><library>Author1,\n \nThe</library>".getBytes("UTF-8"));
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(inputStream);
        reader.next(); // START_ELEMENT
        Assert.assertEquals(reader.getLocalName(), "library");
        reader.next(); // first text event; may hold only part of the text
        Assert.assertEquals(reader.getText(), "Author1,\n \nThe");
    } catch (UnsupportedEncodingException e) {
        Assert.fail(e.getMessage());
    } catch (XMLStreamException e) {
        Assert.fail(e.getMessage());
    } catch (FactoryConfigurationError e) {
        Assert.fail(e.getMessage());
    }
}
Original comment by iainsproat
on 25 Nov 2010 at 5:06
But the following test passes (it uses reader.getElementText() rather than
reader.getText()):
@Test
public void testXmlStreamReaderWithLineBreak() {
    try {
        ByteArrayInputStream inputStream;
        inputStream = new ByteArrayInputStream("<?xml version=\"1.0\"?><library>Author1,\n \nThe</library>".getBytes("UTF-8"));
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(inputStream);
        reader.next(); // START_ELEMENT
        Assert.assertEquals(reader.getLocalName(), "library");
        // getElementText() returns the element's complete text content
        Assert.assertEquals(reader.getElementText(), "Author1,\n \nThe");
    } catch (UnsupportedEncodingException e) {
        Assert.fail(e.getMessage());
    } catch (XMLStreamException e) {
        Assert.fail(e.getMessage());
    } catch (FactoryConfigurationError e) {
        Assert.fail(e.getMessage());
    }
}
I'll try to adapt the code to work around this.
Original comment by iainsproat
on 25 Nov 2010 at 5:09
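One way such a workaround could look is to concatenate consecutive text events by hand instead of relying on a single getText() call. This is only a rough sketch, not the code that went into Refine: the class TextAccumulatorSketch and the helpers readAllText and libraryText are hypothetical names made up for illustration, and only the standard javax.xml.stream API is used.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch: accumulate text that the parser delivers in several events.
final class TextAccumulatorSketch {

    // Appends the text of every consecutive CHARACTERS, CDATA or SPACE event
    // and stops at the first event of any other type (e.g. END_ELEMENT).
    static String readAllText(XMLStreamReader reader) throws XMLStreamException {
        StringBuilder text = new StringBuilder();
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.CHARACTERS
                    || event == XMLStreamConstants.CDATA
                    || event == XMLStreamConstants.SPACE) {
                text.append(reader.getText());
            } else {
                break;
            }
        }
        return text.toString();
    }

    // Convenience wrapper: returns the leading text of the document's root element.
    static String libraryText(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            reader.next(); // START_ELEMENT
            return readAllText(reader);
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?><library>Author1,\n \nThe</library>";
        System.out.println(libraryText(xml));
    }
}
```

The loop gives the same text whether the parser reports it as one event or as several, so it does not depend on how a particular implementation chooses to split it.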
Rev 1939 should fix this. I've turned on text coalescing in the XML parser.
As an added bonus, I also turned on XML entity replacement so it doesn't have
to be done after the import.
Original comment by tfmorris
on 27 Nov 2010 at 10:09
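For anyone following along, turning on coalescing and entity replacement with the standard StAX factory looks roughly like the sketch below. It is not the actual change from Rev 1939, and the class name CoalescingSketch and the helper firstText are hypothetical; the two property constants are the standard ones on XMLInputFactory.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch: a coalescing reader reports adjacent character data as one event.
final class CoalescingSketch {

    // Returns the text of the first text event inside the root element.
    static String firstText(String xml) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Merge adjacent character data into a single CHARACTERS event.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            // Replace internal entity references with their text during parsing.
            factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.TRUE);

            XMLStreamReader reader = factory.createXMLStreamReader(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            reader.next(); // START_ELEMENT
            reader.next(); // CHARACTERS, holding the whole text run when coalescing
            return reader.getText();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?><library>Author1,\n \nThe</library>";
        System.out.println(firstText(xml));
    }
}
```

With these properties set on the factory, the reader.getText() assertion from the failing test earlier in this thread should see the complete string in a single event.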
Awesome, thanks Tom. I wasn't even aware of the IS_COALESCING property.
Original comment by iainsproat
on 27 Nov 2010 at 10:47
Original comment by tfmorris
on 9 Jun 2011 at 7:58
Original issue reported on code.google.com by
p...@proxml.be
on 26 May 2010 at 3:12