TSSlade / google-refine

Automatically exported from code.google.com/p/google-refine

multiple rows per column from 1 xml element #61

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
Input xml file:
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/"    
xmlns:dcterms="http://purl.org/dc/terms/"
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xmlns:europeana="http://www.europeana.eu/schemas/ese/"
   xmlns="http://www.europeana.eu/schemas/ese/">
   <europeana:record>
      <dc:title>De militaire roem</dc:title>
      <dcterms:alternative>La Gloire militaire</dcterms:alternative>
      <dc:creator>Vanaise, Gustave</dc:creator>
      <dcterms:created>19e eeuw (laatste kwart)</dcterms:created>
      <dc:description>Vooraan links op het schilderij zit de Faam als
personificatie van de
         Militaire Roem. In haar linkerhand houdt ze een trompet vast, die
op haar knie steunt, in
         de andere hand het wapenschild van Brabant met de klauwende leeuw.
Links naast de Faam ligt
         een adelboek opengeslagen. Het wapenschild van Leuven en een reeks
niet-ingekleurde
         schilden voor de helden van de strijd zijn zichtbaar. Achteraan
deze figuur is een veldslag
         aan de gang. De hertog van Brabant, gezeten op een paard en met de
banier van Brabant in de
         rechterhand trekt met zijn ridders ten strijde. Aan de horizon
stijgen rookpluimen op boven
         een stad.</dc:description>
 <dc:description>Inscriptie G. Van Aise 1889</dc:description>
....

For the first dc:description element the software generates 4 rows:
1)
Vooraan links op het schilderij zit de Faam als personificatie van de
Militaire Roem. In haar linkerhand houdt ze een trompet vast, die op haar
knie steunt, in
2) 
de andere hand het wapenschild van Brabant met de klauwende leeuw. Links
naast de Faam ligt een adelboek opengeslagen. Het wapenschild van Leuven en
een reeks niet-ingekleurde
3)
schilden voor de helden van de strijd zijn zichtbaar. Achteraan deze figuur
is een veldslag aan de gang. De hertog van Brabant, gezeten op een paard en
met de banier van Brabant in de
4)
rechterhand trekt met zijn ridders ten strijde. Aan de horizon stijgen
rookpluimen op boven een stad.

The fifth row comes from the second dc:description element.

I can merge the cells, but then all 5 get merged, since I do not have a
clean separator.

Is there a way to keep the 4 cells as 1?

Original issue reported on code.google.com by p...@proxml.be on 26 May 2010 at 3:12

GoogleCodeExporter commented 8 years ago
I'll take this.  I'm creating some unit tests for the XmlImporter.

Original comment by iainsproat on 26 May 2010 at 5:20

GoogleCodeExporter commented 8 years ago
I reckon it's to do with the line breaks in the xml text node.

Original comment by iainsproat on 26 May 2010 at 6:27

GoogleCodeExporter commented 8 years ago
Probably, but I agree it shouldn't do it. Whitespace in XML (generally 
speaking) is not structurally significant.

Original comment by stefano.mazzocchi@gmail.com on 26 May 2010 at 6:42

GoogleCodeExporter commented 8 years ago
I've written a unit test which *verifies* that line breaks are not the issue.

I now think the problem is that not all the europeana elements have the same 
structure, i.e. one europeana element has 10 nested elements, while another 
has only 5 nested elements.

r862 provides a unit test for this case.

Original comment by iainsproat on 26 May 2010 at 7:24

GoogleCodeExporter commented 8 years ago

Original comment by iainsproat on 14 Oct 2010 at 10:16

GoogleCodeExporter commented 8 years ago
Returning to this issue, it seems that the following combination of line 
breaks and whitespace can break the XML importer: "\n \n" (U+000A U+0020 U+000A 
in Unicode).

This seems to be an issue with XMLStreamReader rather than Refine's code. The 
following test fails because the text "Author1,\n \nThe" is split into two 
tokens rather than one:

    @Test
    public void testXmlStreamReaderWithLineBreak(){
        try {
            // The text node contains the problematic "\n \n" sequence
            ByteArrayInputStream inputStream;
            inputStream = new ByteArrayInputStream( "<?xml version=\"1.0\"?><library>Author1,\n \nThe</library>".getBytes("UTF-8"));
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(inputStream);
            reader.next();//START_ELEMENT
            Assert.assertEquals(reader.getLocalName(),"library");
            reader.next();//first CHARACTERS event
            // Fails as reported: getText() only returns the text of the current event
            Assert.assertEquals(reader.getText(), "Author1,\n \nThe");
        } catch (UnsupportedEncodingException e) {
            Assert.fail(e.getMessage());
        } catch (XMLStreamException e) {
            Assert.fail(e.getMessage());
        } catch (FactoryConfigurationError e) {
            Assert.fail(e.getMessage());
        }
    }

Original comment by iainsproat on 25 Nov 2010 at 5:06

GoogleCodeExporter commented 8 years ago
But the following test passes (it uses reader.getElementText() rather than 
reader.getText()):

    @Test
    public void testXmlStreamReaderWithLineBreak(){
        try {
            ByteArrayInputStream inputStream;
            inputStream = new ByteArrayInputStream( "<?xml version=\"1.0\"?><library>Author1,\n \nThe</library>".getBytes("UTF-8"));
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(inputStream);
            reader.next();//START_ELEMENT
            Assert.assertEquals(reader.getLocalName(),"library");
            Assert.assertEquals(reader.getElementText(), "Author1,\n \nThe");
        } catch (UnsupportedEncodingException e) {
            Assert.fail(e.getMessage());
        } catch (XMLStreamException e) {
            Assert.fail(e.getMessage());
        } catch (FactoryConfigurationError e) {
            Assert.fail(e.getMessage());
        }
    }

I'll try to adapt the code to work around this.

Original comment by iainsproat on 25 Nov 2010 at 5:09
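
For readers following along, one possible shape for such a workaround is sketched below. It is not the code that went into Refine. The class `TextAccumulator` and its methods `readText` and `textOfRoot` are hypothetical names introduced only for this illustration. The idea is to append every consecutive text event to a buffer instead of trusting a single getText() call, so the result does not depend on how the parser chooses to split the text.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, for illustration only; not part of Refine.
class TextAccumulator {

    // Precondition: the reader is positioned on a START_ELEMENT.
    // Concatenates the text events that follow, stopping at the next
    // element boundary (a child START_ELEMENT or any END_ELEMENT).
    static String readText(XMLStreamReader reader) throws XMLStreamException {
        StringBuilder text = new StringBuilder();
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.CHARACTERS
                    || event == XMLStreamConstants.CDATA
                    || event == XMLStreamConstants.SPACE) {
                text.append(reader.getText());
            } else if (event == XMLStreamConstants.START_ELEMENT
                    || event == XMLStreamConstants.END_ELEMENT) {
                break;
            }
            // Comments and processing instructions are skipped.
        }
        return text.toString();
    }

    // Convenience wrapper: returns the leading text of the root element.
    static String textOfRoot(String xml) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        return readText(reader);
                    }
                }
                return "";
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?><library>Author1,\n \nThe</library>";
        // Prints true when the whole text node was recovered in one string.
        System.out.println("Author1,\n \nThe".equals(textOfRoot(xml)));
    }
}
```

Unlike getElementText(), this sketch stops at a child element instead of throwing, which may or may not be what an importer wants.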

GoogleCodeExporter commented 8 years ago
Rev 1939 should fix this.  I've turned on text coalescing in the XML parser.  
As an added bonus, I also turned on XML entity replacement so it doesn't have 
to be done after the import.

Original comment by tfmorris on 27 Nov 2010 at 10:09
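
As a rough illustration of the approach described in the comment above, the two standard StAX properties involved are XMLInputFactory.IS_COALESCING and XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES. The sketch below is not the actual change from Rev 1939. The class `CoalescingExample` and the method `firstTextEvent` are hypothetical names used only here. It reads the first text event of a document with coalescing switched on or off.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class, for illustration only; not part of Refine.
class CoalescingExample {

    // Returns the text of the first CHARACTERS event, or null if there is none.
    static String firstTextEvent(String xml, boolean coalescing) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Ask the parser to merge adjacent character data into one event.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.valueOf(coalescing));
        // Ask the parser to replace entity references with their text.
        factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.TRUE);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.CHARACTERS) {
                        return reader.getText();
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?><library>Author1,\n \nThe</library>";
        // With coalescing on, the first text event should carry the whole node.
        System.out.println("Author1,\n \nThe".equals(firstTextEvent(xml, true)));
    }
}
```

With coalescing off, how the text is split across events is left to the parser implementation, so the sketch makes no claim about that case.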

GoogleCodeExporter commented 8 years ago
Awesome, thanks Tom.  I wasn't even aware of the IS_COALESCING property.

Original comment by iainsproat on 27 Nov 2010 at 10:47

GoogleCodeExporter commented 8 years ago

Original comment by tfmorris on 9 Jun 2011 at 7:58