The converter is unable to handle whitespace between tags

marmoure commented 10 months ago

The Saxes parser is parsing whitespace between tags as text nodes. This will results in the converter trying to place those text nodes as children which results in invalid tree eg:

Failed to convert: non-accepted-child: "text" node can't be a child of "document" node
    at DocumentNode.add

Input XML:

<!DOCTYPE topic PUBLIC "-//OASIS//DTD LIGHTWEIGHT DITA Topic//EN" "lw-topic.dtd">
<topic>
<title>Some title</title>
  <body>
    <p>asdf</p>
  </body>
</topic>

marmoure commented 10 months ago

@adamretter This can be fixed by introducing a pre processing function that strips off all the whitespace between the xml tags eg:

<!DOCTYPE topic PUBLIC "-//OASIS//DTD LIGHTWEIGHT DITA Topic//EN" "lw-topic.dtd">
<topic>
<title>Some title</title>
  <body>
    <p>asdf</p>
  </body>
</topic>

This will be refactored into

<!DOCTYPE topic PUBLIC "-//OASIS//DTD LIGHTWEIGHT DITA Topic//EN" "lw-topic.dtd"><topic><title>Some title</title><body><p>asdf</p></body></topic>

PS: We did look into the parser option and we didn't find any option that disables treating whitespace between tags as text nodes.

adamretter commented 10 months ago

The Saxes parser is parsing whitespace between tags as text nodes

@marmoure Saxes parser may well be correct, it depends on what the whitespace consists of and how the specifications define that it should be handled, see: https://www.w3.org/TR/xml/#sec-white-space

adamretter commented 10 months ago

This will be refactored into

What you are showing there is a text serialization of the XML. That text serialization has had indenting removed this is similar to indent=no in XSLT and XQuery (see https://www.w3.org/TR/xslt-xquery-serialization-31/#xml-indent). It is very unlikely that we should need to write functions to change white space. If the existing parsers and serializers provided by the library are correct regards the specifications, then we should be able to rely on them - otherwise they have bugs and we need to send issue reports to the authors of those libraries.

adamretter commented 5 months ago

I have now looked into this in a bit more detail...

The Saxes parser is doing the right thing as far as parsing XML is concerned; I have also checked this against a Java Sax parser. The issue is that the Saxes parser is not DTD aware, and the DTD for LwDITA specifies further rules about where whitespace may appear or not.

The good news is that I think we should have the information about whether whitespace is significant or not in our AST classes as those are a Model of the DTD. If not we should add that information. That information will then allow us to make decisions on whether to accept or discard the whitespace output from Saxes.

My experiments are below...

1. Example in TypeScript to show how Saxes sees whitespace (doesn't know about DTD)

const parser = new saxes.SaxesParser({
    xmlns: true,
    fragment: false,
    position: true
});

let buf: string = ""

parser.on("opentag", ev => 
buf +="element{" + ev.name + "}{");

parser.on("text", ev => 
buf += "text(len=" + ev.length +"){" + ev + "}");

parser.on("closetag",
ev => buf += "}");

parser.on("error", err => {
console.error(err);
});

// Indented XML without Document Type Declaration
const xml = "<topic id='topicID'>\n    <title>some content</title>\n</topic>";

parser.write(xml);
parser.close();

console.log(buf);

The above produces this output:

element{topic}{text(len=5){
    }element{title}{text(len=12){some content}}text(len=1){
}}

2. Example in Java to show how Sax sees whitespace (with and without DTD)

import org.exist.util.StringInputSource;
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import java.io.IOException;

public class Adam1SaxTest {

  static SAXParserFactory FACTORY = SAXParserFactory.newInstance();
  static {
    FACTORY.setNamespaceAware(true);
    FACTORY.setValidating(true);
  }

  public static void main(final String[] args) throws ParserConfigurationException, SAXException, IOException {

    final String xml = "<topic id='topicID'>\n    <title>some content</title>\n</topic>";

    System.out.println("*** WITHOUT Document Type Declaration");
    System.out.println();
    System.out.println(parse(xml));

    System.out.println();
    System.out.println();

    final String doctypeDecl = "<!DOCTYPE topic PUBLIC \"-//OASIS//DTD LIGHTWEIGHT DITA Topic//EN\" \"lw-topic.dtd\">";

    System.out.println("*** WITH Document Type Declaration");
    System.out.println();
    System.out.println(parse(doctypeDecl + xml));
  }

  private static class MyHandler extends DefaultHandler implements ContentHandler {
    final StringBuilder buf = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
      buf.append("element{" + localName + "}{");
    }

    @Override
    public void characters(char[] ch, int start, int length) {
      String text = new String(ch, start, length);
      buf.append("text(len=" + length +"){" + text + "}");
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
      buf.append("}");
    }
  }

  private static String parse(final String xml) throws ParserConfigurationException, SAXException, IOException {
    SAXParser saxParser = FACTORY.newSAXParser();
    XMLReader xmlReader = saxParser.getXMLReader();

    MyHandler contentHandler = new MyHandler();
    xmlReader.setContentHandler(contentHandler);

    xmlReader.parse(new StringInputSource(xml));

    return contentHandler.buf.toString();
  }
}

The above produces this output:

*** WITHOUT Document Type Declaration

element{topic}{text(len=5){
    }element{title}{text(len=12){some content}}text(len=1){
}}

*** WITH Document Type Declaration

element{topic}{element{title}{text(len=12){some content}}}

Note that the output for *** WITHOUT Document Type Declaration from Java SAX is exactly the same as from TypeScript Saxes. Therefore Saxes is doing the right thing for basic parsing (i.e. without a DTD).

If we consider the difference between<topic> <title> where the DTD instructs the parser that the whitespace between the topic and title elements is non-significant, and <p> <b> where the DTD says the whitespace is significant.

The DTD definition of topic:

<!ELEMENT topic   (title, shortdesc?, prolog?, body?)  >

The DTD definition of p:


<!ELEMENT p             (%all-inline;)*        >

<!ENTITY % all-inline "#PCDATA|%ph;|image|xref|%data;"> <!ENTITY % ph "ph | b | i |u | sup | sub" > <!ENTITY % data "data">



The `#PCDDATA` (parsed character data) that is defined as allowed within a `<p>` element means that it may contain a text node. Clearly the definition for `p` is `(%all-inline;)*`, and the `*` character means that there can be any number and mix of the types defined in `all-inline`. This effectively means that `p` allows "mixed content" (i.e. elements and nodes). 

By comparison, the definition for `topic` is clear that there can be no `#PCDATA` as a child, only the elements `title`, `shortdesc`, `prolog`, and `body` are allowed.

evolvedbinary / lwdita

The converter is unable to handle whitespace between tags #129

1. Example in TypeScript to show how Saxes sees whitespace (doesn't know about DTD)

2. Example in Java to show how Sax sees whitespace (with and without DTD)