bryonjacob / wikimodel

Automatically exported from code.google.com/p/wikimodel

XHTML parser does not consider quotes (") as a special symbol #37

GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
Input:

<p>This is &lt;a href=&quot;url&quot;&gt;some HTML&lt;/a&gt;.</p>

Expected events:

beginDocument
beginParagraph
onWord: [This]
onSpace
onWord: [is]
onSpace
onSpecialSymbol: [<]
onWord: [a]
onSpace
onWord: [href]
onSpecialSymbol: [=]
onSpecialSymbol: ["]
onWord: [url]
onSpecialSymbol: ["]
onSpecialSymbol: [>]
onWord: [some]
onSpace
onWord: [HTML]
onSpecialSymbol: [<]
onSpecialSymbol: [/]
onWord: [a]
onSpecialSymbol: [>]
onSpecialSymbol: [.]
endParagraph
endDocument

Got:

beginDocument
beginParagraph
onWord: [This]
onSpace
onWord: [is]
onSpace
onSpecialSymbol: [<]
onWord: [a]
onSpace
onWord: [href]
onSpecialSymbol: [=]
onWord: ["url"]
onSpecialSymbol: [>]
onWord: [some]
onSpace
onWord: [HTML]
onSpecialSymbol: [</]
onWord: [a]
onSpecialSymbol: [>.]
endParagraph
endDocument
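
For reference, the expected event stream above can be reproduced by a scanner that emits exactly one onSpecialSymbol event per non-alphanumeric, non-whitespace character, so `"`, `/` and `.` are never merged into a neighbouring word or symbol. The sketch below is only an illustration of that rule; `SymbolTokenizer` and `tokenize` are hypothetical names and this is not WikiModel's actual scanner.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the WikiModel implementation. */
public class SymbolTokenizer {

    /** Splits already-decoded text into event strings in the format used in this report. */
    public static List<String> tokenize(String text) {
        List<String> events = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                // A run of letters and digits becomes a single word event.
                int start = i;
                while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                    i++;
                }
                events.add("onWord: [" + text.substring(start, i) + "]");
            } else if (Character.isWhitespace(c)) {
                // A run of whitespace collapses into one space event.
                while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                    i++;
                }
                events.add("onSpace");
            } else {
                // Every other character, including the double quote, is its own event.
                events.add("onSpecialSymbol: [" + c + "]");
                i++;
            }
        }
        return events;
    }

    public static void main(String[] args) {
        for (String event : tokenize("This is <a href=\"url\">some HTML</a>.")) {
            System.out.println(event);
        }
    }
}
```

Run on the decoded paragraph text, this yields the 21 events listed under "Expected events" between beginParagraph and endParagraph.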

Original issue reported on code.google.com by vmas...@gmail.com on 17 Aug 2008 at 2:16

GoogleCodeExporter commented 8 years ago
The real problem is that the XHTML parser doesn't recognize XHTML entities.

Original comment by vmas...@gmail.com on 21 Aug 2008 at 5:12

GoogleCodeExporter commented 8 years ago
Here's how we do it in XWiki land:

        // Parse the XHTML using an XML Parser and Wrap the XML elements in XMLBlock(s).
        // For each XML element's text, run it through the main Parser.

        XMLBlockConverterHandler handler = createContentHandler(parameters);

        try {
            XMLReader xr = XMLReaderFactory.createXMLReader();
            xr.setContentHandler(handler);
            xr.setErrorHandler(handler);
            xr.setEntityResolver(this.entityResolver);

            // Since XML can only have a single root node and since we want to allow users to put
            // content such as the following, we need to wrap the content in a root node:
            // <tag1>
            // ..
            // </tag1>
            // <tag2>
            // </tag2>
            String normalizedContent =
                "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\"\n"
                    + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n"
                    + "<root>" + content + "</root>";

            xr.parse(new InputSource(new StringReader(normalizedContent)));
        } catch (Exception e) {
            throw new MacroExecutionException("Failed to parse content as XML [" + content + "]", e);
        }

Note that we use a special entity resolver so that parsing works even when there is
no internet connection (or simply to speed things up). For that we bundle the 3 entity
files (xhtml-lat1.ent, xhtml-special.ent and xhtml-symbol.ent) plus the DTD file
(xhtml1-strict.dtd).
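
A minimal sketch of what such a resolver could look like, assuming the four files are bundled at the root of the classpath. The class name `ClasspathEntityResolver`, the `fileName` helper and the resource layout are assumptions made for illustration; this is not the XWiki implementation.

```java
import java.io.InputStream;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

/** Hypothetical sketch: serves the XHTML DTD and entity files from the classpath. */
public class ClasspathEntityResolver implements EntityResolver {

    // The files named in the comment above; their classpath location is an assumption.
    private static final Set<String> KNOWN_FILES = new HashSet<>(Arrays.asList(
        "xhtml1-strict.dtd", "xhtml-lat1.ent", "xhtml-special.ent", "xhtml-symbol.ent"));

    /** Returns the part of the system id after the last '/', or null if there is no id. */
    static String fileName(String systemId) {
        if (systemId == null) {
            return null;
        }
        return systemId.substring(systemId.lastIndexOf('/') + 1);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String name = fileName(systemId);
        if (name == null || !KNOWN_FILES.contains(name)) {
            // Returning null tells the parser to fall back to its default resolution.
            return null;
        }
        InputStream in = ClasspathEntityResolver.class.getResourceAsStream("/" + name);
        if (in == null) {
            // The file is not bundled; let the parser handle it.
            return null;
        }
        InputSource source = new InputSource(in);
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }
}
```

An instance would be passed to `xr.setEntityResolver(...)` as in the snippet above, so the parser reads the local copies instead of fetching them from www.w3.org.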

Original comment by mas...@gmail.com on 22 Aug 2008 at 7:10

GoogleCodeExporter commented 8 years ago
See
http://svn.xwiki.org/svnroot/xwiki/platform/core/trunk/xwiki-xml/src/main/java/org/xwiki/xml/LocalEntityResolver.java
for the XWiki entity resolver implementation.

Original comment by mas...@gmail.com on 22 Aug 2008 at 7:22

GoogleCodeExporter commented 8 years ago

Original comment by vmas...@gmail.com on 27 Aug 2008 at 1:41

GoogleCodeExporter commented 8 years ago
Fixed. This wasn't a problem of entity resolving.
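
One observation that fits with this, although the thread does not say what the actual fix was: `&lt;`, `&gt;`, `&amp;`, `&quot;` and `&apos;` are predefined in XML itself, so a standard SAX parser decodes them without any DTD or entity resolver. The following sketch uses only the JDK's SAX API; the class name `PredefinedEntityDemo` and the `textOf` helper are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical demo: collects the character data a SAX parser reports for a fragment. */
public class PredefinedEntityDemo {

    public static String textOf(String xml) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Predefined entities arrive here already decoded.
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Failed to parse [" + xml + "]", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Prints: This is <a href="url">some HTML</a>.
        System.out.println(textOf(
            "<p>This is &lt;a href=&quot;url&quot;&gt;some HTML&lt;/a&gt;.</p>"));
    }
}
```

Named XHTML entities outside those five (for example `&nbsp;`) do need the DTD and entity files discussed earlier in the thread.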

Original comment by vmas...@gmail.com on 12 Sep 2008 at 7:15