Daisydiff fails to process certain invalid HTML files

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?

I have a non-valid HTML document that crashes the diff (but NekoHTML can 
process it). Here is a very simplified version of it:

<html>
  <body>
    <a href="http://example.org"123></a>
  </body>
</html>

Those 123 at the end make DaisyDiff crash. If the extra text does not start 
with a number (e.g. the HTML5 standalon attribute 'required') then everything 
is OK.

What is the expected output? What do you see instead?

The expected output would probably be to ignore that invalid attribute.

What version of the product are you using? On what operating system?

daisydiff 1.2, OpenJDK 1.6.0_24 on OpenSuse, invoked from Groovy 1.8.6

Please provide any additional information below.

Here is the stack trace after the error (some lines are cut off by Groovy - I 
can provide those as well if required):

ERROR:  'An attribute whose value must be a QName had the value '123''
Caught: javax.xml.transform.TransformerException: java.lang.RuntimeException: 
An attribute whose value must be a QName had the value '123'
javax.xml.transform.TransformerException: java.lang.RuntimeException: An 
attribute whose value must be a QName had the value '123'
        at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerHandlerImpl.endDocument(TransformerHandlerImpl.java:241)
        at org.outerj.daisy.diff.helper.MergeCharacterEventsHandler.endDocument(Unknown Source)
        at org.outerj.daisy.diff.helper.NekoHtmlParser$RemoveNamespacesHandler.endDocument(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.endDocument(Unknown Source)
        at org.cyberneko.html.filters.DefaultFilter.endDocument(DefaultFilter.java:217)
        at org.cyberneko.html.HTMLTagBalancer.endDocument(HTMLTagBalancer.java:446)
        at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2060)
        at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:910)
        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499)
        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.outerj.daisy.diff.helper.NekoHtmlParser.parse(Unknown Source)
        at org.outerj.daisy.diff.helper.NekoHtmlParser$parse.call(Unknown Source)
        at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:42)
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:108)
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:120)
        at BooleanDiffer.cleanAndParse(quietDiff.groovy:35)
        at BooleanDiffer.this$2$cleanAndParse(quietDiff.groovy)
        at BooleanDiffer$this$2$cleanAndParse.callCurrent(Unknown Source)
        at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallCurrent(CallSiteArray.java:46)
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:133)
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:145)
        at BooleanDiffer.quietDiff(quietDiff.groovy:70)
        at BooleanDiffer$quietDiff.call(Unknown Source)
        at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:42)
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:108)
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:124)
        at quietDiff.run(quietDiff.groovy:115)
        at groovy.lang.GroovyShell.runScriptOrMainOrTestOrRunnable(GroovyShell.java:266)
        at groovy.lang.GroovyShell.run(GroovyShell.java:229)
        at groovy.lang.GroovyShell.run(GroovyShell.java:159)
        at groovy.ui.GroovyMain.processOnce(GroovyMain.java:550)
        at groovy.ui.GroovyMain.run(GroovyMain.java:337)
        at groovy.ui.GroovyMain.process(GroovyMain.java:323)
        at groovy.ui.GroovyMain.processArgs(GroovyMain.java:120)
        at groovy.ui.GroovyMain.main(GroovyMain.java:100)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.codehaus.groovy.tools.GroovyStarter.rootLoader(GroovyStarter.java:108)
        at org.codehaus.groovy.tools.GroovyStarter.main(GroovyStarter.java:130)
Caused by: javax.xml.transform.TransformerException: 
java.lang.RuntimeException: An attribute whose value must be a QName had the 
value '123'
        at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:716)
        at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:313)
        at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerHandlerImpl.endDocument(TransformerHandlerImpl.java:238)
        ... 42 more
Caused by: java.lang.RuntimeException: An attribute whose value must be a QName 
had the value '123'
        at com.sun.org.apache.xalan.internal.xsltc.runtime.BasisLibrary.runTimeError(BasisLibrary.java:1523)
        at com.sun.org.apache.xalan.internal.xsltc.runtime.BasisLibrary.runTimeError(BasisLibrary.java:1527)
        at com.sun.org.apache.xalan.internal.xsltc.runtime.BasisLibrary.checkAttribQName(BasisLibrary.java:1361)
        at GregorSamsa.template$dot$2()
        at GregorSamsa.applyTemplates()
        at GregorSamsa.template$dot$1()
        at GregorSamsa.applyTemplates()
        at GregorSamsa.template$dot$0()
        at GregorSamsa.applyTemplates()
        at GregorSamsa.transform()
        at com.sun.org.apache.xalan.internal.xsltc.runtime.AbstractTranslet.transform(AbstractTranslet.java:603)
        at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:709)
        ... 44 more

Original issue reported on code.google.com by gre...@gmail.com on 24 Aug 2012 at 10:23

GoogleCodeExporter commented 8 years ago

As can be seen by the source code and the stack trace you posted, daisydiff 
uses Neko to convert HTML to valid XML , and then parses the result using SAX.

So if an exception is present it means that Neko failed and could NOT convert 
the result. Why do you say that NekoHTML can process it?

It seems to me that this might be a bug in Neko and not DaisyDiff.

Original comment by kkape...@gmail.com on 24 Aug 2012 at 10:31

GoogleCodeExporter commented 8 years ago

That was my thought as well which is why I took nekohtml.jar and 
xercesImpl-2.8.0.jar from my daisydiff version and ran the above HTML through 
nekohtml (using the sample Java program from 
http://nekohtml.sourceforge.net/usage.html). Neko came through the example 
without problems and printed the correct DOM tree.

I've zipped up the nekohtml test and uploaded it: 
https://dl.dropbox.com/u/1898992/nekoTest.zip

I can do more tests but I'd need some guidance as XML processing on Java isn't 
my forte.

Original comment by gre...@gmail.com on 24 Aug 2012 at 10:39

GoogleCodeExporter commented 8 years ago

I modified the neko test to also include a call to get attributes (uploaded to 
same URL) and it still doesn't crash.

Original comment by gre...@gmail.com on 24 Aug 2012 at 11:08

GoogleCodeExporter commented 8 years ago

So here's my theory (bear in mind I am no expert on NekoHTML or daisydiff):

NekoHTML parses the HTML into a DOM which is stored as Java objects. There is 
no checking if the attribute name is a valid XML name as the document currently 
isn't in XML form anyway (and I guess everything is valid in tagsoup HTML :)). 
Then daisydiff copies this attribute into XML along with all the others and the 
XML implementation rejects the final document as invalid.

If this is indeed the case, daisydiff should check all attributes whether their 
names are valid in XML and either drop them if they aren't or perhaps prefix 
them with an underscore or something to make them valid XML.

Original comment by gre...@gmail.com on 25 Aug 2012 at 9:45

GoogleCodeExporter commented 8 years ago

After more investigation, I found the solution in NekoHTML: it offers filters 
that can alter the processing of HTML. One in particular, called Purifier, 
ensures XML well-formedness. Using this solves the issue, I'm including the 
patch.

What it does is along the same lines as my proposed solution: it renames the 
invalid attribute name to start with valid characters.

Original comment by gre...@gmail.com on 28 Aug 2012 at 10:11

Attachments:

purify_html_before_parsing.diff

GoogleCodeExporter commented 8 years ago

Great find!

However what are the side effect on this? Does this filter break anything else?
Have you seen the unit tests contained in DaisyDiff? Do they still pass?

Original comment by kkape...@gmail.com on 28 Aug 2012 at 11:57

GoogleCodeExporter commented 8 years ago

This fix should not be applied :(

I've found out that this Purifier has some problems 
(http://sourceforge.net/tracker/?func=detail&atid=952178&aid=3497694&group_id=19
5122) and this is what has been causing the other issue I have reported.

Sorry if I have wasted anybody's time.

Original comment by gre...@gmail.com on 31 Aug 2012 at 2:02

GoogleCodeExporter commented 8 years ago

It is OK. That is the main problem with DaisyDiff. You fix something in one 
place, and something else breaks :-0

Original comment by kkape...@gmail.com on 31 Aug 2012 at 3:02

GoogleCodeExporter commented 8 years ago

Well it isn't daisydiff's problem this time. Hmm, seeing that the fix posted 
for the NekoHTML bug is simply subclassing the Purifier subclass, maybe we 
could do that in daisydiff? I'm just wondering why that person's fix wasn't 
accepted into NekoHTML itself, seeing as it was proposed several months ago..

Original comment by gre...@gmail.com on 31 Aug 2012 at 4:08

centraldesktop / daisydiff

Daisydiff fails to process certain invalid HTML files #46