Open GoogleCodeExporter opened 8 years ago
As can be seen by the source code and the stack trace you posted, daisydiff
uses Neko to convert HTML to valid XML, and then parses the result using SAX.
So if an exception is present it means that Neko failed and could NOT convert
the result. Why do you say that NekoHTML can process it?
It seems to me that this might be a bug in Neko and not DaisyDiff.
Original comment by kkape...@gmail.com
on 24 Aug 2012 at 10:31
That was my thought as well, which is why I took nekohtml.jar and
xercesImpl-2.8.0.jar from my daisydiff version and ran the above HTML through
nekohtml (using the sample Java program from
http://nekohtml.sourceforge.net/usage.html). Neko came through the example
without problems and printed the correct DOM tree.
I've zipped up the nekohtml test and uploaded it:
https://dl.dropbox.com/u/1898992/nekoTest.zip
I can do more tests but I'd need some guidance as XML processing in Java isn't
my forte.
Original comment by gre...@gmail.com
on 24 Aug 2012 at 10:39
I modified the neko test to also include a call to get attributes (uploaded to
same URL) and it still doesn't crash.
Original comment by gre...@gmail.com
on 24 Aug 2012 at 11:08
So here's my theory (bear in mind I am no expert on NekoHTML or daisydiff):
NekoHTML parses the HTML into a DOM which is stored as Java objects. There is
no checking if the attribute name is a valid XML name as the document currently
isn't in XML form anyway (and I guess everything is valid in tagsoup HTML :)).
Then daisydiff copies this attribute into XML along with all the others and the
XML implementation rejects the final document as invalid.
If this is indeed the case, daisydiff should check whether each attribute name
is valid in XML and either drop the invalid ones or perhaps prefix them with an
underscore or something to make them valid XML.
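For illustration, the rename idea could look roughly like the sketch below. The class and method names are made up for this example and are not part of daisydiff or NekoHTML, and the name check is a simplified ASCII-only approximation of the XML Name rules (real XML also allows many non-ASCII letters, which this sketch would rewrite).

```java
public class AttributeNameSanitizer {

    // ASCII-only approximation of an XML NameStartChar (assumption: the real
    // production also accepts many non-ASCII characters).
    static boolean isNameStart(char c) {
        return c == '_' || c == ':'
                || (c >= 'A' && c <= 'Z')
                || (c >= 'a' && c <= 'z');
    }

    // ASCII-only approximation of an XML NameChar.
    static boolean isNameChar(char c) {
        return isNameStart(c) || c == '-' || c == '.' || (c >= '0' && c <= '9');
    }

    // True if the name passes the simplified check above.
    static boolean isValidName(String name) {
        if (name == null || name.isEmpty() || !isNameStart(name.charAt(0))) {
            return false;
        }
        for (int i = 1; i < name.length(); i++) {
            if (!isNameChar(name.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Leaves valid names untouched; otherwise prefixes an underscore and
    // replaces every disallowed character with an underscore.
    static String sanitize(String name) {
        if (isValidName(name)) {
            return name;
        }
        StringBuilder sb = new StringBuilder("_");
        if (name != null) {
            for (int i = 0; i < name.length(); i++) {
                char c = name.charAt(i);
                sb.append(isNameChar(c) ? c : '_');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] samples = { "href", "1abc", "a\"b", "" };
        for (String s : samples) {
            System.out.println("[" + s + "] -> [" + sanitize(s) + "]");
        }
    }
}
```

Note that two different invalid names can end up with the same sanitized name, so a real fix would have to decide what to do with such collisions (or just drop the offending attributes).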
Original comment by gre...@gmail.com
on 25 Aug 2012 at 9:45
After more investigation, I found the solution in NekoHTML: it offers filters
that can alter the processing of HTML. One in particular, called Purifier,
ensures XML well-formedness. Using this solves the issue; I'm including the
patch.
What it does is along the same lines as my proposed solution: it renames the
invalid attribute name to start with valid characters.
Original comment by gre...@gmail.com
on 28 Aug 2012 at 10:11
Great find!
However, what are the side effects of this? Does this filter break anything else?
Have you seen the unit tests contained in DaisyDiff? Do they still pass?
Original comment by kkape...@gmail.com
on 28 Aug 2012 at 11:57
This fix should not be applied :(
I've found out that this Purifier has some problems
(http://sourceforge.net/tracker/?func=detail&atid=952178&aid=3497694&group_id=195122)
and this is what has been causing the other issue I have reported.
Sorry if I have wasted anybody's time.
Original comment by gre...@gmail.com
on 31 Aug 2012 at 2:02
It is OK. That is the main problem with DaisyDiff. You fix something in one
place, and something else breaks :-0
Original comment by kkape...@gmail.com
on 31 Aug 2012 at 3:02
Well, it isn't daisydiff's problem this time. Hmm, seeing that the fix posted
for the NekoHTML bug simply subclasses Purifier, maybe we
could do that in daisydiff? I'm just wondering why that person's fix wasn't
accepted into NekoHTML itself, seeing as it was proposed several months ago.
Original comment by gre...@gmail.com
on 31 Aug 2012 at 4:08
Original issue reported on code.google.com by
gre...@gmail.com
on 24 Aug 2012 at 10:23