Closed GoogleCodeExporter closed 9 years ago
Do you have a test HTML document that you can attach here as a file and where
this can be reproduced?
Original comment by azizatif
on 6 May 2009 at 2:34
It looks like this is a bug in HtmlAgilityPack. I tested with the Google home
page, which contains a form with some input fields, using the attached
IronPython script. The result of my interactive test was:
IronPython 2.0 (2.0.0.0) on .NET 2.0.50727.3074
Type "help", "copyright", "credits" or "license" for more information.
>>> import clr
>>> clr.AddReference('HtmlAgilityPack')
>>> from HtmlAgilityPack import HtmlDocument
>>> from System.Net import WebClient
>>> doc = HtmlDocument()
>>> doc.LoadHtml(WebClient().DownloadString('http://www.google.com/'))
>>> root = doc.DocumentNode
>>> print 'FORM tag count = ', root.SelectNodes('//form').Count
FORM tag count = 1
>>> print 'INPUT tag count = ', root.SelectNodes('//input').Count
INPUT tag count = 8
>>> form = root.SelectSingleNode('//form')
>>> print 'FORM tag child count', form.ChildNodes.Count
FORM tag child count 0
The problem is that the ChildNodes property of a form returns an empty
collection! As a result, Fizzler fails to find anything within the descendants
or immediate children of a form.
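A minimal sketch of why an empty ChildNodes collection breaks descendant
selectors. The Node class and the select helper are hypothetical stand-ins, not
the HtmlAgilityPack or Fizzler API, and the tree shape (inputs ending up as
siblings of an empty form) is an assumption inferred from the counts above:

```python
class Node(object):
    """Hypothetical stand-in for a parsed HTML node."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = list(children or [])


def descendants(node):
    """Yield every node below `node`, depth first."""
    for child in node.children:
        yield child
        for nested in descendants(child):
            yield nested


def select(node, name):
    """Return all descendants of `node` with the given tag name."""
    return [n for n in descendants(node) if n.name == name]


# Assumed tree: the form has no children and the inputs sit next to it.
form = Node('form')
body = Node('body', [form, Node('input'), Node('input')])

print(len(select(body, 'input')))  # a document-wide search still finds the inputs
print(len(select(form, 'input')))  # a search scoped to the form finds nothing
```

Any selector that walks down from the form, such as "form input" or
"form > input", depends on the second kind of search and so comes back empty.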
It looks like this issue is already logged with HtmlAgilityPack (but
unfortunately with no resolution):
http://htmlagilitypack.codeplex.com/WorkItem/View.aspx?WorkItemId=21782
Original comment by azizatif
on 6 May 2009 at 3:03
Original comment by azizatif
on 6 May 2009 at 3:04
There are alternatives to HtmlAgilityPack available. How much work would it be
to drop HtmlAgilityPack and use something else as our default?
We should add a unit test for this problem if we do swap.
Original comment by info%colinramsay.co.uk@gtempaccount.com
on 6 May 2009 at 3:08
> drop HtmlAgilityPack
I don't suggest dropping it. Just leave it in there as it is, but perhaps stop
using it as the default if a more robust implementation is available.
> How much work would it be to use something else as our default?
It shouldn't be a whole lot, as long as the alternative supports a reasonable
API providing access to the attributes, children and siblings of a node.
Original comment by azizatif
on 6 May 2009 at 3:18
> drop HtmlAgilityPack and use something else as our default?
Now tracked separately as issue #25.
Original comment by azizatif
on 6 May 2009 at 3:23
Original comment by azizatif
on 6 May 2009 at 3:24
Guys, this is not an HTML agility pack comment. Check back the
http://htmlagilitypack.codeplex.com/WorkItem/View.aspx?WorkItemId=21782&ProjectName=htmlagilitypack
page.
Original comment by simon_mo...@hotmail.com
on 18 May 2009 at 6:29
Thanks Simon. We'll take a look at this, as HtmlAgilityPack actually worked
just fine apart from this item. For now we have created a new SgmlReader
wrapper and are using that as our default. I could see us receiving further bug
reports because of this behaviour, so we might just stick with the SgmlReader
as a default, but I'll re-open this issue for now.
Original comment by info%colinramsay.co.uk@gtempaccount.com
on 18 May 2009 at 8:29
Simon, thanks for your input on this issue. What would you recommend for the
value of HtmlElementFlag for FORM? By default, it seems to be CanOverlap OR
Empty. I tried also turning on the Closed flag and that made it work. That is,
with CanOverlap OR Closed OR Empty, one sees INPUT elements appear within
descendants of FORM:
IronPython 2.0 (2.0.0.0) on .NET 2.0.50727.3074
Type "help", "copyright", "credits" or "license" for more information.
>>> import clr
>>> clr.AddReference('HtmlAgilityPack')
>>> from HtmlAgilityPack import *
>>> print HtmlNode.ElementsFlags['form']
10
>>> HtmlNode.ElementsFlags['form'] |= HtmlElementFlag.Closed
>>> print HtmlNode.ElementsFlags['form']
14
>>> from System.Net import WebClient
>>> doc = HtmlDocument()
>>> doc.LoadHtml(WebClient().DownloadString('http://www.google.com/'))
>>> root = doc.DocumentNode
>>> print 'FORM tag count = ', root.SelectNodes('//form').Count
FORM tag count = 1
>>> print 'INPUT tag count = ', root.SelectNodes('//input').Count
INPUT tag count = 8
>>> form = root.SelectSingleNode('//form')
>>> print 'FORM tag child count', form.ChildNodes.Count
FORM tag child count 1
>>> def dump(node, level = 0):
...     print ' ' * level, node.Name
...     for child in node.ChildNodes:
...         dump(child, level + 1)
...
>>> dump(form)
form
table
tr
td
#text
td
input
input
input
br
input
input
td
font
#text
a
#text
br
#text
a
#text
br
#text
a
#text
tr
td
font
span
#text
input
label
#text
input
label
#text
input
label
#text
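The two numbers printed in the session above (10, then 14) are consistent with
a bit-flag enum. A minimal sketch of that arithmetic follows; the numeric
values are assumptions inferred from the printed output and from the flag names
mentioned in this thread, and the flag_names helper is hypothetical, not part
of HtmlAgilityPack:

```python
# Assumed bit values for HtmlElementFlag (inferred, not read from the library).
FLAGS = [('CData', 1), ('Empty', 2), ('Closed', 4), ('CanOverlap', 8)]


def flag_names(value):
    """Return the names of the flags set in `value`, lowest bit first."""
    return [name for name, bit in FLAGS if value & bit]


default_form = 10                 # value printed before the change
patched_form = default_form | 4   # same effect as `|= HtmlElementFlag.Closed`

print(flag_names(default_form))   # ['Empty', 'CanOverlap']
print(patched_form)               # 14
print(flag_names(patched_form))   # ['Empty', 'Closed', 'CanOverlap']
```

Under these assumed values, 10 decodes to CanOverlap OR Empty and 14 to
CanOverlap OR Closed OR Empty, matching the description above.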
Also, I see that this does not affect Fizzler directly, only its clients. It
does, however, affect the Visual and Console Fizzler utilities, which happen to
be Fizzler clients and which should perhaps now have an option to opt in to one
behavior or the other with regard to FORM.
Original comment by azizatif
on 18 May 2009 at 8:49
Original comment by azizatif
on 30 Sep 2009 at 11:23
Original comment by azizatif
on 1 Oct 2009 at 6:19
Fixed in r256.
Original comment by azizatif
on 1 Oct 2009 at 6:20
Original comment by azizatif
on 8 Dec 2009 at 11:11
What alternatives are there to Html Agility Pack and SgmlReader? I have found
SgmlReader to be pretty slow.
Has anyone used HtmlUnit? Could we use http://www.ikvm.net/ to convert that
library to .NET?
Original comment by jake....@gmail.com
on 9 Dec 2009 at 4:26
IKVM.net is excellent and so is HtmlUnit, but the tests I've done show that the
converted code is awfully slow to initialize and somewhat slower during
execution than the unconverted code. That's not to say a patch from you
implementing the conversion and HtmlUnit as a DOM engine in Fizzler should be
declined, though. :)
Original comment by asbjornu
on 9 Dec 2009 at 10:27
I'm still noticing an issue where it seems to return null (no results) when
passing "form" or the ID of a form (e.g. "#form1"), even when using LoadHtml2
or in Visual Fizzler.
Has anyone else still had this issue?
Original comment by mmezza...@gmail.com
on 23 Jul 2010 at 4:14
This issue was closed by revision 073aa958b22b.
Original comment by azizatif
on 4 Jan 2013 at 8:31
This issue was closed by revision 9c7132c82f3c.
Original comment by azizatif
on 4 Jan 2013 at 10:56
Original issue reported on code.google.com by
asbjornu
on 6 May 2009 at 2:31