GrowingData / fizzler

Automatically exported from code.google.com/p/fizzler
GNU General Public License v3.0

QuerySelectorAll on HtmlNode for FORM returns 0 nodes for child INPUTs #24

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?

1. var nodes = formNode.QuerySelectorAll("textarea,input,button,select");
2. Console.WriteLine(nodes.Count());

What is the expected output? What do you see instead?

The method should return all child nodes of "formNode" matching the CSS 
selector. 0 nodes are returned at the moment.

Original issue reported on code.google.com by asbjornu on 6 May 2009 at 2:31

GoogleCodeExporter commented 8 years ago
Do you have a test HTML document that you can attach here as a file and where
this can be reproduced?

Original comment by azizatif on 6 May 2009 at 2:34

GoogleCodeExporter commented 8 years ago
It looks like this is a bug in the HtmlAgilityPack. I tested with the Google
home page, which contains a form with some input fields, using the attached
IronPython script. The result of my interactive test was:

IronPython 2.0 (2.0.0.0) on .NET 2.0.50727.3074
Type "help", "copyright", "credits" or "license" for more information.
>>> import clr
>>> clr.AddReference('HtmlAgilityPack')
>>> from HtmlAgilityPack import HtmlDocument
>>> from System.Net import WebClient
>>> doc = HtmlDocument()
>>> doc.LoadHtml(WebClient().DownloadString('http://www.google.com/'))
>>> root = doc.DocumentNode
>>> print 'FORM tag count = ', root.SelectNodes('//form').Count
FORM tag count =  1
>>> print 'INPUT tag count = ', root.SelectNodes('//input').Count
INPUT tag count =  8
>>> form = root.SelectSingleNode('//form')
>>> print 'FORM tag child count', form.ChildNodes.Count
FORM tag child count 0

The problem is that the ChildNodes property of a form returns an empty
collection! As a result, Fizzler fails to find anything within the descendants
or immediate children of a form.
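
To make the consequence concrete, here is a minimal sketch in plain Python 3 (not IronPython). It is a toy model only: the Node class and the select helper are hypothetical stand-ins, not HtmlAgilityPack or Fizzler types. It assumes that the INPUT elements end up as siblings of the FORM rather than as its children, which is consistent with the counts in the session above.

```python
# Toy model only: Node and select are hypothetical, not HtmlAgilityPack or Fizzler APIs.
class Node:
    def __init__(self, name):
        self.name = name
        self.children = []

    def add(self, child):
        """Append a child node and return it for convenience."""
        self.children.append(child)
        return child

    def descendants(self):
        """Yield every node below this one, depth first."""
        for child in self.children:
            yield child
            yield from child.descendants()


def select(node, names):
    """Return the descendants of `node` whose tag name is in `names`."""
    return [n for n in node.descendants() if n.name in names]


# Assumed shape of the tree when FORM is parsed as an empty element:
# the INPUTs become siblings of the form instead of its children.
root = Node("#document")
body = root.add(Node("body"))
form = body.add(Node("form"))
body.add(Node("input"))
body.add(Node("input"))

print(len(select(root, {"input"})))  # a document-wide search still finds the inputs
print(len(form.children))            # the form itself has no children
print(len(select(form, {"textarea", "input", "button", "select"})))
```

Any selector evaluated relative to the form only ever walks the form's own subtree, so an empty subtree yields an empty result no matter what the selector is.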

It looks like this issue is already logged with HtmlAgilityPack (but
unfortunately with no resolution):

http://htmlagilitypack.codeplex.com/WorkItem/View.aspx?WorkItemId=21782

Original comment by azizatif on 6 May 2009 at 3:03

Attachments:

GoogleCodeExporter commented 8 years ago

Original comment by azizatif on 6 May 2009 at 3:04

GoogleCodeExporter commented 8 years ago
There are alternatives to HtmlAgilityPack available. How much work would it be
to drop HtmlAgilityPack and use something else as our default?

We should unit test this problem if we do swap.

Original comment by info%colinramsay.co.uk@gtempaccount.com on 6 May 2009 at 3:08

GoogleCodeExporter commented 8 years ago
> drop HtmlAgilityPack 

I don't suggest dropping it. Just leave it in there as it is, but perhaps drop
it as the default if a more robust implementation is available.

> How much work would it be to ... use something else as our default?

Shouldn't be a whole lot as long as the other supports a reasonable API
providing access to attributes, children and siblings of a node.

Original comment by azizatif on 6 May 2009 at 3:18

GoogleCodeExporter commented 8 years ago
> drop HtmlAgilityPack and use something else as our default?

Now tracked separately as issue #25.

Original comment by azizatif on 6 May 2009 at 3:23

GoogleCodeExporter commented 8 years ago

Original comment by azizatif on 6 May 2009 at 3:24

GoogleCodeExporter commented 8 years ago
Guys, this is not an HTML agility pack comment. Check back on the work item page:
http://htmlagilitypack.codeplex.com/WorkItem/View.aspx?WorkItemId=21782&ProjectName=htmlagilitypack

Original comment by simon_mo...@hotmail.com on 18 May 2009 at 6:29

GoogleCodeExporter commented 8 years ago
Thanks Simon. We'll take a look at this, as HTMLAgilityPack actually worked
just fine apart from this item. For now we have created a new SgmlReader
wrapper and are using that as our default. I could see us receiving further bug
reports because of this behaviour, so we might just stick with SgmlReader as
the default, but I'll re-open this issue for now.

Original comment by info%colinramsay.co.uk@gtempaccount.com on 18 May 2009 at 8:29

GoogleCodeExporter commented 8 years ago
Simon, thanks for your input on this issue. What would you recommend for the
value of HtmlElementFlag for FORM? By default, it seems to be CanOverlap OR
Empty. I tried also turning on the Closed flag, and that made it work. That is,
with CanOverlap OR Closed OR Empty, one sees INPUT elements appear within
descendants of FORM:

IronPython 2.0 (2.0.0.0) on .NET 2.0.50727.3074
Type "help", "copyright", "credits" or "license" for more information.
>>> import clr
>>> clr.AddReference('HtmlAgilityPack')
>>> from HtmlAgilityPack import *
>>> print HtmlNode.ElementsFlags['form']
10
>>> HtmlNode.ElementsFlags['form'] |= HtmlElementFlag.Closed
>>> print HtmlNode.ElementsFlags['form']
14
>>> from System.Net import WebClient
>>> doc = HtmlDocument()
>>> doc.LoadHtml(WebClient().DownloadString('http://www.google.com/'))
>>> root = doc.DocumentNode
>>> print 'FORM tag count = ', root.SelectNodes('//form').Count
FORM tag count =  1
>>> print 'INPUT tag count = ', root.SelectNodes('//input').Count
INPUT tag count =  8
>>> form = root.SelectSingleNode('//form')
>>> print 'FORM tag child count', form.ChildNodes.Count
FORM tag child count 1
>>> def dump(node, level = 0):
...     print ' ' * level, node.Name
...     for child in node.ChildNodes:
...         dump(child, level + 1)
...
>>> dump(form)
 form
  table
   tr
    td
     #text
    td
     input
     input
     input
     br
     input
     input
    td
     font
      #text
      a
       #text
      br
      #text
      a
       #text
      br
      #text
      a
       #text
   tr
    td
     font
      span
       #text
       input
       label
        #text
       input
       label
        #text
       input
       label
        #text

Also, I see that this does not affect Fizzler directly, only its clients. It
does, however, affect the Visual and Console Fizzler utilities, which happen to
be Fizzler clients and which should perhaps now have an option to opt in to one
behavior or the other with regard to FORM.

Original comment by azizatif on 18 May 2009 at 8:49

GoogleCodeExporter commented 8 years ago

Original comment by azizatif on 30 Sep 2009 at 11:23

GoogleCodeExporter commented 8 years ago

Original comment by azizatif on 1 Oct 2009 at 6:19

GoogleCodeExporter commented 8 years ago
Fixed in r256.

Original comment by azizatif on 1 Oct 2009 at 6:20

GoogleCodeExporter commented 8 years ago

Original comment by azizatif on 8 Dec 2009 at 11:11

GoogleCodeExporter commented 8 years ago
What alternatives are there to Html Agility Pack and SgmlReader? I have found
SgmlReader to be pretty slow.

Has anyone used HtmlUnit? Could we use http://www.ikvm.net/ to convert that
library to .NET?

Original comment by jake....@gmail.com on 9 Dec 2009 at 4:26

GoogleCodeExporter commented 8 years ago
IKVM.net is excellent and so is HtmlUnit, but the tests I've done show that the
converted code is awfully slow to initialize and somewhat slower during
execution than the unconverted code. That's not to say a patch from you
implementing the conversion and HtmlUnit as a DOM engine in Fizzler should be
declined, though. :)

Original comment by asbjornu on 9 Dec 2009 at 10:27

GoogleCodeExporter commented 8 years ago
I'm still noticing an issue where it seems to return null (no results) when
passing "form" or an ID of a form (e.g. "#form1"), even when using LoadHtml2 or
in Visual Fizzler.

Has anyone else still had this issue?

Original comment by mmezza...@gmail.com on 23 Jul 2010 at 4:14

GoogleCodeExporter commented 8 years ago
This issue was closed by revision 073aa958b22b.

Original comment by azizatif on 4 Jan 2013 at 8:31

GoogleCodeExporter commented 8 years ago
This issue was closed by revision 9c7132c82f3c.

Original comment by azizatif on 4 Jan 2013 at 10:56