atifaziz / Hazz

CSS Selectors (via Fizzler) for HtmlAgilityPack (HAP)

Class & "[att~=val]" selectors don't work when whitespace is not just spaces #14

Closed atifaziz closed 4 years ago

atifaziz commented 9 years ago

What steps will reproduce the problem?

  1. Get HtmlDocument from http://shoryuken.com/forum/index.php?events/monthly
  2. Use document.CssSelect("td.primaryContent.weekends.nowWeek.nowToday")

What is the expected output? What do you see instead?

I expect one TD element to be returned. However, there are tabs, carriage returns, and linefeeds in the class attribute on the tag, and only the first class selector (td.primaryContent) works.

What version of the product are you using? On what operating system?

1.0.0.0 - Windows 7

Please provide any additional information below.


Originally reported on Google Code with ID 51

Reported by casperOne@caspershouse.com on 2012-05-20 07:48:08

atifaziz commented 4 years ago

The following code reproduces the problem:

var doc = new HtmlDocument();
doc.LoadHtml(@"
<!doctype html>
<html>
<head>
    <title>Lorem Ipsum</title>
</head>
<body>
    <p class='a
              b
              c'>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
</body>
</html>
");
foreach (var selector in new[] { ".a", ".b", ".c" })
    Console.WriteLine($"{selector} = {doc.DocumentNode.QuerySelectorAll(selector).Count()}");

The output is:

.a = 0
.b = 0
.c = 1

when it should be:

.a = 1
.b = 1
.c = 1

The specification for class selectors says:

Working with HTML, authors may use the "period" notation (also known as "full stop", U+002E, .) as an alternative to the ~= notation when representing the class attribute.

Later in section 6.3.1 (Attribute presence and value selectors), it says:

[att~=val] Represents an element with the att attribute whose value is a whitespace-separated list of words, one of which is exactly "val". If "val" contains whitespace, it will never represent anything (since the words are separated by spaces). Also if "val" is the empty string, it will never represent anything.

and where whitespace is defined as:

Only the characters "space" (U+0020), "tab" (U+0009), "line feed" (U+000A), "carriage return" (U+000D), and "form feed" (U+000C) can occur in whitespace. Other space-like characters, such as "em-space" (U+2003) and "ideographic space" (U+3000), are never part of whitespace.

Clearly, we incorrectly separate on space (U+0020) only:

https://github.com/atifaziz/Hazz/blob/afa6790425882c3cace64d7026703f1de8921802/src/HtmlNodeOps.cs#L71-L74
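For illustration, here is a minimal sketch of word matching that separates on all five whitespace characters from the definition quoted above rather than on space alone. The `CssWhitespace` class and its `ContainsWord` helper are hypothetical names made up for this example; this is not the code in `HtmlNodeOps.cs` and not necessarily how the fix should be shaped.

```csharp
using System;
using System.Linq;

static class CssWhitespace
{
    // The five characters the quoted definition allows in whitespace:
    // space, tab, line feed, carriage return and form feed.
    static readonly char[] Separators = { ' ', '\t', '\n', '\r', '\f' };

    // Hypothetical helper (an assumption, not Hazz's API): returns true when
    // `value`, read as a whitespace-separated list of words, contains `word`.
    // An empty word, or one containing whitespace, never matches.
    public static bool ContainsWord(string value, string word) =>
        !string.IsNullOrEmpty(word)
        && word.IndexOfAny(Separators) < 0
        && value.Split(Separators, StringSplitOptions.RemoveEmptyEntries)
                .Contains(word, StringComparer.Ordinal);
}

static class Program
{
    static void Main()
    {
        // A class attribute value mixing CR, LF, tab, form feed and spaces.
        var classAttribute = "a\r\n\t b\f c";
        foreach (var word in new[] { "a", "b", "c", "d" })
            Console.WriteLine($"{word} = {CssWhitespace.ContainsWord(classAttribute, word)}");
    }
}
```

With this approach the words `a`, `b` and `c` are all found regardless of which whitespace characters separate them, while `d` is not. The same check would serve both the class selector and `[att~=val]`, since the specification defines the former in terms of the latter.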

atifaziz commented 4 years ago

The originally linked page, http://shoryuken.com/forum/index.php?events/monthly, no longer exists, but a copy can be retrieved from the Internet Archive at https://web.archive.org/web/20120508205646/http://shoryuken.com/forum/index.php?events/monthly instead.

atifaziz commented 4 years ago

The same bug exists with the [att~=val] attribute selector:

var doc = new HtmlDocument();
doc.LoadHtml(@"
<!doctype html>
<html>
<head>
    <title>Lorem Ipsum</title>
</head>
<body>
    <p class='a
              b
              c'>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
</body>
</html>
");
foreach (var selector in new[] { "[class~=a]", "[class~=b]", "[class~=c]" })
    Console.WriteLine($"{selector} = {doc.DocumentNode.QuerySelectorAll(selector).Count()}");

prints:

[class~=a] = 0
[class~=b] = 0
[class~=c] = 1