atifaziz closed this issue 4 years ago
The following code reproduces the problem:
var doc = new HtmlDocument();
doc.LoadHtml(@"
<!doctype html>
<html>
<head>
<title>Lorem Ipsum</title>
</head>
<body>
<p class='a
b
c'>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
</body>
</html>
");
foreach (var selector in new[] { ".a", ".b", ".c" })
    Console.WriteLine($"{selector} = {doc.DocumentNode.QuerySelectorAll(selector).Count()}");
The output is:
.a = 0
.b = 0
.c = 1
when it should be:
.a = 1
.b = 1
.c = 1
The specification for class selectors says:
Working with HTML, authors may use the "period" notation (also known as "full stop", U+002E, .) as an alternative to the ~= notation when representing the class attribute.
Section 6.3.1 (Attribute presence and value selectors) defines the ~= notation:
[att~=val]
Represents an element with the att attribute whose value is a whitespace-separated list of words, one of which is exactly "val". If "val" contains whitespace, it will never represent anything (since the words are separated by spaces). Also if "val" is the empty string, it will never represent anything.
and where whitespace is defined as:
Only the characters "space" (U+0020), "tab" (U+0009), "line feed" (U+000A), "carriage return" (U+000D), and "form feed" (U+000C) can occur in whitespace. Other space-like characters, such as "em-space" (U+2003) and "ideographic space" (U+3000), are never part of whitespace.
Clearly, we incorrectly separate on space (U+0020) only.
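The whitespace definition quoted above is narrower than what .NET treats as whitespace: `char.IsWhiteSpace` also accepts characters such as em-space (U+2003) and ideographic space (U+3000), which the specification explicitly excludes. A minimal sketch of a predicate limited to the five specified characters could look like this. The names `SelectorWhitespace` and `IsWhitespace` are hypothetical and are not taken from the library's source:

```csharp
// Hypothetical helper, not part of the library: recognizes only the five
// characters that the Selectors specification allows in whitespace.
static class SelectorWhitespace
{
    public static bool IsWhitespace(char ch)
    {
        switch (ch)
        {
            case ' ':  // space (U+0020)
            case '\t': // tab (U+0009)
            case '\n': // line feed (U+000A)
            case '\r': // carriage return (U+000D)
            case '\f': // form feed (U+000C)
                return true;
            default:
                // Includes em-space (U+2003) and ideographic space (U+3000),
                // which char.IsWhiteSpace would accept.
                return false;
        }
    }
}
```

With this predicate, `SelectorWhitespace.IsWhitespace('\n')` is true while `SelectorWhitespace.IsWhitespace('\u2003')` is false, even though `char.IsWhiteSpace('\u2003')` is true.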
The originally linked page http://shoryuken.com/forum/index.php?events/monthly no longer exists, but it can be retrieved from the Internet Archive at https://web.archive.org/web/20120508205646/http://shoryuken.com/forum/index.php?events/monthly instead.
The same bug exists with the [att~=val] attribute selector:
var doc = new HtmlDocument();
doc.LoadHtml(@"
<!doctype html>
<html>
<head>
<title>Lorem Ipsum</title>
</head>
<body>
<p class='a
b
c'>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
</body>
</html>
");
foreach (var selector in new[] { "[class~=a]", "[class~=b]", "[class~=c]" })
    Console.WriteLine($"{selector} = {doc.DocumentNode.QuerySelectorAll(selector).Count()}");
prints:
[class~=a] = 0
[class~=b] = 0
[class~=c] = 1
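For comparison, the [att~=val] rule quoted from the specification can be sketched as a small standalone function that splits on all five whitespace characters rather than on space alone. This is only an illustration under that reading of the specification. `AttributeIncludes` and `Matches` are hypothetical names, and the code does not reflect how the library is actually implemented:

```csharp
using System;
using System.Linq;

// Hypothetical helper illustrating [att~=val] matching as the specification
// describes it; not the library's own code.
static class AttributeIncludes
{
    // The five characters the Selectors specification allows in whitespace.
    static readonly char[] Whitespace = { ' ', '\t', '\n', '\r', '\f' };

    public static bool Matches(string attributeValue, string word)
    {
        // An empty word, or one containing whitespace, never matches anything.
        if (string.IsNullOrEmpty(word) || word.IndexOfAny(Whitespace) >= 0)
            return false;

        // Split into words on any run of whitespace and look for an exact match.
        return (attributeValue ?? string.Empty)
            .Split(Whitespace, StringSplitOptions.RemoveEmptyEntries)
            .Contains(word, StringComparer.Ordinal);
    }
}
```

Given the value "a\r\nb\r\nc", `AttributeIncludes.Matches` returns true for each of "a", "b", and "c". Splitting the same value on ' ' alone leaves it as a single word, so none of the three would be found as an exact match.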
What steps will reproduce the problem?
HtmlDocument from http://shoryuken.com/forum/index.php?events/monthly
document.CssSelect("td.primaryContent.weekends.nowWeek.nowToday")
What is the expected output? What do you see instead?
I expect one TD element to be returned. However, there are tabs, carriage returns, and line feeds in the class attribute on the tag, and only the first class selector (td.primaryContent) works.
What version of the product are you using? On what operating system?
1.0.0.0 - Windows 7
Please provide any additional information below.
Originally reported on Google Code with ID 51.
Reported by casperOne@caspershouse.com on 2012-05-20 07:48:08