Closed James-LG closed 6 months ago
TODO:
The biggest change is the return type. Before it was a list of items that could be either an HtmlTag or HtmlText. Now the items are a much more complicated type following the XPath specification.
Below is an overview of the returned item type XpathItem:
/// https://www.w3.org/TR/xpath-datamodel-31/#dt-item
#[derive(PartialEq, PartialOrd, Eq, Ord, Debug, Clone, Hash, EnumExtract)]
pub enum XpathItem<'tree> {
    /// A node in the [`XpathItemTree`](crate::xpath::XpathItemTree).
    ///
    /// https://www.w3.org/TR/xpath-datamodel-31/#dt-node
    Node(Node<'tree>),
    /// A function item.
    ///
    /// https://www.w3.org/TR/xpath-datamodel-31/#dt-function-item
    Function(Function),
    /// An atomic value.
    ///
    /// https://www.w3.org/TR/xpath-datamodel-31/#dt-atomic-value
    AnyAtomicType(AnyAtomicType),
}
/// A node in the [`XpathItemTree`](crate::xpath::XpathItemTree).
///
/// https://www.w3.org/TR/xpath-datamodel-31/#dt-node
#[derive(PartialEq, PartialOrd, Eq, Ord, Debug, Clone, Hash, EnumExtract)]
pub enum Node<'tree> {
    /// A node in the [`XpathItemTree`](crate::xpath::XpathItemTree).
    TreeNode(XpathItemTreeNode<'tree>),
    /// A node that is not part of an [`XpathItemTree`](crate::xpath::XpathItemTree).
    NonTreeNode(NonTreeXpathNode),
}
/// Nodes that are not part of the [`XpathItemTree`].
#[derive(PartialEq, PartialOrd, Eq, Ord, Debug, Clone, Hash, EnumExtract)]
pub enum NonTreeXpathNode {
    /// An attribute node.
    AttributeNode(AttributeNode),
    /// A namespace node.
    NamespaceNode(NamespaceNode),
}
/// A node in the [`XpathItemTree`].
#[derive(PartialEq, PartialOrd, Eq, Ord, Debug, Clone, Hash)]
pub struct XpathItemTreeNode<'a> {
    id: NodeId,
    /// The data associated with this node.
    pub data: &'a XpathItemTreeNodeData,
}
/// Nodes that are part of the [`XpathItemTree`].
#[derive(PartialEq, PartialOrd, Eq, Ord, Debug, Hash, EnumExtract)]
pub enum XpathItemTreeNodeData {
    /// The root node of the document.
    DocumentNode(XpathDocumentNode),
    /// An element node.
    ///
    /// HTML tags are represented as element nodes.
    ElementNode(ElementNode),
    /// A processing instruction node.
    PINode(PINode),
    /// A comment node.
    CommentNode(CommentNode),
    /// A text node.
    TextNode(TextNode),
}
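As a rough illustration of how the nesting reads at a call site, here is a minimal sketch that matches through each layer in a single `match`. The types are simplified stand-ins, not the real skyscraper definitions: the payloads are plain strings, the `'tree` lifetime and `NodeId` are dropped, and the function, document, namespace and processing-instruction variants are left out. `describe` is a hypothetical helper.

```rust
// Simplified stand-ins for the types above (assumption: payloads reduced to strings).
#[derive(Debug)]
enum NodeData {
    Element(String),
    Text(String),
    Comment(String),
}

#[derive(Debug)]
enum NonTreeNode {
    Attribute { name: String, value: String },
}

#[derive(Debug)]
enum Node {
    Tree(NodeData),
    NonTree(NonTreeNode),
}

#[derive(Debug)]
enum Item {
    Node(Node),
    Atomic(String),
}

/// Hypothetical helper: describe an item by matching through every layer at once.
fn describe(item: &Item) -> String {
    match item {
        Item::Node(Node::Tree(NodeData::Element(name))) => format!("element <{name}>"),
        Item::Node(Node::Tree(NodeData::Text(text))) => format!("text {text:?}"),
        Item::Node(Node::Tree(NodeData::Comment(_))) => "comment".to_string(),
        Item::Node(Node::NonTree(NonTreeNode::Attribute { name, value })) => {
            format!("attribute {name}={value:?}")
        }
        Item::Atomic(value) => format!("atomic value {value}"),
    }
}
```

Nested patterns keep the tree/non-tree split visible in one place, which may be easier to read than chaining several separate matches.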
To facilitate the new XpathItem type, xpath expressions must now be passed an XpathItemTree rather than an HtmlDocument.
XpathItemTree implements From<&HtmlDocument>, so you can easily generate an XpathItemTree from a reference to an HtmlDocument. Note that this is a fairly expensive operation, so you probably want to perform it only once per HtmlDocument if possible.
let expr = xpath::parse("//td[@class='something']//span").unwrap();
- let results = expr.apply(&html_document)?;
+ let xpath_item_tree = XpathItemTree::from(&html_document);
+ let results = expr.apply(&xpath_item_tree)?;
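The "convert once, query many times" advice can be sketched as follows. Everything here is a stand-in under stated assumptions: `HtmlDocument` and `XpathItemTree` are toy structs, not the real types, and `count_named` and `count_all` are hypothetical helpers that play the role of applying several expressions to one tree.

```rust
// Toy stand-ins (assumption): the real types hold a full document tree.
struct HtmlDocument {
    tags: Vec<String>,
}

struct XpathItemTree {
    nodes: Vec<String>,
}

impl From<&HtmlDocument> for XpathItemTree {
    fn from(doc: &HtmlDocument) -> Self {
        // In the real conversion this walk over the document is the costly step.
        XpathItemTree {
            nodes: doc.tags.clone(),
        }
    }
}

/// Hypothetical stand-in for applying one expression: count nodes with a given name.
fn count_named(tree: &XpathItemTree, name: &str) -> usize {
    tree.nodes.iter().filter(|n| n.as_str() == name).count()
}

/// Build the tree once, then run every query against that same tree.
fn count_all(doc: &HtmlDocument, names: &[&str]) -> Vec<usize> {
    let tree = XpathItemTree::from(doc);
    names.iter().map(|name| count_named(&tree, name)).collect()
}
```

The point of the shape is that the conversion sits outside the per-query loop, so its cost is paid once per document rather than once per expression.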
Text nodes are a type of TreeNode. You can either match on the item, or use these convenient as_[variant] functions.
Other changes:
* Renamed get_text to just text, and get_all_text to all_text.
* They now return a String rather than an Option<String>.

- let text = item.get_text(&html_document).unwrap();
+ let text = item.as_node()?.as_tree_node()?.text(&page);
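For readers unfamiliar with this accessor style, here is a minimal hand-written sketch of what `as_[variant]` functions do and why they chain with `?`. It is not the code that the `EnumExtract` derive generates. The enums are simplified stand-ins, and returning `Option` is an assumption made for brevity (the real accessors may return a `Result` with an error type instead). `text_of` is a hypothetical helper.

```rust
// Simplified stand-ins (assumption): payloads reduced to a String / integer.
#[derive(Debug, PartialEq)]
enum Node {
    Tree(String),
    NonTree(String),
}

#[derive(Debug, PartialEq)]
enum Item {
    Node(Node),
    Atomic(i64),
}

impl Item {
    /// Return the inner node if this item is a node, otherwise None.
    fn as_node(&self) -> Option<&Node> {
        match self {
            Item::Node(node) => Some(node),
            _ => None,
        }
    }
}

impl Node {
    /// Return the tree-node payload if this is a tree node, otherwise None.
    fn as_tree_node(&self) -> Option<&str> {
        match self {
            Node::Tree(text) => Some(text.as_str()),
            _ => None,
        }
    }
}

/// Hypothetical helper: chain the accessors with `?`, bailing out on the first mismatch.
fn text_of(item: &Item) -> Option<&str> {
    let text = item.as_node()?.as_tree_node()?;
    Some(text)
}
```

Each accessor replaces one `match` arm, so a chain of them is roughly equivalent to a nested match that only cares about a single path through the enums.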
Attribute nodes are a type of NonTreeNode. You can now either select these directly in the xpath expression using the attribute axis (new feature), or you can get them from an ElementNode.
- let attribute = item.get_attributes().unwrap().get("href").unwrap();
+ let element = item.as_node()?.as_tree_node()?.data.as_element_node()?;
+ let attribute = element.get_attribute("href").unwrap();
Alternatively, use xpath to select the attribute node directly:
- let expr = xpath::parse("//td[@class='something']//span").unwrap();
- let items = expr.apply(&html_document)?;
- let attribute = items[0].get_attributes().unwrap().get("href").unwrap();
+ let expr = xpath::parse("//td[@class='something']//span/@href").unwrap();
+ let items = expr.apply(&xpath_item_tree)?;
+ let attribute = items[0].as_node()?.as_non_tree_node()?.as_attribute_node()?.value;
The goal of this rewrite is to bring the implementation of the xpath module in line with the official xpath specification as defined in https://www.w3.org/TR/2017/REC-xpath-31-20170321/.
The main advantage of doing this is that supporting more features is easier when you can follow the spec (obviously!).
One of the main limitations of the old xpath module was that it could only return "Text" or "Tag" nodes, which meant there was no way to select other things that xpath supports, like attributes. This rewrite makes that possible, at the cost of some added complexity in the return types.
Fixes #17 Fixes #15
It also fixes indexing, which was previously applied to the total set of items after every step rather than per parent node, as mentioned in #21.
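To make the indexing difference concrete, here is a small sketch that models the children matched under each parent as a list of lists. It is an illustration of the two behaviours described above, not the module's actual implementation, and both function names are hypothetical.

```rust
/// Old behaviour: flatten the matches from every parent, then apply the
/// 1-based index once to the combined list.
fn index_globally<'a>(groups: &[Vec<&'a str>], index: usize) -> Vec<&'a str> {
    match index.checked_sub(1) {
        Some(i) => groups.iter().flatten().copied().nth(i).into_iter().collect(),
        None => Vec::new(), // index 0 selects nothing, since positions start at 1
    }
}

/// New behaviour: apply the 1-based index separately within each parent.
fn index_per_parent<'a>(groups: &[Vec<&'a str>], index: usize) -> Vec<&'a str> {
    match index.checked_sub(1) {
        Some(i) => groups
            .iter()
            .filter_map(|children| children.get(i).copied())
            .collect(),
        None => Vec::new(),
    }
}
```

With two parents holding `["a", "b"]` and `["c", "d"]`, an index of 1 selects only `a` under the old behaviour but `a` and `c` under the new one, which is what a predicate such as `li[1]` is expected to do.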