James-LG / Skyscraper

Rust library for scraping HTML using XPath expressions
MIT License

BREAKING: Complete xpath module rewrite #24

Closed James-LG closed 6 months ago

James-LG commented 6 months ago

The goal of this rewrite is to bring the implementation of the xpath module in line with the official xpath specification as defined in https://www.w3.org/TR/2017/REC-xpath-31-20170321/.

The main advantage of doing this is that supporting more features becomes easier when you can follow the spec (obviously!).

One of the main limitations of the old xpath module was that it could only return "Text" or "Tag" nodes, which meant there was no way to select other things that xpath supports, like attributes. This rewrite makes that possible, at the cost of some added complexity in the return types.

Fixes #17 Fixes #15

It also fixes indexing, which was previously applied to the total set of items after every step rather than per parent node, as mentioned in #21.
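
To make the indexing fix concrete, here is a minimal, self-contained sketch of the two behaviours. It does not use the crate at all: `Parent`, `index_over_total_set` and `index_per_parent` are hypothetical names invented for this illustration, and children are plain string labels rather than real nodes.

```rust
/// Hypothetical stand-in for a parent element and its children, in document order.
struct Parent {
    children: Vec<&'static str>,
}

/// Old behaviour as described above: pool the children of every parent,
/// then apply the 1-based position to the pooled set.
fn index_over_total_set(parents: &[Parent], position: usize) -> Vec<&'static str> {
    let pooled: Vec<&'static str> = parents
        .iter()
        .flat_map(|p| p.children.iter().copied())
        .collect();
    // XPath positions start at 1, so position 0 selects nothing.
    position
        .checked_sub(1)
        .and_then(|i| pooled.get(i).copied())
        .into_iter()
        .collect()
}

/// New behaviour: apply the 1-based position inside each parent separately.
fn index_per_parent(parents: &[Parent], position: usize) -> Vec<&'static str> {
    match position.checked_sub(1) {
        Some(i) => parents
            .iter()
            .filter_map(|p| p.children.get(i).copied())
            .collect(),
        None => Vec::new(),
    }
}
```

With two parents holding `["a1", "a2"]` and `["b1", "b2"]`, position 1 yields only `["a1"]` when applied to the pooled set, but `["a1", "b1"]` when applied per parent.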

James-LG commented 6 months ago

TODO:

James-LG commented 6 months ago

Migration Guide Draft

Item Type

The biggest change is the return type. Previously it was a list of items, each either an HtmlTag or an HtmlText. Now the items are a much more complicated type that follows the XPath specification.

Below is an overview of the returned item type XpathItem:

/// https://www.w3.org/TR/xpath-datamodel-31/#dt-item
#[derive(PartialEq, PartialOrd, Eq, Ord, Debug, Clone, Hash, EnumExtract)]
pub enum XpathItem<'tree> {
    /// A node in the [`XpathItemTree`](crate::xpath::XpathItemTree).
    ///
    ///  https://www.w3.org/TR/xpath-datamodel-31/#dt-node
    Node(Node<'tree>),

    /// A function item.
    ///
    /// https://www.w3.org/TR/xpath-datamodel-31/#dt-function-item
    Function(Function),

    /// An atomic value.
    ///
    /// https://www.w3.org/TR/xpath-datamodel-31/#dt-atomic-value
    AnyAtomicType(AnyAtomicType),
}

/// A node in the [`XpathItemTree`](crate::xpath::XpathItemTree).
///
///  https://www.w3.org/TR/xpath-datamodel-31/#dt-node
#[derive(PartialEq, PartialOrd, Eq, Ord, Debug, Clone, Hash, EnumExtract)]
pub enum Node<'tree> {
    /// A node in the [`XpathItemTree`](crate::xpath::XpathItemTree).
    TreeNode(XpathItemTreeNode<'tree>),

    /// A node that is not part of an [`XpathItemTree`](crate::xpath::XpathItemTree).
    NonTreeNode(NonTreeXpathNode),
}

/// Nodes that are not part of the [`XpathItemTree`].
#[derive(PartialEq, PartialOrd, Eq, Ord, Debug, Clone, Hash, EnumExtract)]
pub enum NonTreeXpathNode {
    /// An attribute node.
    AttributeNode(AttributeNode),

    /// A namespace node.
    NamespaceNode(NamespaceNode),
}

/// A node in the [`XpathItemTree`].
#[derive(PartialEq, PartialOrd, Eq, Ord, Debug, Clone, Hash)]
pub struct XpathItemTreeNode<'a> {
    id: NodeId,

    /// The data associated with this node.
    pub data: &'a XpathItemTreeNodeData,
}

/// Nodes that are part of the [`XpathItemTree`].
#[derive(PartialEq, PartialOrd, Eq, Ord, Debug, Hash, EnumExtract)]
pub enum XpathItemTreeNodeData {
    /// The root node of the document.
    DocumentNode(XpathDocumentNode),

    /// An element node.
    ///
    /// HTML tags are represented as element nodes.
    ElementNode(ElementNode),

    /// A processing instruction node.
    PINode(PINode),

    /// A comment node.
    CommentNode(CommentNode),

    /// A text node.
    TextNode(TextNode),
}
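
As a rough illustration of what "matching on the item" looks like with a nested hierarchy like this, the sketch below uses simplified stand-in enums. They are not the crate's types: lifetimes, node ids, function items and several node kinds are dropped, payloads are plain strings, and `Item`, `TreeNodeData`, `NonTreeNode` and `describe` are names made up for this example.

```rust
// Simplified stand-ins for illustration only; these are NOT the crate's types.
#[derive(Debug, Clone, PartialEq)]
enum Item {
    Node(Node),
    AnyAtomicType(String),
}

#[derive(Debug, Clone, PartialEq)]
enum Node {
    TreeNode(TreeNodeData),
    NonTreeNode(NonTreeNode),
}

#[derive(Debug, Clone, PartialEq)]
enum TreeNodeData {
    Element { name: String },
    Text(String),
    Comment(String),
}

#[derive(Debug, Clone, PartialEq)]
enum NonTreeNode {
    Attribute { name: String, value: String },
}

/// Walks the nesting with `match` and returns a short description of the item.
fn describe(item: &Item) -> String {
    match item {
        Item::Node(Node::TreeNode(data)) => match data {
            TreeNodeData::Element { name } => format!("element <{}>", name),
            TreeNodeData::Text(text) => format!("text {:?}", text),
            TreeNodeData::Comment(text) => format!("comment {:?}", text),
        },
        Item::Node(Node::NonTreeNode(NonTreeNode::Attribute { name, value })) => {
            format!("attribute {}={:?}", name, value)
        }
        Item::AnyAtomicType(value) => format!("atomic value {:?}", value),
    }
}
```

Because every `match` is exhaustive, the compiler flags any code that forgets a node kind, which is part of what the extra nesting buys.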

Xpath Item Tree

To facilitate the new XpathItem type, xpath expressions now must be passed an XpathItemTree rather than an HtmlDocument.

XpathItemTree implements From<&HtmlDocument>, so you can easily generate an XpathItemTree from a reference to an HtmlDocument. Note that this is a fairly expensive operation, so you probably want to perform it only once per HtmlDocument if possible.

let expr = xpath::parse("//td[@class='something']//span").unwrap();
- let results = expr.apply(&html_document)?;
+ let xpath_item_tree = XpathItemTree::from(&html_document);
+ let results = expr.apply(&xpath_item_tree)?;
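
The "convert once, query many times" advice can be sketched without the crate. In the sketch below `Doc`, `Tree`, `count_tag` and `count_all` are hypothetical stand-ins invented for this example (a flat list of tag names instead of a real document), so it only shows the shape of the pattern, not the library's behaviour.

```rust
/// Hypothetical stand-in for a parsed document: a flat list of tag names.
struct Doc {
    tags: Vec<String>,
}

/// Hypothetical stand-in for the tree that queries run against.
struct Tree {
    tags: Vec<String>,
}

impl From<&Doc> for Tree {
    fn from(doc: &Doc) -> Self {
        // Copies the whole document, which is why doing this only once matters.
        Tree {
            tags: doc.tags.clone(),
        }
    }
}

/// Stand-in for applying one expression: counts tags with the given name.
fn count_tag(tree: &Tree, name: &str) -> usize {
    tree.tags.iter().filter(|t| t.as_str() == name).count()
}

/// Builds the tree a single time and reuses it for every query.
fn count_all(doc: &Doc, names: &[&str]) -> Vec<usize> {
    let tree = Tree::from(doc);
    names.iter().map(|name| count_tag(&tree, name)).collect()
}
```

The conversion sits outside the per-query loop, so its cost is paid once however many expressions are applied.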

Getting Text

Text nodes are a type of TreeNode. You can either match on the item or use the convenient as_[variant] functions.

The get_text method is also replaced by text:

- let text = item.get_text(&html_document).unwrap();
+ let text = item.as_node()?.as_tree_node()?.text(&xpath_item_tree);
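
For readers unfamiliar with this accessor style, here is a hand-written sketch of how `as_[variant]` functions can chain with `?`. In the crate these accessors come from the EnumExtract derive. The error type `WrongVariant`, the helper `tree_text` and the trimmed-down enums below are assumptions made for this illustration, not the generated code.

```rust
use std::fmt;

/// Hypothetical error returned when an item is not the requested variant.
#[derive(Debug, PartialEq)]
struct WrongVariant(&'static str);

impl fmt::Display for WrongVariant {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "expected variant {}", self.0)
    }
}

impl std::error::Error for WrongVariant {}

// Trimmed-down stand-ins: payloads are plain strings, not real nodes.
#[allow(dead_code)]
enum Item {
    Node(Node),
    Atomic(String),
}

#[allow(dead_code)]
enum Node {
    Tree(String),
    NonTree(String),
}

impl Item {
    /// Returns the inner node, or an error if this item is not a node.
    fn as_node(&self) -> Result<&Node, WrongVariant> {
        match self {
            Item::Node(node) => Ok(node),
            _ => Err(WrongVariant("Node")),
        }
    }
}

impl Node {
    /// Returns the tree node payload, or an error for non-tree nodes.
    fn as_tree_node(&self) -> Result<&String, WrongVariant> {
        match self {
            Node::Tree(text) => Ok(text),
            _ => Err(WrongVariant("TreeNode")),
        }
    }
}

/// Chains the accessors; the first mismatch short-circuits through `?`.
fn tree_text(item: &Item) -> Result<&str, WrongVariant> {
    Ok(item.as_node()?.as_tree_node()?.as_str())
}
```

The chain reads top-down like the nesting of the enums, and a mismatch at any level surfaces as an error instead of a panic.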

Getting Attributes

Attribute nodes are a type of NonTreeNode. You can now either select these directly in the xpath expression using the attribute axis (new feature), or you can get them from an ElementNode.

- let attribute = item.get_attributes().unwrap().get("href").unwrap();
+ let element = item.as_node()?.as_tree_node()?.data.as_element_node()?;
+ let attribute = element.get_attribute("href").unwrap();

Alternatively, use xpath to select the attribute node directly:

- let expr = xpath::parse("//td[@class='something']//span").unwrap();
- let items = expr.apply(&html_document)?;
- let attribute = items[0].get_attributes().unwrap().get("href").unwrap();
+ let expr = xpath::parse("//td[@class='something']//span/@href").unwrap();
+ let items = expr.apply(&xpath_item_tree)?;
+ let attribute = items[0].as_node()?.as_non_tree_node()?.as_attribute_node()?.value;