James-LG / Skyscraper

Rust library for scraping HTML using XPath expressions
MIT License

Enhance XPath Support for Extracting Internal Tag Parameters #40

Closed DioxusGrow closed 3 weeks ago

DioxusGrow commented 3 weeks ago

Is it possible to add support for extracting internal tag parameters other than a/@href via XPath, for example a/@title or //div/@any_tag_parameter?

James-LG commented 3 weeks ago

What you're referring to are called attributes, and it should be possible to get any attribute using that syntax.

/@class for example is already tested here. The library doesn't care what the name of the attribute is, it works for all of them.
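
To illustrate why the attribute name makes no difference, here is a minimal sketch of what a trailing `/@name` step does conceptually. The `Element` type and the `select_attribute` and `sample_elements` functions are hypothetical stand-ins invented for this illustration, not Skyscraper's API, and the sketch assumes attributes are stored as plain name/value pairs.

```rust
/// Hypothetical stand-in for an HTML element (not a Skyscraper type).
struct Element {
    tag: &'static str,
    attributes: Vec<(&'static str, &'static str)>,
}

/// Roughly mimics `//tag/@name`: keep elements with the given tag,
/// then yield the value of every attribute whose name matches.
/// The attribute name is just data, so any name is handled the same way.
fn select_attribute<'a>(elements: &'a [Element], tag: &str, name: &str) -> Vec<&'a str> {
    elements
        .iter()
        .filter(|el| el.tag == tag)
        .flat_map(|el| el.attributes.iter())
        .filter(|(attr_name, _)| *attr_name == name)
        .map(|(_, value)| *value)
        .collect()
}

/// Two sample elements used to exercise the function above.
fn sample_elements() -> Vec<Element> {
    vec![
        Element {
            tag: "a",
            attributes: vec![("href", "/news"), ("title", "News")],
        },
        Element {
            tag: "div",
            attributes: vec![("class", "card"), ("data-ylk", "slk:news")],
        },
    ]
}
```

With these sample elements, `select_attribute(&els, "a", "title")` plays the role of `//a/@title` and `select_attribute(&els, "div", "data-ylk")` the role of `//div/@data-ylk`; the same code path serves both.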

DioxusGrow commented 3 weeks ago

I didn't notice that right away. However, I get an error for every tag attribute. For example, //ul[contains(@class, "svelte-6i0owd")]//a/@href fails with: expected TreeNode, got NonTreeNode. Below is working code that reproduces the error.

use reqwest;
use serde::{Deserialize, Serialize};
use skyscraper::html;
use skyscraper::xpath::{self, XpathItemTree};
use std::error::Error;
use std::fs::File;
use std::io::prelude::*;

#[derive(Serialize, Deserialize, Debug)]
struct TestXpath {
    result: String,
}
#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let mut queries: Vec<TestXpath> = vec![];

    let res = reqwest::get("https://finance.yahoo.com/?guccounter=1")
        .await?
        .text()
        .await?;

    // Parse the HTML text
    let document = html::parse(&res)?;
    let xpath_item_tree = XpathItemTree::from(&document);

    // Assuming your XPath string is static, it is safe to use `expect` during parsing
    // let title = xpath::parse("//div[@class=\"content svelte-w27v8j\"]/a/h3")
    let test_xpath = xpath::parse("//ul[contains(@class, \"svelte-6i0owd\")]//a/@data-ylk")
        .expect("xpath is invalid")
        .apply(&xpath_item_tree)?;

    for item in test_xpath.iter() {
        let res = TestXpath {
            result: item
                .extract_as_node()
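                // On skyscraper 0.6.0 this step fails for attribute results:
                // expected TreeNode, got NonTreeNode
                .extract_as_tree_node()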
                .text(&xpath_item_tree)
                .unwrap(),
        };
        queries.push(res);
    }

    // Serialize it to a JSON string.
    let test_query = serde_json::to_string(&queries)?;

    let mut file = File::create("output.json")?;
    file.write_all(test_query.as_bytes())?;

    Ok(())
}

James-LG commented 3 weeks ago

That's because in 0.6.0 AttributeNodes were NonTreeNodes. This has been simplified in 0.7.0-beta, where there is no distinction between tree and non-tree nodes. Also note that you shouldn't call text() on an attribute node: it has a value, but it doesn't contain any text nodes.

Here are the docs for this exact use case for v0.7.0-beta.0.
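
The difference between an element's text and an attribute's value can be sketched with a toy node model. Everything below (`Node`, `text`, `attribute_value`, `sample_link`) is a hypothetical simplification written for this illustration; Skyscraper's real types and method names differ, so treat it as a picture of the idea rather than of the library.

```rust
/// Hypothetical, simplified node model for illustration only.
#[allow(dead_code)] // `tag` and `name` are kept for readability, not read below
enum Node {
    Element { tag: String, children: Vec<Node> },
    Text(String),
    Attribute { name: String, value: String },
}

impl Node {
    /// Concatenated text of this node's text descendants.
    /// Attributes hold no text nodes, so they return `None`.
    fn text(&self) -> Option<String> {
        match self {
            Node::Element { children, .. } => {
                Some(children.iter().filter_map(Node::text).collect())
            }
            Node::Text(content) => Some(content.clone()),
            Node::Attribute { .. } => None,
        }
    }

    /// The value of an attribute node; `None` for every other node kind.
    fn attribute_value(&self) -> Option<&str> {
        match self {
            Node::Attribute { value, .. } => Some(value.as_str()),
            _ => None,
        }
    }
}

/// Models `<a href="/news">News</a>` as an element plus its `href` attribute.
fn sample_link() -> (Node, Node) {
    let element = Node::Element {
        tag: "a".to_string(),
        children: vec![Node::Text("News".to_string())],
    };
    let href = Node::Attribute {
        name: "href".to_string(),
        value: "/news".to_string(),
    };
    (element, href)
}
```

In this model, asking the `href` attribute for text yields nothing, while asking it for its value yields `/news`; the element behaves the other way around.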

DioxusGrow commented 3 weeks ago

With this example it works fine:

[dependencies]
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
skyscraper = "0.7.0-beta.0"
reqwest = { version = "0.12.4", features = ["default", "blocking", "cookies", "json", "socks"] }
tokio = { version = "1", features = ["full"] }

use reqwest;
use serde::{Deserialize, Serialize};
use skyscraper::html;
use skyscraper::xpath::{self, XpathItemTree};
use std::error::Error;
use std::fs::File;
use std::io::prelude::*;
use tokio::time::Duration;

#[derive(Serialize, Deserialize, Debug)]
struct TestXpath<'a> {
    result: &'a str,
}
#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let mut queries: Vec<TestXpath> = vec![];

    let client = reqwest::Client::builder()
        .timeout(Duration::from_secs(30)) // Set a 30-second timeout
        .build()?;

    let res = client
        .get("https://finance.yahoo.com/?guccounter=1")
        .send()
        .await?;

    // Parse the HTML text
    let document = html::parse(&res.text().await?)?;
    let xpath_item_tree = XpathItemTree::from(&document);

    let test_xpath = xpath::parse("//ul[@class=\"story-items svelte-6i0owd\"]//a/@href")
        .expect("xpath is invalid")
        .apply(&xpath_item_tree)?;

    for item in test_xpath.iter() {
        let res = TestXpath {
            result: &item.extract_as_node().extract_as_attribute_node().value,
        };
        queries.push(res);
    }

    // Serialize it to a JSON string.
    let test_query = serde_json::to_string(&queries)?;

    let mut file = File::create("output.json")?;
    file.write_all(test_query.as_bytes())?;

    Ok(())
}