DioxusGrow closed this issue 3 weeks ago
What you're referring to are called attributes, and it should be possible to get any attribute using that syntax. /@class, for example, is already tested here. The library doesn't care what the name of the attribute is; it works for all of them.
I didn't notice this right away: querying any tag attribute produces an error.
//ul[contains(@class, \"svelte-6i0owd\")]//a/@href
expected TreeNode, got NonTreeNode
Below is the working code to reproduce the error.
use reqwest;
use serde::{Deserialize, Serialize};
use skyscraper::html;
use skyscraper::xpath::{self, XpathItemTree};
use std::error::Error;
use std::fs::File;
use std::io::prelude::*;

#[derive(Serialize, Deserialize, Debug)]
struct TestXpath {
    result: String,
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let mut queries: Vec<TestXpath> = vec![];

    let res = reqwest::get("https://finance.yahoo.com/?guccounter=1")
        .await?
        .text()
        .await?;

    // Parse the HTML text
    let document = html::parse(&res)?;
    let xpath_item_tree = XpathItemTree::from(&document);

    // Assuming your XPath string is static, it is safe to use `expect` during parsing
    // let title = xpath::parse("//div[@class=\"content svelte-w27v8j\"]/a/h3")
    let test_xpath = xpath::parse("//ul[contains(@class, \"svelte-6i0owd\")]//a/@data-ylk")
        .expect("xpath is invalid")
        .apply(&xpath_item_tree)?;

    for item in test_xpath.iter() {
        let res = TestXpath {
            result: item
                .extract_as_node()
                .extract_as_tree_node()
                .text(&xpath_item_tree)
                .unwrap(),
        };
        queries.push(res);
    }

    // Serialize it to a JSON string.
    let test_query = serde_json::to_string(&queries)?;
    let mut file = File::create("output.json")?;
    file.write_all(test_query.as_bytes())?;

    Ok(())
}
That's because in 0.6.0 AttributeNodes were NonTreeNodes. This has been simplified in 0.7.0-beta, where there is no distinction between tree and non-tree nodes. Also note that you shouldn't call text() on an attribute node: it has a value, and it doesn't contain any text nodes.
Here are the docs for this exact use case for v0.7.0-beta.0.
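To illustrate the value-versus-text distinction described above, here is a minimal, dependency-free sketch. It is a toy model only: the `Node` enum and the `element_text` and `attribute_value` helpers are hypothetical names made up for this illustration and are not skyscraper's actual types or API. The point is just that an attribute node carries a value directly, while text is something only an element's children can provide.

```rust
// Toy model only: these types are hypothetical and are NOT skyscraper's real API.
#[derive(Debug)]
enum Node {
    // An element can contain child nodes, including text nodes.
    Element { children: Vec<Node> },
    // A text node carries character data.
    Text(String),
    // An attribute carries a name and a value, and never has children.
    Attribute { name: String, value: String },
}

// Concatenate the descendant text of an element; returns None for any other node kind.
fn element_text(node: &Node) -> Option<String> {
    match node {
        Node::Element { children } => {
            let mut out = String::new();
            for child in children {
                match child {
                    Node::Text(t) => out.push_str(t),
                    Node::Element { .. } => {
                        out.push_str(&element_text(child).unwrap_or_default())
                    }
                    Node::Attribute { .. } => {}
                }
            }
            Some(out)
        }
        _ => None,
    }
}

// Return the value of an attribute node; returns None for any other node kind.
fn attribute_value(node: &Node) -> Option<&str> {
    match node {
        Node::Attribute { value, .. } => Some(value.as_str()),
        _ => None,
    }
}

fn main() {
    let href = Node::Attribute {
        name: "href".to_string(),
        value: "/news/story".to_string(),
    };
    let link = Node::Element {
        children: vec![Node::Text("Top story".to_string())],
    };

    for node in [&href, &link] {
        match node {
            // Attributes are read through their value, never through text().
            Node::Attribute { name, value } => println!("@{name} = {value}"),
            other => println!("text = {:?}", element_text(other)),
        }
    }
}
```

In this model, asking an attribute for its text yields nothing, which loosely mirrors why calling text() on the result of an /@href query is the wrong accessor.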
With this example it works fine:
[dependencies]
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
skyscraper = "0.7.0-beta.0"
reqwest = { version = "0.12.4", features = ["default", "blocking", "cookies", "json", "socks"] }
tokio = { version = "1", features = ["full"] }
use reqwest;
use serde::{Deserialize, Serialize};
use skyscraper::html;
use skyscraper::xpath::{self, XpathItemTree};
use std::error::Error;
use std::fs::File;
use std::io::prelude::*;
use tokio::time::Duration;

#[derive(Serialize, Deserialize, Debug)]
struct TestXpath<'a> {
    result: &'a str,
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let mut queries: Vec<TestXpath> = vec![];

    let client = reqwest::Client::builder()
        .timeout(Duration::from_secs(30)) // Set a 30-second timeout
        .build()?;

    let res = client
        .get("https://finance.yahoo.com/?guccounter=1")
        .send()
        .await?;

    // Parse the HTML text
    let document = html::parse(&res.text().await?)?;
    let xpath_item_tree = XpathItemTree::from(&document);

    let test_xpath = xpath::parse("//ul[@class=\"story-items svelte-6i0owd\"]//a/@href")
        .expect("xpath is invalid")
        .apply(&xpath_item_tree)?;

    for item in test_xpath.iter() {
        let res = TestXpath {
            result: &item.extract_as_node().extract_as_attribute_node().value,
        };
        queries.push(res);
    }

    // Serialize it to a JSON string.
    let test_query = serde_json::to_string(&queries)?;
    let mut file = File::create("output.json")?;
    file.write_all(test_query.as_bytes())?;

    Ok(())
}
Is it possible to add support for extracting any internal parameters of tags, other than a/@href, via XPath? For example: a/@title, //div/@any_tag_parameter