James-LG / Skyscraper

Rust library for scraping HTML using XPath expressions
MIT License

Simple api #39

Open DioxusGrow opened 3 months ago

DioxusGrow commented 3 months ago

I have some experience with parsing using xpath, and I was very disappointed that there isn’t a proper crate for parsing websites in Rust. Previously I used what is practically the only decent library in Golang, antchfx/htmlquery (used by 11,214 repositories), if you don’t count go-colly.

I would simply suggest a syntax similar to htmlquery's, because names like XpathItemTree look intimidating.

In the readme.md file, show how to install the crate:

cargo add skyscraper

Dependencies in the cargo.toml file

[dependencies]
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
skyscraper = "0.6.4"
reqwest = { version = "0.12.4", features = ["default", "blocking", "cookies", "json", "socks"] }
tokio = { version = "1", features = ["full"] }

And functions with clear names:

use skyscraper::html::{Query, Find, FindOne, SelectAttr, SelectOneAttr};

Find and SelectAttr return a Vec of values. FindOne and SelectOneAttr return a &str value.

As well as a similar API with very simple, understandable examples:

From URL

1) Load HTML document from URL. Default timeout is 30s.

let doc = Query::url("http://example.com/").expect("");

2) Load HTML document from URL with client settings.

let doc = Query::url_client("http://example.com/", &client).expect("");

From file

let file_path = "/home/user/sample.html";
let doc = Query::file(file_path).expect("");

From text

let text = r#"<html>....</html>"#;
let doc = Query::text(text).expect("");

Also, add Find and FindOne functions:

Find all A elements.

let list = Find(&doc, "//a").expect("");

Find all A elements that have an href attribute.

let list = Find(&doc, "//a[@href]").expect("");

Find all A elements with an href attribute and return only the links.

let list = Find(&doc, "//a/@href").expect("");

Find the first A element.

let a = FindOne(&doc, "//a[1]").expect("");

Find the third A element.

let a = FindOne(&doc, "//a[3]").expect("");

SelectAttr is possible but unnecessary, since you can retrieve an element's attribute directly with XPath. The documentation should simply include an example of how to do this for those who have forgotten XPath:

//a/@href
//div/@inner_parameter

Select all attributes

let attr = SelectAttr(&doc, "//img", "src").expect("");

Select one attribute

let attr = SelectOneAttr(&doc, "//img[1]", "src").expect("");

Get the count of elements.

let list = Find(&doc, "//a").expect("");
let count = list.len();

But this is just a subjective example of an API that looks simple and understandable.

James-LG commented 3 months ago

I have some experience with parsing using xpath and I was very disappointed that there isn’t a proper crate for parsing websites in Rust.

I'll start by saying I was in the exact same boat 4 years ago, except I was insane enough to hack together this entire library just to scrape some websites in Rust using xpath expressions. As such I wrote only what I needed, and I had no regard for the official xpath or html specification.

Later I decided some other people might find it handy, so I open-sourced it. A few people eventually found this library and started asking for other xpath features, but the library wasn't written in a way that allowed those features to be added easily, so I rewrote the entire xpath module about 6 months ago, this time following the specification as closely as I could.

This complete rewrite of the xpath module is why the XpathItemTree exists at the moment. The tree follows the xpath specification, but I didn't have time to rewrite the html module, so the tree became a bridge from the old HtmlDocument to the new XpathItemTree. Ideally the html module would directly return what is now called the XpathItemTree but could be called HtmlDocument.

I may attempt to rewrite the html module to bring them in line soon, but it's a big change that will take time.

Load HTML document from url. Default timeout is 30s

No chance I'm adding http requests to this library, simply because there are too many existing http crates in the Rust ecosystem to make a choice for you. Especially since some are async and some are not. This library is fairly light on dependencies and I'd like to keep it that way.

From file

Seems reasonable to add.


Every other example you gave is already possible, but maybe a bit verbose, so I'll look into adding more concise functions.

DioxusGrow commented 3 months ago

What you have done for the community deserves admiration and approval; a huge thank you. A simple API is just light, understandable syntactic sugar, a functional wrapper that any schoolchild can pick up and immediately get a result from, even without knowing Rust. Similarity with the antchfx/htmlquery API would make the transition from the Go community to Rust easy: easy retrieval of the document for processing and clear access to the result, which even a grandmother could understand.

As for the HTTP client, the discussion here is not so much about dependencies as about the completeness of the tool itself. Essentially it is a simple GET request to the site with a timeout if the site does not respond; the expect() reports either that the timeout expired or the server's error code. In the second case, the crate user sets up a client and passes it to the function. That alone would raise the crate to a new level. In antchfx/htmlquery you had to write the client yourself because of the inability to change timeouts and so on. After all, the main task of the crate is to parse HTML pages that live not in a file or in code but on the internet, and yet the main tool for getting a page from the internet is missing. You will see for yourself how adding it would lead to even greater popularity and use by the community.

DioxusGrow commented 3 months ago

You may be right about different clients; in that case, the documentation should show how to create a document for processing, plus a full example of querying and getting the result. For example:

  1. Load HTML document from URL. Default timeout is 30s.

    let client = reqwest::Client::builder()
        .timeout(Duration::from_secs(30)) // Set a 30-second timeout
        .build()?;
    
    let res = client
        .get("https://finance.yahoo.com/?guccounter=1")
        .send()
        .await?;
    
    // Parse the HTML text
    let doc = html::parse(&res.text().await?)?;

    And a full example:

    [dependencies]
    serde = { version = "1.0", features = ["derive"] }
    serde_json = "1.0"
    skyscraper = "0.7.0-beta.0"
    reqwest = { version = "0.12.4", features = ["default", "blocking", "cookies", "json", "socks"] }
    tokio = { version = "1", features = ["full"] }
    
    use reqwest;
    use serde::{Deserialize, Serialize};
    use skyscraper::html;
    use skyscraper::xpath::{self, XpathItemTree};
    use std::error::Error;
    use std::fs::File;
    use std::io::prelude::*;
    use tokio::time::Duration;

#[derive(Serialize, Deserialize, Debug)]
struct TestXpath<'a> {
    result: &'a str,
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let mut queries: Vec<TestXpath> = vec![];

let client = reqwest::Client::builder()
    .timeout(Duration::from_secs(30)) // Set a 30-second timeout
    .build()?;

let res = client
    .get("https://finance.yahoo.com/?guccounter=1")
    .send()
    .await?;

// Parse the HTML text
let doc = html::parse(&res.text().await?)?;
let xpath_item_tree = XpathItemTree::from(&doc);

// Assuming your XPath string is static, it is safe to use `expect` during parsing
let test_xpath = xpath::parse("//ul[@class=\"story-items svelte-6i0owd\"]//a/@href")
    .expect("xpath is invalid")
    .apply(&xpath_item_tree)?;

for item in test_xpath.iter() {
    let res = TestXpath {
        result: &item.extract_as_node().extract_as_attribute_node().value,
    };
    queries.push(res);
}

// Serialize it to a JSON string.
let test_query = serde_json::to_string(&queries)?;

let mut file = File::create("output.json")?;
file.write_all(test_query.as_bytes())?;

Ok(())

}

James-LG commented 3 months ago

v0.7.0-beta.1 has addressed some of these feature requests. See #42 for details.