James-LG / Skyscraper

Rust library for scraping HTML using XPath expressions
MIT License

xpath: how do you select the text of a node? #17

Closed (kesmit13 closed 6 months ago)

kesmit13 commented 11 months ago

I want to be able to select the text of a node using XPath such as the following:

//div/text()

The text() selector doesn't appear to work though. Actually, in my case, I want the text of all of the descendant nodes which I would normally get using:

//div//text()
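
To make the difference between the two expressions concrete, here is a minimal sketch using only the standard library. The Node enum and the child_text, descendant_text, and sample_div helpers are hypothetical names made up for illustration; they are not part of Skyscraper's API. The first helper mirrors what /text() selects on an element (direct child text nodes only), the second what //text() selects (text nodes at any depth, in document order).

```rust
// Hypothetical, simplified node model for illustration only;
// this is NOT Skyscraper's DocumentNode API.
#[allow(dead_code)]
enum Node {
    Text(String),
    Element { tag: String, children: Vec<Node> },
}

// Mirrors `div/text()`: only text nodes that are direct children.
fn child_text(node: &Node) -> Vec<String> {
    match node {
        Node::Element { children, .. } => children
            .iter()
            .filter_map(|c| match c {
                Node::Text(t) => Some(t.clone()),
                Node::Element { .. } => None,
            })
            .collect(),
        Node::Text(_) => Vec::new(),
    }
}

// Mirrors `div//text()`: text nodes at any depth, in document order.
fn descendant_text(node: &Node) -> Vec<String> {
    let mut out = Vec::new();
    if let Node::Element { children, .. } = node {
        for c in children {
            match c {
                Node::Text(t) => out.push(t.clone()),
                Node::Element { .. } => out.extend(descendant_text(c)),
            }
        }
    }
    out
}

// Builds the tree for <div>hi<span>foo</span></div>.
fn sample_div() -> Node {
    Node::Element {
        tag: "div".to_string(),
        children: vec![
            Node::Text("hi".to_string()),
            Node::Element {
                tag: "span".to_string(),
                children: vec![Node::Text("foo".to_string())],
            },
        ],
    }
}
```

With this model, child_text(&sample_div()) yields only "hi", while descendant_text(&sample_div()) yields "hi" followed by "foo", which is the behavior I'm after in the second case.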

James-LG commented 11 months ago

Similar issue to #15

Only nodes can be selected with XPath at this time. Attributes and then text are next on my todo list, in that order.

In the meantime, you can select the nodes with XPath and then use the library's methods to get the text, which should cover your use cases.

See: get_all_text, get_text

James-LG commented 11 months ago

Your second use case of //div//text() would probably have to iterate over the results at the moment.

    use skyscraper::{html, xpath};

    // Parse the text into a document.
    let text = r##"
        <parent>
            <div>
                hi
                <span>foo</span>
            </div>
            <div>
                bye
                <a>bar</a>
            </div>
        </parent>"##;
    let document = html::parse(text).unwrap();

    // Create and apply the xpath; this selects both <div> nodes.
    let xpath = xpath::parse("//div").unwrap();
    let results = xpath.apply(&document).unwrap();

    // Collect the text of each node, including text from its descendants.
    let texts: Vec<String> = results
        .into_iter()
        .filter_map(|n| n.get_all_text(&document))
        .collect();

    // Assertions
    assert_eq!("hi foo", texts[0]);
    assert_eq!("bye bar", texts[1]);

kesmit13 commented 11 months ago

Unfortunately, I'm trying to use XPath expressions in Tera templates, so I don't have access to the API directly, only the results of expressions. At the moment, I'm converting the result to a JSON object with a text attribute containing that content.

James-LG commented 11 months ago

Not sure I understand your use case. If you can call the skyscraper functions to parse the html and xpath, and then apply the xpath to the html, then why can't you also call get_all_text on the result? I'm not familiar with Tera, but a skim of their docs makes it seem like you can call any arbitrary Rust code with a Tera function.

If the /text() xpath function worked, the library would return a Vec of DocumentNodes that all happen to represent text, which still requires iteration and honestly will likely be harder to deal with than the get_all_text method. Just trying to understand what exactly your restrictions are.

kesmit13 commented 11 months ago

In my implementation, I'm converting the result of an xpath query to a JSON object so that it can be traversed by Tera at that point. Basically, something like this:

{{ q(path='/root/book') | get(key='text') | json_encode | safe }}

It would be possible to do what you are saying if I returned a DocumentNode and wrote other Tera functions to extract the information from the nodes using those new functions, but I was hoping to leave all of that to your xpath library.

As far as the Vec of DocumentNodes being returned, that's not an issue because Tera has functions to deal with arrays as well. My code would automatically convert the Vec<DocumentNode> to a serde_json::Value::Array of strings, which Tera can traverse directly, or you could use one of the array functions to join all of them:

{{ q(path='/root/book//text()') | join(sep=' ') | json_encode | safe }}