causal-agent / scraper

HTML parsing and querying with CSS selectors
https://docs.rs/scraper
ISC License
1.79k stars 98 forks source link

Implement Send for ElementRef #175

Closed 0xksure closed 4 months ago

0xksure commented 4 months ago

Problem

I have parsed a list of

and I want to parse the content of each article in separate threads. However, this is not possible since ElementRef doesn't implement Send thus it's not possible to share it across threads.

Solution

Implement Send for ElementRef. Now, I'm not an expert in Rust but rather a novice so if there's a way I can solve this without having to touch your code then that would be awesome.

The code which reports the missing Send:

    let html = reqwest::blocking::get(url)
        .expect("Could not make request")
        .text()
        .expect("Could not read response text");

    // Parse the HTML
    let document = Html::parse_document(&html);

    // Create a Selector to find <article> tags
    // Iterate over each <article> element
    let article_selector = Selector::parse("article").expect("Could not parse article selector");
    let (tx, rx) = mpsc::channel();
    let docments = document.select(&article_selector);
    let num_articles: usize = docments.count();

    let articles = Arc::new(Mutex::new(docments));
    for i in 0..num_articles {
        thread::spawn(move || {
            let a_selector = Selector::parse("a").unwrap();
            let art = articles.clone().lock().unwrap().nth(i).unwrap();
            let a_text = art
                .select(&a_selector)
                .map(|n| n.value().attr("href").unwrap_or(""))
                .fold(String::from(""), |a, f| {
                    return a + &f;
                });
            tx.send(a_text).unwrap();
        });
    }
adamreichold commented 4 months ago

This cannot be implement as ElementRef<'a> contains essentialy &'a Node where Node is not Sync, i.e. &'a Node and hence ElementRef<'a> cannot be Send.

I think the only way to make this work would be

adamreichold commented 4 months ago

The following appears to compile

/*
[package]
name = "test"
version = "0.1.0"
edition = "2021"

[dependencies]
reqwest = { version = "0.11.24", features = ["blocking"] }
scraper = { version = "0.19.0", features = ["atomic"] }
*/
use std::sync::mpsc;
use std::thread;

use scraper::*;

fn main() {
    let html = reqwest::blocking::get("TODO")
        .expect("Could not make request")
        .text()
        .expect("Could not read response text");

    // Parse the HTML
    let document = Html::parse_document(&html);

    // Create a Selector to find <article> tags
    // Iterate over each <article> element
    let article_selector = Selector::parse("article").expect("Could not parse article selector");
    let document_ids = document
        .select(&article_selector)
        .map(|element| element.id())
        .collect::<Vec<_>>();

    let (tx, rx) = mpsc::channel();

    for document_id in document_ids {
        let document = document.clone();
        let tx = tx.clone();
        thread::spawn(move || {
            let a_selector = Selector::parse("a").unwrap();
            let art = ElementRef::wrap(document.tree.get(document_id).unwrap()).unwrap();
            let a_text = art
                .select(&a_selector)
                .map(|n| n.value().attr("href").unwrap_or(""))
                .fold(String::from(""), |a, f| {
                    return a + &f;
                });
            tx.send(a_text).unwrap();
        });
    }

    for a_text in rx {
        dbg!(a_text);
    }
}
adamreichold commented 4 months ago

But please note that the overhead of this is significant and would strongly advise to benchmark whether this will actually result in a speed-up.

0xksure commented 4 months ago

But please note that the overhead of this is significant and would strongly advise to benchmark whether this will actually result in a speed-up.

How come? I guess getting the documentIds ahead of time would lead to even more operations on the document. Maybe it would just be faster to parse the in the first document map. I found it takes a lot of time to parse a handful of tags per

adamreichold commented 4 months ago

How come? I guess getting the documentIds ahead of time would lead to even more operations on the document. Maybe it would just be faster to parse the in the first document map. I found it takes a lot of time to parse a handful of tags per

What I meant is the cost of cloning the whole DOM and starting several threads usually will be higher than the cost of evaluating a few selectors on a single thread.

With just 50 articles in one document it can take several seconds.

This does not sound reasonable. Are you sure you are building/running your code with the --release argument passed to Cargo? Or maybe this is the network overhead instead of the processing duration?

0xksure commented 4 months ago

What I meant is the cost of cloning the whole DOM and starting several threads usually will be higher than the cost of evaluating a few selectors on a single thread.

Yes absolutely. If the Document type implemented Send then it would be possible to wrap it in Arc and send it across threads.

This does not sound reasonable. Are you sure you are building/running your code with the --release argument passed to Cargo? Or maybe this is the network overhead instead of the processing duration?

I'm not running with --release, can try it. Doubt it is the latency, but can benchmark it.

adamreichold commented 4 months ago

If the Document type implemented Send then it would be possible to wrap it in Arc and send it across threads.

This is not how it works. Arc<Document> is Send only if Document is Send + Sync as sending Arc<Document> does expose short-lived &Document instances on multiple threads. This exactly the same reason why ElementRef is not Send.

I'm not running with --release, can try it. Doubt it is the latency, but can benchmark it.

Rust is extremely - even more so than say C++ - dependent on optimizations to produce reasonable performance as most abstractions are "compiled away". Performance investigations without optimizations turned on are pretty much pointless.

0xksure commented 4 months ago

This is not how it works. Arc<Document> is Send only if Document is Send + Sync as sending Arc<Document> does expose short-lived &Document instances on multiple threads. This exactly the same reason why ElementRef is not Send.

Yes so if Document is Send+Sync then each thread would be able to hold one pointer to Document. The same goes for ElementRef. That's why I created the issue, because it would be great to have Send+Sync on ElementRef so that I could use the Article in each thread. Or is there a reason for why this isn't reasonable?

Rust is extremely - even more so than say C++ - dependent on optimizations to produce reasonable performance as most abstractions are "compiled away". Performance investigations without optimizations turned on are pretty much pointless.

True, I will try to run with optimizations.

adamreichold commented 4 months ago

Or is there a reason for why this isn't reasonable?

Because the underlying data structures do not support it, most importantly even with the atomic feature enabled, StrTendril contains Cell<NonZeroUsize>, i.e. uses interior mutability that is not thread-safe.

(It does become Send though, so you can copy a Html instance and move that to another thread. But not a reference to a Html owned by the originating thread because it could destroy the data and the references on the receiving thread would need to be synchronized against that.)

adamreichold commented 4 months ago

And note that these design decisions are targeting better single-threaded performance which is usually the better choice because even without these limitations, having multiple threads process separate documents instead of sharing a single document among them is usually more efficient due to better cache locality.

cfvescovo commented 4 months ago

And note that these design decisions are targeting better single-threaded performance which is usually the better choice because even without these limitations, having multiple threads process separate documents instead of sharing a single document among them is usually more efficient due to better cache locality.

Straight to the point, couldn't have said that better!

cfvescovo commented 4 months ago

As @adamreichold wrote, you should process different documents in different threads. Even downloading them concurrently will generally improve performance.

If you only have a single document, parsing it sequentially will almost always be fast enough. If you wanted to improve performance, you should compile with the release flag and, in extreme cases, even enable full LTO

cfvescovo commented 4 months ago

I don't really know how much of a difference enabling full LTO will make. It will take a lot more time to compile, though. Sometimes it will be worth it, sometimes it won't

0xksure commented 4 months ago

Because the underlying data structures do not support it, most importantly even with the atomic feature enabled, StrTendril contains Cell<NonZeroUsize>, i.e. uses interior mutability that is not thread-safe.

Yes this is basically the error I receive.

(It does become Send though, so you can copy a Html instance and move that to another thread. But not a reference to a Html owned by the originating thread because it could destroy the data and the references on the receiving thread would need to be synchronized against that.)

Copying the html instance is not an option as that more or less goes against the purpose of rust.

adamreichold commented 4 months ago

This is "closed as not planned" then I guess.

0xksure commented 4 months ago

This is "closed as not planned" then I guess.

Yes close it. Thanks!