Closed 0xksure closed 4 months ago
This cannot be implement as ElementRef<'a>
contains essentialy &'a Node
where Node
is not Sync
, i.e. &'a Node
and hence ElementRef<'a>
cannot be Send
.
I think the only way to make this work would be
atomic
feature of scraper
so that StrTendril
at least becomes Send
at cost of single-threaded performance.clone()
the whole document
for each thread, i.e. Send
an instance of Html
there instead of a reference into oneThe following appears to compile
/*
[package]
name = "test"
version = "0.1.0"
edition = "2021"
[dependencies]
reqwest = { version = "0.11.24", features = ["blocking"] }
scraper = { version = "0.19.0", features = ["atomic"] }
*/
use std::sync::mpsc;
use std::thread;
use scraper::*;
fn main() {
let html = reqwest::blocking::get("TODO")
.expect("Could not make request")
.text()
.expect("Could not read response text");
// Parse the HTML
let document = Html::parse_document(&html);
// Create a Selector to find <article> tags
// Iterate over each <article> element
let article_selector = Selector::parse("article").expect("Could not parse article selector");
let document_ids = document
.select(&article_selector)
.map(|element| element.id())
.collect::<Vec<_>>();
let (tx, rx) = mpsc::channel();
for document_id in document_ids {
let document = document.clone();
let tx = tx.clone();
thread::spawn(move || {
let a_selector = Selector::parse("a").unwrap();
let art = ElementRef::wrap(document.tree.get(document_id).unwrap()).unwrap();
let a_text = art
.select(&a_selector)
.map(|n| n.value().attr("href").unwrap_or(""))
.fold(String::from(""), |a, f| {
return a + &f;
});
tx.send(a_text).unwrap();
});
}
for a_text in rx {
dbg!(a_text);
}
}
But please note that the overhead of this is significant and would strongly advise to benchmark whether this will actually result in a speed-up.
But please note that the overhead of this is significant and would strongly advise to benchmark whether this will actually result in a speed-up.
How come? I guess getting the documentIds ahead of time would lead to even more operations on the document. Maybe it would just be faster to parse the in the first document map. I found it takes a lot of time to parse a handful of tags per
How come? I guess getting the documentIds ahead of time would lead to even more operations on the document. Maybe it would just be faster to parse the in the first document map. I found it takes a lot of time to parse a handful of tags per
What I meant is the cost of cloning the whole DOM and starting several threads usually will be higher than the cost of evaluating a few selectors on a single thread.
With just 50 articles in one document it can take several seconds.
This does not sound reasonable. Are you sure you are building/running your code with the --release
argument passed to Cargo? Or maybe this is the network overhead instead of the processing duration?
What I meant is the cost of cloning the whole DOM and starting several threads usually will be higher than the cost of evaluating a few selectors on a single thread.
Yes absolutely. If the Document type implemented Send then it would be possible to wrap it in Arc and send it across threads.
This does not sound reasonable. Are you sure you are building/running your code with the
--release
argument passed to Cargo? Or maybe this is the network overhead instead of the processing duration?
I'm not running with --release, can try it. Doubt it is the latency, but can benchmark it.
If the Document type implemented Send then it would be possible to wrap it in Arc and send it across threads.
This is not how it works. Arc<Document>
is Send
only if Document
is Send + Sync
as sending Arc<Document>
does expose short-lived &Document
instances on multiple threads. This exactly the same reason why ElementRef
is not Send
.
I'm not running with --release, can try it. Doubt it is the latency, but can benchmark it.
Rust is extremely - even more so than say C++ - dependent on optimizations to produce reasonable performance as most abstractions are "compiled away". Performance investigations without optimizations turned on are pretty much pointless.
This is not how it works.
Arc<Document>
isSend
only ifDocument
isSend + Sync
as sendingArc<Document>
does expose short-lived&Document
instances on multiple threads. This exactly the same reason whyElementRef
is notSend
.
Yes so if Document is Send+Sync then each thread would be able to hold one pointer to Document. The same goes for ElementRef. That's why I created the issue, because it would be great to have Send+Sync on ElementRef so that I could use the Article in each thread. Or is there a reason for why this isn't reasonable?
Rust is extremely - even more so than say C++ - dependent on optimizations to produce reasonable performance as most abstractions are "compiled away". Performance investigations without optimizations turned on are pretty much pointless.
True, I will try to run with optimizations.
Or is there a reason for why this isn't reasonable?
Because the underlying data structures do not support it, most importantly even with the atomic
feature enabled, StrTendril
contains Cell<NonZeroUsize>
, i.e. uses interior mutability that is not thread-safe.
(It does become Send
though, so you can copy a Html
instance and move that to another thread. But not a reference to a Html
owned by the originating thread because it could destroy the data and the references on the receiving thread would need to be synchronized against that.)
And note that these design decisions are targeting better single-threaded performance which is usually the better choice because even without these limitations, having multiple threads process separate documents instead of sharing a single document among them is usually more efficient due to better cache locality.
And note that these design decisions are targeting better single-threaded performance which is usually the better choice because even without these limitations, having multiple threads process separate documents instead of sharing a single document among them is usually more efficient due to better cache locality.
Straight to the point, couldn't have said that better!
As @adamreichold wrote, you should process different documents in different threads. Even downloading them concurrently will generally improve performance.
If you only have a single document, parsing it sequentially will almost always be fast enough. If you wanted to improve performance, you should compile with the release flag and, in extreme cases, even enable full LTO
I don't really know how much of a difference enabling full LTO will make. It will take a lot more time to compile, though. Sometimes it will be worth it, sometimes it won't
Because the underlying data structures do not support it, most importantly even with the
atomic
feature enabled,StrTendril
containsCell<NonZeroUsize>
, i.e. uses interior mutability that is not thread-safe.
Yes this is basically the error I receive.
(It does become
Send
though, so you can copy aHtml
instance and move that to another thread. But not a reference to aHtml
owned by the originating thread because it could destroy the data and the references on the receiving thread would need to be synchronized against that.)
Copying the html instance is not an option as that more or less goes against the purpose of rust.
This is "closed as not planned" then I guess.
This is "closed as not planned" then I guess.
Yes close it. Thanks!
Problem
I have parsed a list of and I want to parse the content of each article in separate threads. However, this is not possible since ElementRef doesn't implement Send thus it's not possible to share it across threads.
Solution
Implement Send for ElementRef. Now, I'm not an expert in Rust but rather a novice so if there's a way I can solve this without having to touch your code then that would be awesome.
The code which reports the missing Send: