Gonzih / crabler

Web Crawler for Crabs
https://docs.rs/crabler/
MIT License

Calling run() on a Webscraper means you can't get any intermediate results out #7

Susurrus closed 3 years ago

Susurrus commented 3 years ago

I'm storing some state as I run my web scraper, built from the text/links I'm parsing, and I thought it'd make the most sense to store it as additional fields on my WebScraper struct. However, after I call WebScraper::run(...).await to kick off my scraping run, I can't access the struct afterwards because run() consumes it. I assume this is because run() is the last thing you want to do with a scraper, and it doesn't make sense to call it twice, so you consume it to prevent that. So I assume it doesn't make sense to modify it to use &mut self instead. However, what if it returned the WebScraper struct at the end? This way you could run it again if you wanted to, but you couldn't run it twice simultaneously, as the struct is consumed. This would allow me to get the struct back out at the end and extract the metadata I want.

An alternative implementation I could do is to keep things in a static variable instead. That doesn't feel very Rusty, so I went with storing the state on the struct instead. If a static is the more Rusty way to do this, then I think this can be closed.
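To make the proposal concrete, here is a minimal sketch of a consuming run that hands the struct back when it finishes. It is synchronous and uses only the standard library; `CountingScraper`, `pages_seen`, and this `run` signature are hypothetical names for illustration and are not crabler's actual API.

```rust
// Hypothetical scraper type for illustration; not crabler's API.
struct CountingScraper {
    pages_seen: usize,
}

impl CountingScraper {
    // Takes ownership of `self`, so two runs on the same value cannot
    // overlap, then returns the struct so the caller can read its state.
    fn run(mut self, urls: &[&str]) -> Self {
        for _url in urls {
            // A real scraper would fetch and parse here; this only counts.
            self.pages_seen += 1;
        }
        self
    }
}

fn main() {
    let scraper = CountingScraper { pages_seen: 0 };
    // Rebinding the name gives back access to the collected state.
    let scraper = scraper.run(&["https://example.com/a", "https://example.com/b"]);
    println!("pages seen: {}", scraper.pages_seen);
}
```

Because `run` takes `self` by value, the borrow checker still rules out concurrent runs on the same value, while the returned struct keeps the collected state reachable.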

Susurrus commented 3 years ago

Also, forgot to preface these issues with this, thanks so much for developing this! It's pretty easy to use and works great for my basic application, and I've never used async Rust before! So well done with the ergonomics!

Gonzih commented 3 years ago

I designed crabler around the idea of your struct being storage for the things needed at runtime. If you want something to be accessible outside of the struct, you can store an Rc<RefCell<T>> pointer or something like that in your struct, mutate it in the crawler, and get the values after run exits. There are a couple of ways to do shared read/write state in Rust, and most of them should work with this library. You might need to go with a thread-safe mechanism, since crabler is asynchronous by nature.
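As a rough illustration of the shared-state approach described above, the sketch below keeps a clone of a shared handle outside the struct, lets a consuming `run` fill it in, and reads it afterwards. It uses `Arc<Mutex<...>>`, the thread-safe counterpart of `Rc<RefCell<...>>`, with plain threads standing in for asynchronous tasks. `Scraper`, `visited`, `run`, and `collect_visited` are hypothetical names, not crabler's API.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical stand-in for a scraper struct; not crabler's API.
struct Scraper {
    // Shared handle: the caller keeps a clone, so the data outlives `run`.
    visited: Arc<Mutex<Vec<String>>>,
}

impl Scraper {
    // Consumes `self`, like the `run()` discussed in this issue.
    fn run(self, urls: Vec<String>) {
        let handles: Vec<_> = urls
            .into_iter()
            .map(|url| {
                let visited = Arc::clone(&self.visited);
                // Each worker records its URL through the shared handle.
                thread::spawn(move || {
                    visited.lock().unwrap().push(url);
                })
            })
            .collect();
        for handle in handles {
            handle.join().unwrap();
        }
    }
}

// Runs the scraper and returns what it collected, sorted for a stable order.
fn collect_visited(urls: Vec<String>) -> Vec<String> {
    let visited = Arc::new(Mutex::new(Vec::new()));
    let scraper = Scraper { visited: Arc::clone(&visited) };
    scraper.run(urls); // `scraper` is gone after this call
    let mut out = visited.lock().unwrap().clone();
    out.sort();
    out
}

fn main() {
    let urls = vec![
        "https://example.com/a".to_string(),
        "https://example.com/b".to_string(),
    ];
    println!("{:?}", collect_visited(urls));
}
```

The struct itself is consumed, but the caller's clone of the `Arc` still points at the same data, so nothing needs to be returned from `run`.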

Feel free to let me know if you stumble on any additional problems.

Susurrus commented 3 years ago

Got it. So I need an additional struct that holds the data behind an Rc<RefCell<>>, and that should give me what I wanted. I was thinking that's what I needed, but figured I'd post an issue here to check that this is the intended design of this lib. I would've assumed that shoving this in the struct itself would be the right approach, so this wasn't intuitive for me, but maybe this is a Rust idiom I'm less familiar with (I'm a very casual Rust programmer).

I'm going to close this out as this doesn't need to be solved by anything with this lib.