TidierOrg / TidierVest.jl

Tidier web scraping in Julia, modeled after the rvest R package.
MIT License
30 stars, 3 forks

polite for robots.txt #5

Open PallHaraldsson opened 1 year ago

PallHaraldsson commented 1 year ago

Hi,

I like seeing this new package. I knew of Tidier but was clearly ignorant of rvest, and wasn't expecting it to include web scraping.

Are you reimplementing everything just to have the same API as Tidier, or would this now be the go-to Julia package for web scraping? It seems your dependencies do not do it on their own. I seem to recall Julia people already doing web scraping, but they may have done so by calling Beautiful Soup. (Would that be the best Python package for it, and maybe the best of all, including, at least previously, native Julia packages?)

I see:

> If you’re scraping multiple pages, I highly recommend using rvest in concert with polite. The polite package ensures that you’re respecting the robots.txt and not hammering the site with too many requests.

To be clear, is that not yet implemented or ported in another package, or included in this one? If it belongs here, please add it to the to-do list at #1.

[I like the name and pun, and the logo; it's just very obscure what this package is about. I hope the package will not be overlooked for that reason. It should not be; the name Beautiful Soup doesn't seem to have harmed that package.]

kdpsingh commented 1 year ago

Thanks for the feedback. The original "rvest" name is a play on words on the word "harvest" (as in to harvest/scrape a web page). TidierVest is a further play on words on that.

We will make the README (and eventually the documentation) clear on the purpose of the package.

The TidierVest package is fully implemented in Julia so there is no dependency on R. I love the suggestion of respecting robots.txt and rate-limiting requests. To do this, we would need to implement the concepts from the polite R package in Julia.
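For reference, the two concepts (a robots.txt permission check and a crawl delay) can be sketched as follows. This is written in Python with only its standard library, purely as an illustration; it is not TidierVest or polite code. The robots.txt content, the user-agent name, and the URLs are made-up assumptions, and the file is inlined so nothing touches the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

AGENT = "example-scraper"  # hypothetical user-agent string


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set we can query."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Ask permission before requesting each URL.
allowed = rules.can_fetch(AGENT, "https://example.com/articles/1")
blocked = rules.can_fetch(AGENT, "https://example.com/private/data")

# Seconds the site asks clients to wait between requests (None if unspecified).
delay = rules.crawl_delay(AGENT)
```

A Julia port would need the same two pieces: something that parses robots.txt into queryable rules, and something that honors the delay between requests.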

Will let @jdiaz97 weigh in on his thoughts and whether he has the bandwidth to work on this.

PallHaraldsson commented 1 year ago

The original "rvest" is for harvest; I see it now, though the h is not silent. :)

I believe this package is useful even without Tidier.jl (or its R equivalent, or R itself), since it gives you a DataFrame, so I would make that clear in the docs/README. [And possibly mention that you can still use Tidier with it.]

Given that the name is even more obscure, and I don't think you want to rename the package, I also suggest explaining rvest, and TidierVest in the README. I think it might make it likelier for the name to stick in your mind after you learn this.

Why polite is a separate package in R, I don't know, but if it's not too big, then maybe its functionality fits here as (an optional) feature. I'm not pressing for it to be implemented (soon); I was also curious whether it's already available elsewhere in Julia. Do you think this is the best (non-polite) web scraping (or only?) package in Julia yet? How would you rate this package, or the original vs Beautiful Soup or any best-in-class web scraping package?

[You can of course use polite as is, if really needed, i.e. in R together with rvest, called from Julia; though most likely you cannot use (that) polite with TidierVest.jl.]

I see now: https://stackoverflow.com/questions/59825336/how-can-i-do-web-scraping-in-julia

> and Cascadia.jl to finally scrape using a CSS selector API.

So maybe some functionality belongs there, in that "CSS selector" library. I didn't know what that was, so I overlooked it; it did not at all seem like web scraping functionality. I thought I sort of knew what CSS is about, though.

jdiaz97 commented 7 months ago

Hi @PallHaraldsson, thanks for the comments.

> I also suggest explaining rvest, and TidierVest in the README

Will do

> I don't know, but if it's not too big, then maybe its functionality fits here as (an optional) feature.

I was thinking the same thing. I'm pretty sure we could implement the core functions bow() and scrape() without bloating TidierVest. https://github.com/dmi3kno/polite
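As a rough illustration of how small a bow()/scrape() pair could be, here is a sketch in Python (standard library only). It is not the polite API and not TidierVest code: `Session`, the `fetch`, `clock`, and `sleep` parameters, and the default user-agent are all hypothetical names chosen for the example. One simplification to note: as I understand it, polite's bow() contacts the host itself, whereas this sketch takes the robots.txt text as an argument so it can run offline.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


@dataclass
class Session:
    """State that a bow()-style call hands to scrape(). Hypothetical type."""
    base_url: str
    user_agent: str
    rules: RobotFileParser
    delay: float                          # seconds to wait between requests
    last_request: Optional[float] = None  # clock reading of the previous fetch


def bow(base_url: str, robots_txt: str, user_agent: str = "example-scraper",
        delay: float = 5.0) -> Session:
    """Parse the host's robots.txt once and settle on a delay."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    # Use the larger of our own delay and the site's Crawl-delay, if it sets one.
    site_delay = rules.crawl_delay(user_agent)
    if site_delay is not None:
        delay = max(delay, float(site_delay))
    return Session(base_url, user_agent, rules, delay)


def scrape(session: Session, path: str, fetch: Callable[[str], str],
           clock: Callable[[], float] = time.monotonic,
           sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Fetch one page, or return None if robots.txt disallows it."""
    url = urljoin(session.base_url, path)
    if not session.rules.can_fetch(session.user_agent, url):
        return None
    # Wait out whatever remains of the delay since the previous request.
    if session.last_request is not None:
        remaining = session.delay - (clock() - session.last_request)
        if remaining > 0:
            sleep(remaining)
    session.last_request = clock()
    return fetch(url)
```

The `fetch` callable is injected so the throttling logic stays independent of the HTTP client; in a Julia version that role would presumably be played by whatever request function the package already uses.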

> Do you think this is the best (non-polite) web scraping (or only?) package in Julia yet? How would you rate this package, or the original vs Beautiful Soup or any best-in-class web scraping package?

I think TidierVest has the best syntax right now. It's just syntactic sugar, though; it doesn't add features that didn't exist before. But I like it, and it sits in the same spot as rvest, imo. I haven't used Beautiful Soup, so I can't compare. We're missing some key features that rvest also lacks but Selenium has, so maybe we should take a look at those and see how to implement them.