So the main thing you're doing that's probably not the best idea is that you're searching through every `<p>` element on the page. You end up with about 700 elements (!):
ss("p")
#> { selenider_elements (705) }
#> [1] <p class="content-paragraph lead-magnet">This free Notion document contains t ...
#> [2] <p class="content-paragraph lead-magnet">This free eBook goes over the 10 sli ...
#> [3] <p class="content-paragraph lead-magnet">This free sheet contains 100 acceler ...
#> [4] <p class="content-paragraph lead-magnet">This free sheet contains 100 VC firm ...
#> [5] <p class="content-paragraph lead-magnet">This free sheet contains all the inf ...
#> [6] <p class="pre-content-ad-tag column">Ad</p>
#> [7] <p class="in-content-ad-text column">Description</p>
#> [8] <p>The Big Data & Analytics sector has been growing and evolving in recen ...
#> [9] <p>Investors and venture capital firms have been backing more founders from t ...
#> [10] <p>Here’s a list of venture capital firms providing funds to startups in the ...
#> [11] <p></p>
#> [12] <p class="content-paragraph">Everything you need to raise funding for your st ...
#> [13] <p class="content-paragraph">Information about the countries, cities, stages, ...
#> [14] <p class="content-paragraph">List of 250 startup investors in the AI and Mach ...
#> [15] <p class="content-paragraph">List of startup investors in the BioTech, Health ...
#> [16] <p class="content-paragraph">List of startup investors in the FinTech industr ...
#> [17] <p>Y Combinator is a leading accelerator and venture capital providing mentor ...
#> [18] <p><strong>Details of the VC firm:</strong></p>
#> [19] <p>You can find their website <a href="https://ycombinator.com" target="_blan ...
#> [20] <p>You can send them an email at <a href="mailto:info@ycombinator.com" target ...
#> ...
You're also using `lapply()`, which is pretty inefficient on large sets of elements because of selenider's laziness. In this case, that means the whole set of `<p>` elements is being fetched and filtered on every iteration of the `lapply()` call. It's sort of unsurprising that chromote ends up folding under this many requests (although it is quite annoying).
The other small mistake is that you try to get the `href` of each `<p>` element, when you should be getting the `href` of their child `<a>` elements.
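In other words (a minimal sketch, where `p` stands for a single one of those paragraph elements):

```r
# Wrong: the <p> element itself has no href attribute.
p |> elem_attr("href")

# Right: get the href of the child <a> element instead.
p |> find_element("a") |> elem_attr("href")
```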
The solution to this is to reduce the amount of work that selenider has to do. For example, selecting the immediate parent of all the `<p>` elements works for me:
```r
library(selenider)

session <- selenider_session()
open_url("https://www.failory.com/blog/big-data-analytics-venture-capital-firms")

ss("article.content-rich-text")[[2]] |>
  find_elements("p") |>
  elem_filter(has_text("You can find their website ")) |>
  as.list() |>
  lapply(
    \(x) x |>
      find_element("a") |>
      elem_attr("href")
  )
```
That being said, this is a static site, so you're going to be much better off using rvest:
```r
library(rvest)

html <- read_html("https://www.failory.com/blog/big-data-analytics-venture-capital-firms")

html |>
  html_elements("p") |>
  Filter(x = _, \(x) grepl("You can find their website ", html_text(x))) |>
  lapply(
    \(x) x |>
      html_element("a") |>
      html_attr("href")
  )

# Or, with purrr
library(purrr)

html |>
  html_elements("p") |>
  keep(\(x) grepl("You can find their website ", html_text(x))) |>
  map(
    \(x) x |>
      html_element("a") |>
      html_attr("href")
  )
```
This last solution runs in less than a second for me.
@ashbythorpe Thank you so much, I appreciate your complete and helpful answer.
I initially reached for selenider to solve this task because of its useful functions like `elem_filter(has_text())`. This conditional filtering was very handy for identifying the target elements to scrape.
But as you pointed out, rvest is probably the more suitable tool for this task. Actually, although your selenider-based solution works, I was never able to iterate successfully through all items in the list; it never got past item ~50.
Also, looking at the structure of the selenider list object with `str(ss("article.content-rich-text")[[2]])`, I tried to find a way to drill down to the information I need, to reduce the size of the object selenider has to manipulate, but I was not able to do this.
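For reference, one way to drill down in a single step (a sketch, not from the original thread) is to scope the CSS selector itself, so selenider never has to handle the full set of `<p>` elements:

```r
library(selenider)

session <- selenider_session()
open_url("https://www.failory.com/blog/big-data-analytics-venture-capital-firms")

# The descendant selector limits the search to paragraphs inside the
# rich-text article containers, rather than all ~700 <p> elements.
ss("article.content-rich-text p") |>
  elem_filter(has_text("You can find their website "))
```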
So I turned to your rvest-based solution, which works perfectly.
Thank you once again for your help and your willingness to assist.
I want to scrape website addresses from this site:
https://www.failory.com/blog/big-data-analytics-venture-capital-firms

The page is structured as a sequence of `p` tags, so I wanted to use the text "You can find their website", located before each web address, for identification. The excellent function `elem_find(has_text())` is perfect for this, I think. The following creates a list of paragraphs containing addresses:
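A sketch of that selection, reconstructed from the approach described in the answer above:

```r
library(selenider)

session <- selenider_session()
open_url("https://www.failory.com/blog/big-data-analytics-venture-capital-firms")

# Keep only the paragraphs that introduce a website link.
ss("p") |>
  elem_filter(has_text("You can find their website "))
```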
Then I tried several variants and combinations of iterating over that list, but none of them worked for me. I also tried the conventional way, but I was never able to complete the script; I always got an error.
Overall, the entire RStudio session becomes unstable when working on this particular case and trying different alternatives.
Can you please advise me on the best way to handle selenider element lists like this? Many thanks in advance.