ashbythorpe / selenider

Concise, Lazy and Reliable Wrapper for 'chromote' and 'selenium'
https://ashbythorpe.github.io/selenider/

Hints on how to process a Selenider elements list #35

Closed rcepka closed 2 weeks ago

rcepka commented 3 weeks ago

I want to scrape website addresses from this site: https://www.failory.com/blog/big-data-analytics-venture-capital-firms

The page is structured as a sequence of <p> tags, so I wanted to use the text "You can find their website", which appears before each web address, to identify the right paragraphs. The excellent elem_filter(has_text()) function seems perfect for this, I think.

session <- selenider::selenider_session(
  "chromote",
  timeout = 10
)

selenider::open_url("https://www.failory.com/blog/big-data-analytics-venture-capital-firms")

The following creates a list of paragraphs containing the addresses:

  par_websites <- selenider::ss("p") |> selenider::elem_filter(selenider::has_text("You can find their website "))

{ selenider_elements (141) }
[1] <p>You can find their website <a href="https://ycombinator.com" target="_blank" rel="nofollow"><strong>here</strong></a>.</p>
[2] <p>You can find their website <a href="https://techstars.com" target="_blank" rel="nofollow"><strong>here</strong></a>.</p>
[3] <p>You can find their website <a href="https://lsvp.com" target="_blank" rel="nofollow"><strong>here</strong></a>.</p>

Then I tried the following (and several variants and combinations):

par_websites |> lapply(\(x) selenider::elem_attr(x, "href"))

and

par_websites |> as.list() |> lapply(\(x) selenider::elem_attr(x, "href"))

but neither worked for me.

Error: Chromote: timed out waiting for response to command DOM.describeNode
Called from: doTryCatch(return(expr), name, parentenv, handler)

I also tried the conventional way:

  webs_1 <- selenider::ss("p") |> selenider::elem_filter(selenider::has_text("You can find their website "))
  websites <- list()

  for (i in seq_along(webs_1)) {
    website <- webs_1[[i]] %>% read_html() %>% html_element("a") %>% html_attr("href")
    websites <- append(websites, website)
  }

I was never able to complete the script; I always got this error:

Error: Chromote: timed out waiting for response to command DOM.describeNode

Overall, RStudio itself becomes unstable when I work on this particular case and try out different alternatives.

Can you please advise me on the best way to handle Selenider element lists like this? Many thanks in advance.

ashbythorpe commented 3 weeks ago

So the main thing you're doing that's probably not the best idea is that you're sorting through every <p> element on the page. You end up with about 700 elements (!).

ss("p")
#> { selenider_elements (705) }
#> [1] <p class="content-paragraph lead-magnet">This free Notion document contains t ...
#> [2] <p class="content-paragraph lead-magnet">This free eBook goes over the 10 sli ...
#> [3] <p class="content-paragraph lead-magnet">This free sheet contains 100 acceler ...
#> [4] <p class="content-paragraph lead-magnet">This free sheet contains 100 VC firm ...
#> [5] <p class="content-paragraph lead-magnet">This free sheet contains all the inf ...
#> [6] <p class="pre-content-ad-tag column">Ad</p>
#> [7] <p class="in-content-ad-text column">Description</p>
#> [8] <p>The Big Data &amp; Analytics sector has been growing and evolving in recen ...
#> [9] <p>Investors and venture capital firms have been backing more founders from t ...
#> [10] <p>Here’s a list of venture capital firms providing funds to startups in the  ...
#> [11] <p>‍</p>
#> [12] <p class="content-paragraph">Everything you need to raise funding for your st ...
#> [13] <p class="content-paragraph">Information about the countries, cities, stages, ...
#> [14] <p class="content-paragraph">List of 250 startup investors in the AI and Mach ...
#> [15] <p class="content-paragraph">List of startup investors in the BioTech, Health ...
#> [16] <p class="content-paragraph">List of startup investors in the FinTech industr ...
#> [17] <p>Y Combinator is a leading accelerator and venture capital providing mentor ...
#> [18] <p><strong>Details of the VC firm:</strong></p>
#> [19] <p>You can find their website <a href="https://ycombinator.com" target="_blan ...
#> [20] <p>You can send them an email at <a href="mailto:info@ycombinator.com" target ...
#> ...

You're also using lapply(), which is pretty inefficient on large sets of elements because of selenider's laziness. In this case, what that means is that this set of <p> elements is being fetched and filtered again on every iteration of the lapply() call. It's sort of unsurprising that chromote ends up folding under this many requests (although it is quite annoying).
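If you do want to keep the lapply() approach, one thing that should help (a sketch I haven't benchmarked here, and I'm assuming elem_cache() behaves on element collections as documented) is to force the filtered set to be collected once up front, so it isn't re-fetched on every iteration:

# Untested sketch: elem_cache() collects and stores the matching elements now,
# so later uses of the collection don't re-run ss("p") + elem_filter().
par_websites_cached <- par_websites |> selenider::elem_cache()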

The other small mistake is that you try to get the href of each <p> element, where you should be getting the href of their child <a> elements.

The solution to this is to reduce the amount of work that selenider has to do. For example, selecting the immediate parent of all the <p> elements works for me:

library(selenider)

session <- selenider_session()

open_url("https://www.failory.com/blog/big-data-analytics-venture-capital-firms")

ss("article.content-rich-text")[[2]] |>
  find_elements("p") |>
  elem_filter(has_text("You can find their website ")) |>
  as.list() |>
  lapply(
    \(x) x |>
      find_element("a") |>
      elem_attr("href")
  )
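If you'd rather end up with a plain character vector, you can collapse the result with unlist() at the end (same pipeline as above; `websites` is just a name I've picked here):

# Same idea, collapsed into a character vector of hrefs.
websites <- ss("article.content-rich-text")[[2]] |>
  find_elements("p") |>
  elem_filter(has_text("You can find their website ")) |>
  as.list() |>
  lapply(
    \(x) x |>
      find_element("a") |>
      elem_attr("href")
  ) |>
  unlist()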

That being said, this is a static site, so you're going to be much better off using rvest:

library(rvest)

html <- read_html("https://www.failory.com/blog/big-data-analytics-venture-capital-firms")

html |>
  html_elements("p") |>
  Filter(x = _, \(x) grepl("You can find their website ", html_text(x))) |>
  lapply(
    \(x) x |>
      html_element("a") |>
      html_attr("href")
  )

# Or, with purrr
library(purrr)

html |>
  html_elements("p") |>
  keep(\(x) grepl("You can find their website ", html_text(x))) |>
  map(
    \(x) x |>
      html_element("a") |>
      html_attr("href")
  )

This last solution runs in less than a second for me.
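For completeness, an XPath selector can also do the text filtering in one step, which avoids the explicit Filter()/keep() pass (a sketch along the same lines, not benchmarked):

# Select the <a> inside every <p> whose text contains the marker phrase,
# then pull the href attributes directly.
html |>
  html_elements(xpath = "//p[contains(., 'You can find their website')]/a") |>
  html_attr("href")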

rcepka commented 2 weeks ago

@ashbythorpe Thank you so much, I appreciate your complete and helpful answer. I initially reached for Selenider to solve this task because of its useful elem_filter(has_text()) functions; this conditional filtering was very handy for identifying the target object to scrape. But as you pointed out, rvest is probably the more suitable tool for this task. Actually, although your Selenider-based solution works, I was never able to iterate successfully through all items in the list; it never got past item ~50. Also, looking at the structure of the Selenider list object with str(ss("article.content-rich-text")[[2]]), I tried to find a way to drill down to the information I need, to reduce the size of the object Selenider has to manipulate, but I was not able to do this. So I turned to your rvest-based solution, which works perfectly. Thank you once again for your help and your willingness to assist.