webbotparseR parses search engine results that were scraped with the WebBot browser extension. A similar Python library is also available.
You can install the development version of webbotparseR like so:
# install.packages("remotes")  # needed once if remotes is not yet installed
remotes::install_github("schochastics/webbotparseR")
The package contains an example HTML file from a Google search on climate change.
library(webbotparseR)
ex_file <- system.file("www.google.com_climatechange_text_2023-03-16_08_16_11.html", package = "webbotparseR")
Such search results can be parsed with the function parse_search_results(). The engine parameter specifies both the search engine and the search type.
output <- parse_search_results(path = ex_file, engine = "google text")
output
#> # A tibble: 10 × 10
#> title link text image page position search_engine type query
#> <chr> <chr> <chr> <chr> <chr> <int> <chr> <chr> <chr>
#> 1 What Is Climate C… http… Clim… data… 1 1 www.google.c… text clim…
#> 2 Home – Climate Ch… http… Vita… data… 1 2 www.google.c… text clim…
#> 3 Vital Signs of th… http… “Cli… data… 1 3 www.google.c… text clim…
#> 4 Climate change - … http… In c… data… 1 4 www.google.c… text clim…
#> 5 IPCC — Intergover… http… The … data… 1 5 www.google.c… text clim…
#> 6 Climate Change | … http… Comp… data… 1 6 www.google.c… text clim…
#> 7 Climate change: e… http… Clim… <NA> 1 7 www.google.c… text clim…
#> 8 UNFCCC http… What… data… 1 8 www.google.c… text clim…
#> 9 Climate Change - … http… Clim… data… 1 9 www.google.c… text clim…
#> 10 Causes of climate… http… This… data… 1 10 www.google.c… text clim…
#> # ℹ 1 more variable: date <dttm>
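Because the result is an ordinary data frame, it can be explored with base R. The following is a minimal sketch, not part of the package: `results` is a small made-up stand-in that only mimics some of the column names shown above (title, link, image, position), and the URLs and image strings are placeholders. It extracts the domain from each link, counts results per domain, and drops rows whose image is missing, as in row 7 of the output above.

```r
# Stand-in for `output`: made-up rows that only mimic the column names above.
results <- data.frame(
  title    = c("Result A", "Result B", "Result C"),
  link     = c("https://example.org/a", "https://example.com/b", "https://example.org/c"),
  image    = c("data:image/png;base64,AAAA", NA, "data:image/png;base64,BBBB"),
  position = 1:3,
  stringsAsFactors = FALSE
)

# Extract the host name from each link.
results$domain <- sub("^https?://([^/]+).*$", "\\1", results$link)

# Count how many results each domain contributes.
domain_counts <- table(results$domain)

# Keep only the results that came with an image.
with_image <- results[!is.na(results$image), ]
```

The same expressions should work on the real `output` tibble, assuming its columns are named as in the printed result.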
Note that images are always returned base64 encoded, as a data URI string (or NA when a result has no image).
output$image[1]
#> [1] "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAIAAACQkWg2AAAABnRSTlMAAAAAAABupgeRAAAAMklEQVR4AWMAgYYG4hEdNJAHGoCIABvBJayhgcYaIAwaakCwydUA52MKYeeSCgZh4gMAXrJ9ASggqqAAAAAASUVORK5CYII="
The function base64_to_img() can be used to decode the image and save it in an appropriate format.
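To illustrate what decoding such a string involves, here is a hand-rolled sketch that does not use base64_to_img() and makes no claim about how that function is implemented. It assumes the jsonlite package is installed (only its base64_dec() function is used); the variable names and the output file name are hypothetical. The data URI is the one printed above.

```r
# Sketch only: decodes a data URI by hand; this is not the package's base64_to_img().
# Assumes jsonlite is installed; it is used solely for base64_dec().
data_uri <- "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAIAAACQkWg2AAAABnRSTlMAAAAAAABupgeRAAAAMklEQVR4AWMAgYYG4hEdNJAHGoCIABvBJayhgcYaIAwaakCwydUA52MKYeeSCgZh4gMAXrJ9ASggqqAAAAAASUVORK5CYII="

# Split the URI into its MIME type and its base64 payload.
mime_type <- sub("^data:([^;]+);base64,.*$", "\\1", data_uri)
payload   <- sub("^data:[^;]+;base64,", "", data_uri)

# Decode the payload to raw bytes.
img_bytes <- jsonlite::base64_dec(payload)

# Derive a file extension from the MIME type and write the bytes to a temp file.
ext      <- sub("^image/", "", mime_type)
out_file <- file.path(tempdir(), paste0("result_1.", ext))
writeBin(img_bytes, out_file)
```

The MIME type in the prefix (here image/png) is what determines an appropriate file format for the decoded bytes.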