gesistsa / webbotparseR

:mag: R package to parse search engine results
https://gesistsa.github.io/webbotparseR/

webbotparseR


webbotparseR parses search engine results that were scraped with the WebBot browser extension. A similar Python library is also available.

Installation

You can install the development version of webbotparseR like so:

remotes::install_github("schochastics/webbotparseR")

The package ships with an example HTML file from a Google search for "climate change".

library(webbotparseR)
ex_file <- system.file("www.google.com_climatechange_text_2023-03-16_08_16_11.html", package = "webbotparseR")

Such search results can be parsed with the function parse_search_results(). The engine parameter specifies both the search engine and the search type.

output <- parse_search_results(path = ex_file, engine = "google text")
output
#> # A tibble: 10 × 10
#>    title              link  text  image page  position search_engine type  query
#>    <chr>              <chr> <chr> <chr> <chr>    <int> <chr>         <chr> <chr>
#>  1 What Is Climate C… http… Clim… data… 1            1 www.google.c… text  clim…
#>  2 Home – Climate Ch… http… Vita… data… 1            2 www.google.c… text  clim…
#>  3 Vital Signs of th… http… “Cli… data… 1            3 www.google.c… text  clim…
#>  4 Climate change - … http… In c… data… 1            4 www.google.c… text  clim…
#>  5 IPCC — Intergover… http… The … data… 1            5 www.google.c… text  clim…
#>  6 Climate Change | … http… Comp… data… 1            6 www.google.c… text  clim…
#>  7 Climate change: e… http… Clim… <NA>  1            7 www.google.c… text  clim…
#>  8 UNFCCC             http… What… data… 1            8 www.google.c… text  clim…
#>  9 Climate Change - … http… Clim… data… 1            9 www.google.c… text  clim…
#> 10 Causes of climate… http… This… data… 1           10 www.google.c… text  clim…
#> # ℹ 1 more variable: date <dttm>
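
Because the result is a regular tibble, it can be processed with ordinary data frame tools. The following is a minimal sketch in base R of one possible follow-up step, extracting the domain of each result and keeping the top-ranked hits. The data frame `results` is a made-up stand-in that only reuses the column names title, link and position from the output above; its values are not real search results, and the `domain` column is introduced here for illustration only.

```r
# Hypothetical stand-in for the tibble returned by parse_search_results();
# only the column names are taken from the output above, the values are made up.
results <- data.frame(
  title    = c("Result A", "Result B", "Result C"),
  link     = c("https://example.org/a", "https://example.com/b", "https://example.org/c"),
  position = 1:3,
  stringsAsFactors = FALSE
)

# Extract the host part of each URL (illustrative helper column, not part of the package output)
results$domain <- sub("^https?://([^/]+).*$", "\\1", results$link)

# Keep the two top-ranked results and a few columns of interest
top <- results[results$position <= 2, c("position", "title", "domain")]

# Count how often each domain appears in the result list
domain_counts <- table(results$domain)
```

With real data, `results` would be replaced by the `output` object created above.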

Note that images are always returned as base64-encoded data URIs.

output$image[1]
#> [1] "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAIAAACQkWg2AAAABnRSTlMAAAAAAABupgeRAAAAMklEQVR4AWMAgYYG4hEdNJAHGoCIABvBJayhgcYaIAwaakCwydUA52MKYeeSCgZh4gMAXrJ9ASggqqAAAAAASUVORK5CYII="

The function base64_to_img() can be used to decode the image and save it in an appropriate format.
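
To illustrate what decoding such a value involves, here is a minimal base R sketch that strips the data URI prefix, decodes the base64 payload and writes the bytes to a file. This is not the implementation of base64_to_img(), whose arguments are not shown here; the helper `decode_base64()` and the variable names are hypothetical and exist only for this example. The input string is the value printed above.

```r
# Hypothetical helper: decode a base64 string into a raw vector using base R only.
decode_base64 <- function(x) {
  alphabet <- c(LETTERS, letters, 0:9, "+", "/")
  chars <- strsplit(gsub("=+$", "", x), "")[[1]]   # drop padding, split into characters
  vals <- match(chars, alphabet) - 1L              # 6-bit value of each character
  # Expand every value into its 6 bits, most significant bit first
  bits <- as.vector(vapply(
    vals,
    function(v) rev(as.integer(intToBits(v))[1:6]),
    integer(6)
  ))
  n_bytes <- length(bits) %/% 8                    # leftover bits are padding
  bits <- bits[seq_len(n_bytes * 8)]
  as.raw(colSums(matrix(bits, nrow = 8) * 2^(7:0)))
}

# The data URI printed above
img <- "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAIAAACQkWg2AAAABnRSTlMAAAAAAABupgeRAAAAMklEQVR4AWMAgYYG4hEdNJAHGoCIABvBJayhgcYaIAwaakCwydUA52MKYeeSCgZh4gMAXrJ9ASggqqAAAAAASUVORK5CYII="

payload <- sub("^data:image/[a-z]+;base64,", "", img)  # remove the data URI prefix
bytes <- decode_base64(payload)

# Write the decoded bytes to a temporary PNG file
out_file <- tempfile(fileext = ".png")
writeBin(bytes, out_file)
```

The first eight decoded bytes form the PNG file signature, which is a quick way to check that the payload really is a PNG image. In practice, base64_to_img() is the intended route.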