gastonbecerra / ojsr

R package to crawl and scrape OJS (open journal system)
3 stars 1 forks source link
oai-pmh ojs rstats scraper web-scraping

OJS Scraper for R

CRAN status R-CMD-check

The aim of this package is to aid you in crawling OJS archives, issues, articles, galleys, and search results, and retrieving/scraping metadata from articles. ojsr functions rely on OJS routing conventions to compose the URL for different scraping scenarios.

Installation

From CRAN:

install.packages('ojsr') 

From Github:

install.packages('devtools') 
devtools::install_github("gastonbecerra/ojsr")

ojsr functions

Example

Let's say we want to collect metadata from some journals to compare their top keywords. We have the journals' names and URLs, and can use ojsr to scrap their issues, articles and metadata.


library(dplyr) 
library(ojsr)

journals <- data.frame ( cbind(
    name = c( "Revista Evaluar", "PSocial" ),
    url = c( "https://revistas.unc.edu.ar/index.php/revaluar", "https://publicaciones.sociales.uba.ar/index.php/psicologiasocial")
  ), stringsAsFactors = FALSE )

# we are using the journal URL as input to retrieve the issues
issues <- ojsr::get_issues_from_archive(input_url = journals$url) 

# we are using the issues URL we just scraped as an input to retrieve the articles
articles <- ojsr::get_articles_from_issue(input_url = issues$output_url)

# we are using the articles URL we just scraped as an input to retrieve the metadata
metadata <- ojsr::get_html_meta_from_article(input_url = articles$output_url)

# let's parse the base URLs from journals and metadata, so we can bind by journal
journals$base_url <- ojsr::parse_base_url(journals$url)
metadata$base_url <- ojsr::parse_base_url(metadata$input_url)

metadata %>% filter(meta_data_name=="citation_keywords") %>% # filtering only keywords
  left_join(journals) %>% # include journal names
  group_by(base_url, keyword = meta_data_content) %>% tally(sort=TRUE)