eblondel / zen4R

zen4R - R Interface to Zenodo REST API
https://github.com/eblondel/zen4R/wiki

Bug in getRecordByDOI ? #98

Closed sigven closed 1 year ago

sigven commented 2 years ago

Hi,

Thanks for a very elegant piece of software. I have developed a tool that uses zen4R. I just tried the latest version of zen4R (v0.7), and I now get the following error when calling getRecordByDOI:

> zenodo <- zen4R::ZenodoManager$new()
> zenodo_doi <- "10.5281/zenodo.6828501"
> rec <- zenodo$getRecordByDOI(zenodo_doi)
Error in result$doi : $ operator is invalid for atomic vectors

This used to work for v0.6-1.

Thanks for any help,

Best, Sigve

ablaette commented 2 years ago

I gratefully concur with the big thank you for zen4R!

Adding to the issue reported by Sigve: My package cwbtools runs into the same issue, see this issue with another sample: https://github.com/PolMine/cwbtools/issues/42

Looking at the thread of issue #84, I would have thought that this was solved with v0.7.0. It would be great if you could restore this part of the functionality!

Andreas

jeffreyhanson commented 1 year ago

I seem to be experiencing this issue again (on the university network, if that makes a difference). @sigven and @ablaette, it might be worth commenting on the Zenodo issue (https://github.com/zenodo/zenodo/issues/2358) to help increase visibility.

sigven commented 1 year ago

@jeffreyhanson will do. For my R packages hosting data, I have currently switched to googledrive, although I think Zenodo is a much more attractive option, considering all its capabilities with respect to versioning etc.

jeffreyhanson commented 1 year ago

Awesome - thanks! Yeah, nice idea! I think I might look into writing some utility functions to scrape file download URLs and version numbers, to avoid switching off Zenodo entirely.

jeffreyhanson commented 1 year ago

Here's an attempt at some utility functions for scraping information. Feel free to re-use/modify them if you find them useful. I've tried to implement these functions so that their outputs have a similar format to those of the zen4R package, so hopefully they can serve as a drop-in replacement. Dependencies include assertthat, rvest, dplyr, and tibble.

#' Get files associated with DOI
#'
#' Get information on the files associated with a DOI.
#'
#' @param x `character` Value containing a DOI.
#'
#' @return
#' A `tibble::tibble()` containing the filename and download URL
#' for each file associated with the DOI. Note that rows are ordered
#' such that the first row corresponds to the oldest record.
#'
#' @examples
#' get_doi_files("https://doi.org/10.5281/zenodo.6622038")
#'
#' @noRd
get_doi_files <- function(x) {
  # assert valid argument
  assertthat::assert_that(
    assertthat::is.string(x),
    assertthat::noNA(x)
  )
  assertthat::assert_that(
    startsWith(x, "https://doi.org/"),
    msg = "argument to x is not a DOI."
  )
  assertthat::assert_that(
    grepl("/zenodo.", x, fixed = TRUE),
    msg = "argument to x is not a Zenodo DOI."
  )

  # scrape html file
  d <- rvest::read_html(x)

  # find file container
  file_div <- rvest::html_elements(d, css = ".files-box")

  # find table in file container
  file_table <- rvest::html_element(file_div, "table")
  file_rows <- rvest::html_children(rvest::html_element(file_table, "tbody"))
  file_info <- rvest::html_element(file_rows, "a")

  # extract file details
  file_names <- rvest::html_text(file_info)
  file_urls <- paste0(
    "https://zenodo.org", rvest::html_attr(file_info, "href")
  )

  # return result
  tibble::tibble(filename = file_names, download = file_urls)
}

#' Get DOI versions
#'
#' Get all the DOI versions associated with a given DOI.
#'
#' @param x `character` Value containing a DOI.
#'
#' @return A `tibble::tibble()` with the version, creation date, and DOI
#' of each version.
#'
#' @examples
#' get_doi_versions("https://doi.org/10.5281/zenodo.6622038")
#'
#' @noRd
get_doi_versions <- function(x) {
  # assert valid argument
  assertthat::assert_that(
    assertthat::is.string(x),
    assertthat::noNA(x)
  )
  assertthat::assert_that(
    startsWith(x, "https://doi.org/"),
    msg = "argument to x is not a DOI."
  )
  assertthat::assert_that(
    grepl("/zenodo.", x, fixed = TRUE),
    msg = "argument to x is not a Zenodo DOI."
  )

  # scrape html file
  d <- rvest::read_html(x)

  # find metadata containers
  metadata_divs <- rvest::html_elements(d, css = ".metadata")

  # extract div containing version numbers
  is_version_div <- vapply(metadata_divs, FUN.VALUE = logical(1), function(x) {
    h <- rvest::html_elements(x, css = "h4")
    if (length(h) == 0) return(FALSE)
    h <- h[[1]]
    identical(rvest::html_text(h), "Versions")
  })

  # return input doi if it's not associated with any versions
  if (!any(is_version_div)) {
    d <- tibble::tibble(
      version = NA_character_,
      created = as.POSIXct(NA_real_),
      doi = gsub("https://doi.org/", "", x, fixed = TRUE)
    )
    return(d)
  }

  # extract div containing version numbers
  version_div <- metadata_divs[[which(is_version_div)[[1]]]]

  # extract version table
  version_table <- rvest::html_element(version_div, "table")
  version_rows <- rvest::html_children(version_table)

  # parse information for each version
  info <- lapply(version_rows, function(x) {
    tibble::tibble(
      version = trimws(gsub(
        "Version ", "", fixed = TRUE,
        rvest::html_text(rvest::html_element(x, "a"))
      )),
      created = as.POSIXct(
        trimws(rvest::html_text(
          rvest::html_element(rvest::html_children(x)[[2]], "small")
        )),
        format = "%b %e, %Y"
      ),
      doi = trimws(rvest::html_text(rvest::html_element(x, "small")))
    )
  })

  # compile table
  info <- dplyr::bind_rows(info)

  # return result (reverse row ordering for compatibility with zen4R)
  info[rev(seq_len(nrow(info))), , drop = FALSE]
}

sigven commented 1 year ago

Wow, awesome! Will give this a try:D

jeffreyhanson commented 1 year ago

Awesome - thanks! Let me know if you encounter any issues. If you need a leaner implementation, it might be possible to replace (1) tibbles with data.frames and (2) bind_rows(x) with do.call(rbind, x) - but I (personally) prefer tibbles over data.frames.
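
The leaner variant mentioned above could look roughly like the following. This is only a sketch in base R: the `rows` list holds made-up placeholder values standing in for the per-version results built by lapply(), not real Zenodo output.

```r
# Placeholder per-version results (hypothetical values, for illustration only)
rows <- list(
  data.frame(version = "1.0", doi = "placeholder-doi-1", stringsAsFactors = FALSE),
  data.frame(version = "1.1", doi = "placeholder-doi-2", stringsAsFactors = FALSE)
)

# base R counterpart of dplyr::bind_rows(rows); all elements must share columns
info <- do.call(rbind, rows)

# reverse the row order, as get_doi_versions() does before returning
info <- info[rev(seq_len(nrow(info))), , drop = FALSE]
print(info)
```

One difference to keep in mind: rbind() stops with an error if the data.frames have different column names, whereas bind_rows() fills the missing columns with NA.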

jeffreyhanson commented 1 year ago

Also, it's worth noting that I would consider these scraping functions to be a stopgap measure until the API issues are resolved. This is because changes to the Zenodo website could invalidate the functions, resulting in errors or incorrect outputs :)
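
One way to treat the scraper as a stopgap is to call it only when the API route fails. The sketch below assumes a small helper, `with_fallback()`, which is a hypothetical name and not part of zen4R or rvest; the two functions passed to it are stand-ins so the example runs without network access.

```r
# Try `primary()`; if it signals an error, warn and use `fallback()` instead.
# Both arguments are zero-argument functions supplied by the caller.
with_fallback <- function(primary, fallback) {
  tryCatch(
    primary(),
    error = function(e) {
      warning("primary source failed: ", conditionMessage(e), call. = FALSE)
      fallback()
    }
  )
}

# Stand-ins: an API call that fails and a scraper that succeeds
api_call <- function() stop("Internal Server Error")
scraper <- function() data.frame(filename = "data.zip", stringsAsFactors = FALSE)

res <- suppressWarnings(with_fallback(api_call, scraper))
print(res)
```

In real use, `primary` might wrap the getRecordByDOI call and `fallback` might wrap get_doi_files(). The two return differently shaped objects, so downstream code would still need to handle both.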

jeffreyhanson commented 1 year ago

@sigven, just to let you know, I realized that the code I posted gave inconsistent behavior for the get_doi_versions() function if the Zenodo repository type doesn't support multiple versions (i.e. it returned a character vector and not a tibble) - so I've edited the code in the original post to address this.

sigven commented 1 year ago

Great, thanks for notifying.

eblondel commented 1 year ago

@sigven @jeffreyhanson @ablaette sorry for the late answer, I'm back from annual leave. I would welcome your tests again, as I suspect a temporary issue with the Zenodo platform. I've just tried the code below with zen4R 0.7, and everything worked:

zen = ZenodoManager$new()
zen$getRecordByDOI("10.5281/zenodo.6828501")

Thanks in advance

nehamoopen commented 1 year ago

I've been running into the same issue:

This is the code I used (without a token):

library(zen4R)
zenodo <- ZenodoManager$new(logger = "INFO")
record <- zenodo$getRecordByDOI("10.5281/zenodo.3332807") 

And I get the following output:

[zen4R][INFO] ZenodoRequest - Fetching https://zenodo.org/api/records/?q=doi:10.5281//zenodo.3332807&size=10&page=1&all_versions=1 
[zen4R][ERROR] ZenodoRequest - Error while executing request 'records/?q=doi:10.5281//zenodo.3332807&size=10&page=1&all_versions=1' 
[zen4R][ERROR] ZenodoManager - Error while fetching published records: Internal Server Error 
Error: $ operator is invalid for atomic vectors

I imagine it has to do with issue #84 as well. Just adding my experience to the thread, so there's a record of it. Hopefully it can get fixed soon! Let me know if there are any other tests I should do.
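
The final error in the log is consistent with a failed request leaving a plain character value where a list-like record was expected, although the log alone does not prove this. The sketch below reproduces the same R error in isolation; `result` is a hypothetical stand-in, not zen4R's actual internal object.

```r
# Hypothetical stand-in for what a failed request might leave behind
result <- "Internal Server Error"

# `$` works on recursive objects (lists, environments), not on atomic vectors,
# so this call fails; tryCatch() captures the error message instead of stopping
msg <- tryCatch(result$doi, error = function(e) conditionMessage(e))
print(msg)

# A defensive check before using `$`
doi <- if (is.list(result)) result$doi else NA_character_
print(doi)
```

Checking the type of the response before using `$` turns the opaque error into a missing value that calling code can test for.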

eblondel commented 1 year ago

Removing the specific user agent (see https://github.com/eblondel/zen4R/commit/f95568711590abbf7c80a1e4cbb4c2204b1d7211 ) seems to fix the issue. @sigven @jeffreyhanson @nehamoopen @ablaette can you reinstall and try again?

eblondel commented 1 year ago

Relates to https://github.com/eblondel/zen4R/issues/103

eblondel commented 1 year ago

Fixed with #103

eblondel commented 1 year ago

Actually solved in the pending GitHub revision, done with https://github.com/eblondel/zen4R/issues/106