datacamp / r-package-parser

R Package to parse R documentation files for RDocumentation

get_description() and parse_description() assume native encoding #13

Open bastistician opened 6 years ago

bastistician commented 6 years ago

https://github.com/datacamp/r-package-parser/blob/cb48a0368626a6f2d3ce66020e7a270d2775e2d4/R/processing.R#L36

Calling parse() on the text from Authors@R fails if that field contains non-ASCII characters and the DESCRIPTION file is not in the native encoding of the system processing the package (UTF-8 here). Typical examples are "latin1" packages with accented characters in author names, e.g.:

res <- process_package("https://cran.r-project.org/src/contrib/flexrsurv_1.4.1.tar.gz", "flexrsurv", "cran")
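To illustrate the underlying problem without downloading a package, here is a minimal sketch. The author name "Jérôme" is a made-up example, not taken from flexrsurv. It shows that latin1 bytes read without an encoding mark are not valid UTF-8, and that declaring the encoding lets R translate them correctly.

```r
# Bytes of "Jérôme" as they would be stored in a latin1 DESCRIPTION file
# (hypothetical author name, used only for illustration).
x <- rawToChar(as.raw(c(0x4a, 0xe9, 0x72, 0xf4, 0x6d, 0x65)))

# The string carries no encoding mark, so R treats it as native encoding.
Encoding(x)

# Interpreted as UTF-8, these bytes are malformed.
validUTF8(x)

# Declaring the real encoding allows a correct translation to UTF-8.
Encoding(x) <- "latin1"
utf8 <- enc2utf8(x)
validUTF8(utf8)
```

On a UTF-8 system, the unmarked string is what ends up being passed to parse(), which is presumably why the Authors@R field cannot be processed.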

Proper handling of package descriptions is provided by the desc package. However, a simple fix that supports latin1 packages in addition to UTF-8 is to mark the Encoding() of the fields in get_description(), as utils:::.read_description() does:

get_description <- function(pkg_folder) {
  desc_path <- file.path(pkg_folder, "DESCRIPTION")
  out <- read.dcf(desc_path)[1, ]
  # out["Encoding"] is NA if the field is absent; out[["Encoding"]] would error
  if (identical(unname(out["Encoding"]), "latin1")) {
    Encoding(out) <- "latin1"
  }
  as.list(out)
}

This might fix https://github.com/datacamp/RDocumentation-app/issues/386.

filipsch commented 6 years ago

@WastlM interesting! Thanks for the digging and the pointer. @ludov04 or I will have a look asap, hopefully it solves the parsing issues!

bastistician commented 6 years ago

Here's a better fix, which converts the input (regardless of its encoding) to the native encoding in case the DESCRIPTION has an Encoding field:

get_description <- function(pkg_folder) {
  desc_path <- file.path(pkg_folder, "DESCRIPTION")
  out <- read.dcf(desc_path)[1L, ]
  if ("Encoding" %in% names(out)) {
    Encoding(out) <- out[["Encoding"]]
    out <- enc2native(out)
  }
  as.list(out)
}

Conversion to the native encoding ensures that the subsequent parsing and evaluation of the Authors@R field works on the system which runs the code. I think that's the way to go!
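One caveat: `Encoding<-` only recognises "latin1", "UTF-8" and "bytes", and silently treats any other value as "unknown". The following is a hedged sketch of a variant that uses iconv() with the declared encoding instead. The function name `get_description_utf8` and the throwaway package are hypothetical, and converting to UTF-8 rather than the native encoding is an assumption made here so that the result does not depend on the locale.

```r
# Hypothetical variant of get_description() that always returns UTF-8.
get_description_utf8 <- function(pkg_folder) {
  desc_path <- file.path(pkg_folder, "DESCRIPTION")
  out <- read.dcf(desc_path)[1L, ]
  enc <- unname(out["Encoding"])  # NA when the field is absent
  if (!is.na(enc)) {
    # iconv() keeps the field names and marks converted strings as UTF-8
    out <- iconv(out, from = enc, to = "UTF-8")
  }
  as.list(out)
}

# Build a throwaway latin1 DESCRIPTION byte by byte (made-up package).
pkg <- tempfile("pkg")
dir.create(pkg)
writeBin(
  c(charToRaw("Package: demo\nVersion: 0.1\nAuthor: J"), as.raw(0xe9),
    charToRaw("r"), as.raw(0xf4), charToRaw("me\nEncoding: latin1\n")),
  file.path(pkg, "DESCRIPTION")
)

desc <- get_description_utf8(pkg)
desc$Author
```

If the native encoding is required downstream, as in the fix above, `enc2native()` could be applied to the result afterwards.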