futureverse / marshal

[PROTOTYPE] R package: marshal - Framework to Marshal Objects to be Used in Another R Process
https://marshal.futureverse.org/

tibble: A `tbl` may contain an external pointer via attribute `problems` set by readr #9

Open HenrikBengtsson opened 1 year ago

HenrikBengtsson commented 1 year ago

A tbl may carry an external pointer in its problems attribute, which cannot be passed as-is to another R process, e.g.

spc_tbl_ [25,000 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Index        : num [1:25000] 1 2 3 4 5 6 7 8 9 10 ...
 $ Height_Inches: num [1:25000] 65.8 71.5 69.4 68.2 67.8 ...
 $ Weight_Pounds: num [1:25000] 113 136 153 142 144 ...
 - attr(*, "spec")=
  .. cols(
  ..   Index = col_double(),
  ..   Height_Inches = col_double(),
  ..   Weight_Pounds = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

HenrikBengtsson commented 10 months ago

It's actually the readr package that adds the problems attribute. From help("problems", package = "readr"):

"Readr functions will only throw an error if parsing fails in an unrecoverable way. However, there are lots of potential problems that you might want to know about - these are stored in the problems attribute of the output ..."

HenrikBengtsson commented 10 months ago

marshal() on a tbl object could simply drop the problems attribute.
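
A rough sketch of what "simply drop" could look like, using base R only. The helper name drop_problems() is hypothetical and not part of marshal, and methods::new("externalptr") is used here only to fabricate a stand-in (null) pointer for illustration; a real readr/vroom object would carry a live one.

```r
## Hypothetical helper (not part of marshal): remove the `problems`
## attribute unconditionally and return the rest of the object as-is.
drop_problems <- function(x) {
  attr(x, "problems") <- NULL
  x
}

## Stand-in for a readr/vroom result: a data frame whose `problems`
## attribute is an external pointer (fabricated for illustration only)
x <- structure(
  data.frame(a = 1:3),
  problems = methods::new("externalptr")
)

y <- drop_problems(x)
```

The obvious trade-off is that any recorded parsing problems are lost on the receiving side.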

HenrikBengtsson commented 10 months ago

"marshal() on a tbl object could simply drop the problems attribute."

Ah, the problems attribute may also contain non-pointer objects, so we don't always have to drop it. For example,

> x <- parse_integer(c("1X", "blah", "3"))
Warning: 2 parsing failures.
row col               expected actual
  1  -- no trailing characters   1X  
  2  -- no trailing characters   blah

> str(x)
 int [1:3] NA NA 3
 - attr(*, "problems")= tibble [2 × 4] (S3: tbl_df/tbl/data.frame)
  ..$ row     : int [1:2] 1 2
  ..$ col     : int [1:2] NA NA
  ..$ expected: chr [1:2] "no trailing characters" "no trailing characters"
  ..$ actual  : chr [1:2] "1X" "blah"
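
Since the attribute can be either a pointer or a regular tibble, a check along these lines could decide whether anything needs to be done at all. This is only a sketch: pointer_attributes() is a hypothetical helper name, the plain data frame stands in for the tibble shown above, and methods::new("externalptr") fabricates a null pointer purely for illustration.

```r
## Hypothetical helper: names of all attributes of `x` that are
## external pointers, and therefore not usable in another R process.
pointer_attributes <- function(x) {
  attrs <- attributes(x)
  is_ptr <- vapply(attrs, function(a) typeof(a) == "externalptr", logical(1))
  as.character(names(attrs)[is_ptr])
}

## Case 1: `problems` is a regular data frame, as in the parse_integer()
## example; nothing needs to be dropped or materialized
ok <- structure(c(NA, NA, 3L), problems = data.frame(row = 1:2))

## Case 2: `problems` is an external pointer (fabricated stand-in)
bad <- structure(data.frame(a = 1), problems = methods::new("externalptr"))

res_ok <- pointer_attributes(ok)
res_bad <- pointer_attributes(bad)
```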

More clues about alternatives can be found in:

readr:::problems
function (x = .Last.value) 
{
    problems <- probs(x)
    if (is.null(problems)) {
        return(invisible(no_problems))
    }
    if (inherits(problems, "tbl_df")) {
        return(problems)
    }
    vroom::problems(x)
}

So, it looks like vroom might be involved too:

> vroom::problems
function (x = .Last.value, lazy = FALSE) 
{
    if (!inherits(x, "tbl_df")) {
        cli::cli_abort(c("The {.arg x} argument of {.fun vroom::problems} must be a data frame created by vroom:", 
            x = "{.arg x} has class {.cls {class(x)}}"))
    }
    if (!isTRUE(lazy)) {
        vroom_materialize(x, replace = FALSE)
    }
    probs <- attr(x, "problems")
    if (typeof(probs) != "externalptr") {
        cli::cli_abort(c("The {.arg x} argument of {.fun vroom::problems} must be a data frame created by vroom:", 
            x = "{.arg x} seems to have been created with something else, maybe readr?"))
    }
    probs <- vroom_errors_(probs)
    probs <- probs[!duplicated(probs), ]
    probs <- probs[order(probs$file, probs$row, probs$col), ]
    tibble::as_tibble(probs)
}
<environment: namespace:vroom>

HenrikBengtsson commented 10 months ago

From the above, marshalling of tbl_df (sic!) could rely on the following "pruning" method:

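prune.tbl_df <- function(x, ...) {
  problems <- attr(x, "problems", exact = TRUE)

  ## Materialize `problems` stored elsewhere in this process?
  ## An external pointer is only valid in the current R process, so
  ## replace it with the tibble returned by vroom::problems()
  if (typeof(problems) == "externalptr") {
    problems <- vroom::problems(x)
    attr(x, "problems") <- problems
  }

  x
}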

Comment: We could use NextMethod("prune") at the end.

Comment 2: We've punted on the idea of having prune() methods thus far, but maybe this is an argument for having them. Maybe it should be named something other than "prune", because pruning could also mean "drop unnecessary content".
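
For the method above to be dispatched, a generic and a default method would be needed as well. A minimal, self-contained sketch follows, under loud assumptions: marshal has no prune() generic today, the `materialize` argument is a hypothetical addition so the example runs without vroom installed, and falling back to dropping the attribute is just one possible policy. The pointer is a fabricated null pointer from methods::new("externalptr").

```r
## Hypothetical generic and default method (not part of marshal)
prune <- function(x, ...) UseMethod("prune")
prune.default <- function(x, ...) x

## Variant of the tbl_df method with a pluggable materializer.
## Assumption: when no materializer is given, the pointer is dropped.
prune.tbl_df <- function(x, ..., materialize = NULL) {
  problems <- attr(x, "problems", exact = TRUE)
  if (typeof(problems) == "externalptr") {
    attr(x, "problems") <- if (is.null(materialize)) NULL else materialize(x)
  }
  x
}

## Stand-in for a vroom-created tibble (class vector set by hand)
x <- structure(
  data.frame(a = 1:3),
  class = c("tbl_df", "tbl", "data.frame"),
  problems = methods::new("externalptr")
)

## Policy 1: drop the pointer
dropped <- prune(x)

## Policy 2: replace the pointer with a regular data frame
## (a dummy function stands in for the materializer here)
kept <- prune(x, materialize = function(x) data.frame(row = integer(0)))
```

With vroom available, passing `materialize = vroom::problems` would correspond to the method proposed above.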