gesistsa / rio

🐟 A Swiss-Army Knife for Data I/O
http://gesistsa.github.io/rio/
604 stars 76 forks source link

CVE-2024-27322 and this package #406

Open chainsawriot opened 7 months ago

chainsawriot commented 7 months ago

CVE-2024-27322 is partially fixed in R 4.4.0. But the attack surface is still there. First, this package supports R > 3.6 therefore the partial fix in 4.4.0 is not applied in many supported versions. Second, even with 4.4.0 deserialization of .Rdata and .RDS in some cases can still invoke arbitrary code execution. See this message by gws.

We can't be too nanny but should we rethink the "any R object" policy of .Rdata

https://github.com/gesistsa/rio/blob/c529994b479f2014e64d9dc4d4c21f4032c48ca1/R/import.R#L31

.RDS

https://github.com/gesistsa/rio/blob/c529994b479f2014e64d9dc4d4c21f4032c48ca1/R/import.R#L32

and qs

https://github.com/gesistsa/rio/blob/c529994b479f2014e64d9dc4d4c21f4032c48ca1/R/import.R#L33-L35

There are several options:

  1. Warn about non data frame object
  2. Completely forbid non data frame object
  3. Only forbid Promise
  4. Just warn in the doc
  5. Don't be nanny
chainsawriot commented 7 months ago

@schochastics @jsonbecker thoughts?

jsonbecker commented 7 months ago

In my opinion, rio's intent is to load data.frame like objects, not restore R sessions. Support for RDS is support for one way someone might save those kind of objects. I'd be comfortable allow-listing certain classes of objects on both write and load and specify that, possibly with a message. If the intent is full session storage, this is not the right method/package.

I mention allow list because I think we'd want to include things like data.table and tibble objects.

This would be a breaking change, potentially, and may warrant a major version number update if we wanted to be closer to semver.

hrbrmstr commented 7 months ago

I just made https://github.com/hrbrmstr/rdaradar and I think doing some quarantining of R data files in the rio load process to then pick out "data frame-like objects" might be the way to go

jsonbecker commented 7 months ago

I just made hrbrmstr/rdaradar and I think doing some quarantining of R data files in the rio load process to then pick out "data frame-like objects" might be the way to go

So in this case, we'd:

  1. Load into a new.env()
  2. Filter the list of objects by class against an allow list of classes.
  3. Load the objects with the correct classes.
  4. Remove the quarantine environment in cleanup.

Are there any concerns about how R makes it... incredibly simple to just declare you're one class (along with others)? Put another way, should we restrict certain classes explicitly as well, regardless of whether they also declare they are a data.frame-alike?

chainsawriot commented 7 months ago

Thank you all for the feedback.

Specifically, this is the line that makes exceptions for the serialization formats.

https://github.com/gesistsa/rio/blob/c529994b479f2014e64d9dc4d4c21f4032c48ca1/R/import.R#L146-L148

And for other formats, rio will force the output to be data.frame. So anything that cannot be converted to data frame would produce an error.

SEE BELOW

https://github.com/gesistsa/rio/blob/c529994b479f2014e64d9dc4d4c21f4032c48ca1/R/set_class.R#L43-L52

We do not have tests for that, I think for format such as R (dput(), dget()) and dump, it's possible.

https://github.com/gesistsa/rio/blob/c529994b479f2014e64d9dc4d4c21f4032c48ca1/R/import_methods.R#L126-L145

Let me give it a test.

chainsawriot commented 7 months ago

I was wrong. Data will be forced to data frame anyway.

tmp <- tempfile(fileext = ".R")
dput(LETTERS, tmp)
rio::import(tmp)
#> NULL
#> <0 rows> (or 0-length row.names)

tmp <- tempfile(fileext = ".R")
dput(iris[1:2,], tmp)
rio::import(tmp)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa

tmp <- tempfile(fileext = ".dump")
dump(list = as.character(substitute(LETTERS)), file = tmp)
rio::import(tmp)
#> NULL
#> <0 rows> (or 0-length row.names)

Created on 2024-05-01 with reprex v2.1.0

chainsawriot commented 7 months ago

And of course, I have to pwn myself.

tmp <- tempfile(fileext = ".Rdata")
delayedAssign("x", system("gnome-calculator"))
save(x, eval.promises = FALSE, file = tmp)
rio::import(tmp)
chainsawriot commented 7 months ago

@hrbrmstr @jsonbecker Maybe I've missed something, but it doesn't appear to be safe.

tmp <- tempfile(fileext = ".Rdata")
delayedAssign("x", system("gnome-calculator"))
save(x, eval.promises = FALSE, file = tmp)

quarantine <- new.env(parent = emptyenv())
load(file = tmp,  envir = quarantine) ## not pwned yet

## load(file = tmp,  envir = quarantine); print(ls.str(quarantine, all.names = TRUE), max.level = 999, give.attr = TRUE) ## pwned
## load(file = tmp,  envir = quarantine); inherits(quarantine[["x"]], "data.frame") ## pwned
## load(file = tmp,  envir = quarantine); body(quarantine[["x"]]) ## pwned
## load(file = tmp,  envir = quarantine); class(quarantine[["x"]]) ## pwned
chainsawriot commented 7 months ago

traversc/qs#93 patched upstream.

traversc commented 7 months ago

If you want to ensure that a loaded object contains no PROMSXPs I think the best way would be to recursively inspect it using the C API. There may be a pure R way to do it but using C is more straightforward.

chainsawriot commented 7 months ago

@traversc Yeah, I think this is probably the only viable solution. But I also think that it is not a thing this package should do. Maybe it should be the task of R itself (as I said in HenrikBengtsson/Wishlist-for-R#162) or a new package with a promise-free version of load().

jsonbecker commented 7 months ago

@chainsawriot I think that’s right.

Another thought that may be “worthwhile” — maybe loading an RDS requires a new flag called something like “trust” — that defaults to FALSE and provides a message like “Deserializing RDS may lead to arbitrary code execution. You should only load RDS files you produced or trust. Set trust=TRUE to successfully load the data.”

chainsawriot commented 6 months ago

@jsonbecker That's a nice idea to nudge users into not using those serialization formats (even non-binary ones like .R and .dump) for I/O-ing data frames. At the same time, it's a breaking change. So, we should consider this in rio 2.x.

Python's pickle has this warning in the doc. We could recommend users using other formats. It circles back to the discussion of adding an open (but at the same time, safe) binary format as the default for storing data frame #315 . Now the "default" tier of our supported formats is in the sorry state of either proprietary (excel spss stata) or unsafe serialization formats.

jsonbecker commented 6 months ago

@jsonbecker That's a nice idea to nudge users into not using those serialization formats (even non-binary ones like .R and .dump) for I/O-ing data frames. At the same time, it's a breaking change. So, we should consider this in rio 2.x.

Python's pickle has this warning in the doc. We could recommend users using other formats. It circles back to the discussion of adding an open (but at the same time, safe) binary format as the default for storing data frame #315 . Now the "default" tier of our supported formats is in the sorry state of either proprietary (excel spss stata) or unsafe serialization formats.

Yup, I thought about that from a semver perspective. Another two choices here are:

  1. Deprecate this now, and commit to the functionality going away in 2.0. OR
  2. Implement the trust flag with a default to true right now, with a message whenever trust is a missing argument that rio 2.x will change this to default to false and encourage that if you are continuing to use this feature in your code, you specify the trust flag to ensure future compatibility.
chainsawriot commented 6 months ago

Even with this d63ccb5e9a7c760516aefc1e35db3aaa96b48719 one can still be pwned with import_list(). The trust mechanism is not implemented in import_list().

tmp <- tempfile(fileext = ".Rdata")
delayedAssign("x", system("gnome-calculator"))
save(x, eval.promises = FALSE, file = tmp)

import_list(tmp, trust = FALSE)

https://github.com/gesistsa/rio/blob/ca019c903d470f3a3e6a6b21a511e336812e18dd/R/import_list.R#L82-L90