Open chainsawriot opened 7 months ago
@schochastics @jsonbecker thoughts?
In my opinion, rio
's intent is to load data.frame like objects, not restore R sessions. Support for RDS is support for one way someone might save those kind of objects. I'd be comfortable allow-listing certain classes of objects on both write and load and specify that, possibly with a message
. If the intent is full session storage, this is not the right method/package.
I mention allow list because I think we'd want to include things like data.table
and tibble
objects.
This would be a breaking change, potentially, and may warrant a major version number update if we wanted to be closer to semver.
I just made https://github.com/hrbrmstr/rdaradar and I think doing some quarantining of R data files in the rio load process to then pick out "data frame-like objects" might be the way to go
I just made hrbrmstr/rdaradar and I think doing some quarantining of R data files in the rio load process to then pick out "data frame-like objects" might be the way to go
So in this case, we'd:
new.env()
Are there any concerns about how R makes it... incredibly simple to just declare you're one class (along with others)? Put another way, should we restrict certain classes explicitly as well, regardless of whether they also declare they are a data.frame-alike?
Thank you all for the feedback.
Specifically, this is the line that makes exceptions for the serialization formats.
https://github.com/gesistsa/rio/blob/c529994b479f2014e64d9dc4d4c21f4032c48ca1/R/import.R#L146-L148
And for other formats, rio will force the output to be data.frame. So anything that cannot be converted to data frame would produce an error.
SEE BELOW
https://github.com/gesistsa/rio/blob/c529994b479f2014e64d9dc4d4c21f4032c48ca1/R/set_class.R#L43-L52
We do not have tests for that, I think for format such as R
(dput()
, dget()
) and dump
, it's possible.
Let me give it a test.
I was wrong. Data will be forced to data frame anyway.
tmp <- tempfile(fileext = ".R")
dput(LETTERS, tmp)
rio::import(tmp)
#> NULL
#> <0 rows> (or 0-length row.names)
tmp <- tempfile(fileext = ".R")
dput(iris[1:2,], tmp)
rio::import(tmp)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
tmp <- tempfile(fileext = ".dump")
dump(list = as.character(substitute(LETTERS)), file = tmp)
rio::import(tmp)
#> NULL
#> <0 rows> (or 0-length row.names)
Created on 2024-05-01 with reprex v2.1.0
And of course, I have to pwn myself.
tmp <- tempfile(fileext = ".Rdata")
delayedAssign("x", system("gnome-calculator"))
save(x, eval.promises = FALSE, file = tmp)
rio::import(tmp)
@hrbrmstr @jsonbecker Maybe I've missed something, but it doesn't appear to be safe.
tmp <- tempfile(fileext = ".Rdata")
delayedAssign("x", system("gnome-calculator"))
save(x, eval.promises = FALSE, file = tmp)
quarantine <- new.env(parent = emptyenv())
load(file = tmp, envir = quarantine) ## not pwned yet
## load(file = tmp, envir = quarantine); print(ls.str(quarantine, all.names = TRUE), max.level = 999, give.attr = TRUE) ## pwned
## load(file = tmp, envir = quarantine); inherits(quarantine[["x"]], "data.frame") ## pwned
## load(file = tmp, envir = quarantine); body(quarantine[["x"]]) ## pwned
## load(file = tmp, envir = quarantine); class(quarantine[["x"]]) ## pwned
traversc/qs#93 patched upstream.
If you want to ensure that a loaded object contains no PROMSXPs I think the best way would be to recursively inspect it using the C API. There may be a pure R way to do it but using C is more straightforward.
@traversc Yeah, I think this is probably the only viable solution. But I also think that it is not a thing this package should do. Maybe it should be the task of R itself (as I said in HenrikBengtsson/Wishlist-for-R#162) or a new package with a promise-free version of load()
.
@chainsawriot I think that’s right.
Another thought that may be “worthwhile” — maybe loading an RDS requires a new flag called something like “trust” — that defaults to FALSE and provides a message like “Deserializing RDS may lead to arbitrary code execution. You should only load RDS files you produced or trust. Set trust=TRUE
to successfully load the data.”
@jsonbecker That's a nice idea to nudge users into not using those serialization formats (even non-binary ones like .R
and .dump
) for I/O-ing data frames. At the same time, it's a breaking change. So, we should consider this in rio
2.x.
Python's pickle
has this warning in the doc. We could recommend users using other formats. It circles back to the discussion of adding an open (but at the same time, safe) binary format as the default for storing data frame #315 . Now the "default" tier of our supported formats is in the sorry state of either proprietary (excel spss stata) or unsafe serialization formats.
@jsonbecker That's a nice idea to nudge users into not using those serialization formats (even non-binary ones like
.R
and.dump
) for I/O-ing data frames. At the same time, it's a breaking change. So, we should consider this inrio
2.x.Python's
pickle
has this warning in the doc. We could recommend users using other formats. It circles back to the discussion of adding an open (but at the same time, safe) binary format as the default for storing data frame #315 . Now the "default" tier of our supported formats is in the sorry state of either proprietary (excel spss stata) or unsafe serialization formats.
Yup, I thought about that from a semver perspective. Another two choices here are:
true
right now, with a message whenever trust
is a missing argument that rio
2.x will change this to default to false
and encourage that if you are continuing to use this feature in your code, you specify the trust
flag to ensure future compatibility.Even with this d63ccb5e9a7c760516aefc1e35db3aaa96b48719 one can still be pwned with import_list()
. The trust mechanism is not implemented in import_list()
.
tmp <- tempfile(fileext = ".Rdata")
delayedAssign("x", system("gnome-calculator"))
save(x, eval.promises = FALSE, file = tmp)
import_list(tmp, trust = FALSE)
CVE-2024-27322 is partially fixed in R 4.4.0. But the attack surface is still there. First, this package supports R > 3.6 therefore the partial fix in 4.4.0 is not applied in many supported versions. Second, even with 4.4.0 deserialization of
.Rdata
and.RDS
in some cases can still invoke arbitrary code execution. See this message by gws.We can't be too nanny but should we rethink the "any R object" policy of
.Rdata
https://github.com/gesistsa/rio/blob/c529994b479f2014e64d9dc4d4c21f4032c48ca1/R/import.R#L31
.RDS
https://github.com/gesistsa/rio/blob/c529994b479f2014e64d9dc4d4c21f4032c48ca1/R/import.R#L32
and
qs
https://github.com/gesistsa/rio/blob/c529994b479f2014e64d9dc4d4c21f4032c48ca1/R/import.R#L33-L35
There are several options: