galaxyproject / tools-iuc

Tool Shed repositories maintained by the Intergalactic Utilities Commission
https://galaxyproject.org/iuc
MIT License

How do we handle (insecure) RData objects as input to tools #3921

Open bgruening opened 3 years ago

bgruening commented 3 years ago

Hi IUC. Could you please have a look at https://github.com/galaxyproject/tools-iuc/pull/3859

It's the old discussion about the security of RData objects: do we allow RData as input datasets in IUC tools? In the worst case, a hand-crafted RData object could be uploaded into Galaxy and do some (unclear what) harm.

related: https://github.com/galaxyproject/galaxy/pull/12336 xref: https://github.com/usegalaxy-eu/usegalaxy-eu-tools/pull/411#issuecomment-912474989

Side note: we have the same problem with Python objects, for example with Machine Learning models. We put a lot of effort into allow-listing functions and objects and skipping everything else to make it secure.
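
To illustrate the allow-listing idea on the Python side, here is a minimal sketch using the documented find_class hook of pickle.Unpickler; the allow-list entries and function names are illustrative, not what the Galaxy ML tools actually ship:

```python
import io
import pickle

# Illustrative allow-list; a real one would enumerate exactly the
# classes a given model type legitimately needs.
ALLOWED = {
    ("builtins", "list"),
    ("collections", "OrderedDict"),
}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Only resolve globals that are explicitly allow-listed;
        # everything else (os.system, subprocess.Popen, ...) is rejected.
        if (module, name) in ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"{module}.{name} is not allow-listed")

def restricted_loads(data: bytes):
    return RestrictedUnpickler(io.BytesIO(data)).load()
```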

A few thoughts from my side.

bgruening commented 3 years ago

ping @galaxyproject/iuc @jmchilton @natefoo @gmauro

bernt-matthias commented 3 years ago

What I'm missing here is a reference or an example of how to do harm with RData objects, in particular for the case where you load one and assign it to a single variable (the case where you load an environment might also be interesting).

bgruening commented 3 years ago

@bernt-matthias I don't have a good reference at hand. Maybe @blankenberg has one? Maybe this one: https://www.refsmmat.com/courses/751/notes/data-storage.html

My understanding is that RDS/RData is a serialization of your data and objects, pretty much like Python pickles: 1) it's not guaranteed that you can read the data across different R/Python versions; 2) you load objects that can execute insecure code.
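
The pickle half of that analogy is easy to demonstrate: deserializing alone can execute arbitrary code. A minimal Python sketch (whether plain readRDS allows the same thing is exactly the open question here):

```python
import os
import pickle

class Payload:
    # pickle calls __reduce__ to learn how to rebuild the object;
    # a crafted file can turn that rebuild step into any function call.
    def __reduce__(self):
        return (os.system, ("echo code ran during deserialization",))

blob = pickle.dumps(Payload())
pickle.loads(blob)  # merely loading the blob runs the command
```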

bernt-matthias commented 3 years ago

I see the disadvantages (portability, readability, ...), but I never understood the security problem if saveRDS and readRDS are used.

hexylena commented 3 years ago

I discussed this the other day on help.galaxyproject.org with @sveinugu; his solution looked like this: https://help.galaxyproject.org/t/possible-to-maintain-a-galaxy-server-specific-secret-key-across-tools-and-job-invocations/6234/20

Actually, since app is available, I believe I don’t even need to get the SECRET_KEY extracted at all. Instead, I could just call app.security.encode_id(myfilehash) in the wrapper of tool A and store this in the output dataset somehow (e.g. in a YAML file), and in the tool B wrapper, I could extract this encoded string from the input dataset and call app.security.decode_id(myencodedfilehash) to get the original file hash, which I can then use to validate the file contents. Would that not be both secure and transparent enough? It would also not require any changes to the dependencies or the command line, nor would I need to store anything on the server.

We could consider a solution like this and then feel safer about the security of those objects, since we could prove that the dataset has not been modified since it left one of the 'safe' tools. You couldn't upload one of these datasets and process it, but I think the tradeoffs are acceptable enough?

This would let us side-step a lot of the issues around the security of these datatypes.

Perhaps that could even be a Galaxy feature, something like "assert this dataset came from this tool ID on this server", to allow-list the specific tools that may create such datasets.
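
A minimal sketch of how such a check could work, assuming a server-side secret; sign_dataset/verify_dataset and the key handling are hypothetical, not existing Galaxy API:

```python
import hashlib
import hmac

SECRET_KEY = b"galaxy-server-secret"  # assumption: reachable only by trusted tools

def _file_hash(path: str) -> str:
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()

def sign_dataset(path: str, tool_id: str) -> str:
    # Bind the signature to both the file contents and the producing tool.
    msg = f"{tool_id}:{_file_hash(path)}".encode()
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()

def verify_dataset(path: str, tool_id: str, signature: str) -> bool:
    # Constant-time comparison; fails if the file was modified or was
    # produced by a tool outside the allow-list.
    return hmac.compare_digest(sign_dataset(path, tool_id), signature)
```

Tool A would emit sign_dataset(...) next to its output; tool B would call verify_dataset(...) before touching the RData file.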

mvdbeek commented 3 years ago

we could add a new tag into the tool, e.g. <tool .... potentially_insecure_data_handling="true"> which forces Galaxy (if configured) to use containers

  • no clue how this would work with destination configurations
  • no clue if this would be secure enough, but at least we could indicate this to the Galaxy admin in a structured way

We certainly need this, it came up before, and it is something we can do without much headache. In a sense ITs work like that, right?

bgruening commented 3 years ago

We certainly need this, it came up before, and it is something we can do without much headache. In a sense ITs work like that, right?

Yes, ITs have much more magic: they are annotated and have a container explicitly defined. But in a way it is comparable. Any suggestion for such an option to make it more generic? containers="recommended" instead of true? Or really something like a security level?

bgruening commented 3 years ago

We could consider a solution like this and then feel safer about the security of those objects, since we could prove that the dataset has not been modified since it left one of the 'safe' tools. You couldn't upload one of these datasets and process it, but I think the tradeoffs are acceptable enough?

I think the last time we discussed encoding, or better signing, datasets, we didn't like that you cannot share the history afterwards with other Galaxy instances. Or you might have problems if you export the history? Signing datasets might be interesting, but it involves a huge subsystem of its own (which we might need either way for the sensitive-data use case).

hexylena commented 3 years ago

you cannot share the history afterwards with other Galaxy instances.

You could, you'd just have to re-run the relevant portions on that server. And exporting would include all the datasets, so it should be fine? Just re-run all and you're good to go.

I think @sveinugu will implement it pretty simply without a huge subsystem, just by signing a hash of the dataset, and then checking that that dataset hash still verifies before execution, which seems pretty lightweight?

But yeah, fair point: if you want properly signed datasets for sensitive use cases, then it'd definitely need a bigger system.

mvdbeek commented 3 years ago

I think @sveinugu will implement it pretty simply without a huge subsystem, just by signing a hash of the dataset, and then checking that that dataset hash still verifies before execution, which seems pretty lightweight?

Are we talking about the dataset hash sums? Those are only calculated for a small subset of datasets (I think only uploads at this point?), but that would be something to build on if it fits the bill. Reaching into app is a pretty good sign that we should not be doing or recommending something like that.

hexylena commented 3 years ago

No, not the built-in hashes. (But we could build on those in the future.)

His plan was:

Reaching into app is a pretty good sign

agreed! I'd love a more generic mechanism for proving that a dataset has been untampered with since coming from another tool.

bernt-matthias commented 1 year ago

This just popped up in an IWC meeting again.

I think in an ideal world it would be nice if we could avoid RData objects.

In my opinion tools should use RDS, which serializes a single R object that is assigned to a single variable when loaded (in contrast to RData, where load() injects a complete environment).

For tools using RData, containerization might be a solution.

it's not guaranteed that you can read the data across different R/Python versions

Both datatypes are versioned, and the version is stored in the dataset's metadata. So it is possible to load only those RDS/RData files that can be read by the R version used in a tool.
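
As a rough sketch of what is recoverable (header layout as described in the R Internals manual under "Serialization Formats"; how Galaxy stores this in datatype metadata differs), the version fields can be read directly from an RDS file:

```python
import gzip
import struct

def rds_header_versions(path):
    # saveRDS writes gzip-compressed XDR by default; sniff the gzip magic.
    with open(path, "rb") as fh:
        opener = gzip.open if fh.read(2) == b"\x1f\x8b" else open
    with opener(path, "rb") as fh:
        if fh.read(2) != b"X\n":  # "X\n" marks the XDR serialization format
            raise ValueError("not an XDR-serialized RDS file")
        # Three big-endian 32-bit ints follow: format version, writer's
        # R version, and the minimum R version required to read the file.
        fmt, writer, min_reader = struct.unpack(">III", fh.read(12))

        def as_triple(v):
            # R packs versions as major*65536 + minor*256 + patch.
            return (v >> 16, (v >> 8) & 0xFF, v & 0xFF)

        return fmt, as_triple(writer), as_triple(min_reader)

# e.g. (3, (4, 3, 1), (3, 5, 0)) for a format-3 file written by R 4.3.1
```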