Open bgruening opened 3 years ago
ping @galaxyproject/iuc @jmchilton @natefoo @gmauro
What I'm missing here is a reference or an example of how harm can be done with RData objects, in particular for the case where you load one and assign it to a single variable (though the case where you load a whole environment might also be interesting).
@bernt-matthias I don't have a good reference at hand. Maybe @blankenberg has? Maybe this one: https://www.refsmmat.com/courses/751/notes/data-storage.html
My understanding is that RDS/RData is a serialization of your data and objects, pretty much like pickles. 1) It's not guaranteed that you can read this data across different R/Python versions. 2) You load objects that can execute insecure code.
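To make the pickle analogy concrete, here is a minimal Python sketch (not Galaxy code) of point 2: deserializing attacker-controlled data can invoke an arbitrary callable chosen by the file's author. The `Evil` class and its payload are of course hypothetical.

```python
import pickle

class Evil:
    # __reduce__ tells pickle how to rebuild this object: "call this
    # callable with these arguments". Here it is a harmless print, but
    # it could just as well be os.system or any other callable.
    def __reduce__(self):
        return (print, ("arbitrary code ran during unpickling",))

payload = pickle.dumps(Evil())
obj = pickle.loads(payload)  # executes print(...) as a side effect of loading
```

Merely calling `loads` on the payload runs the embedded callable; no method of the resulting object ever needs to be invoked.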
I see the disadvantages (portability, readability, ...), but I never understood the security problems if saveRDS and readRDS are used.
I discussed this the other day on help.galaxyproject.org with @sveinugu, his solution looked like this: https://help.galaxyproject.org/t/possible-to-maintain-a-galaxy-server-specific-secret-key-across-tools-and-job-invocations/6234/20
Actually, since app is available, I believe I don’t even need to get the SECRET_KEY extracted at all. Instead, I could just call app.security.encode_id(myfilehash) in the wrapper of tool A and store this in the output dataset somehow (e.g. in a YAML file), and in the tool B wrapper, I could extract this encoded string from the input dataset and call app.security.decode_id(myencodedfilehash) to get the original file hash, which I can then use to validate the file contents. Would that not be both secure and transparent enough? It would also not require any changes to the dependencies or the command line, nor would I need to store anything on the server.
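As a stand-alone sketch of the same idea without reaching into Galaxy internals (the key value and both helper names are hypothetical; Galaxy's actual `app.security` API is not used here), tool A could sign a hash of the dataset with a server secret and tool B could verify it before loading:

```python
import hashlib
import hmac

SECRET_KEY = b"galaxy-server-secret"  # stand-in for a server-side secret

def sign_dataset(path: str) -> str:
    """Tool A side: hash the file contents, then sign the hash with the secret."""
    digest = hashlib.sha256(open(path, "rb").read()).hexdigest()
    return hmac.new(SECRET_KEY, digest.encode(), hashlib.sha256).hexdigest()

def verify_dataset(path: str, signature: str) -> bool:
    """Tool B side: recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign_dataset(path), signature)
```

Any modification of the file between the two tools changes the hash and invalidates the signature, which is the "has not been tampered with since coming from another tool" property discussed below.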
We could consider a solution like this and then feel safer about the security of those objects, since we could prove that the dataset has not been modified since it came from one of the 'safe' tools. You couldn't upload one of these datasets and process it, but I think the tradeoffs are acceptable?
This would let us side-step a lot of the issues around the security of these datatypes.
Perhaps that could even be a galaxy feature, something like "assert this dataset came from this tool ID on this server" to further allow-list specific tools to create the datasets.
We could add a new tag to the tool, e.g.
<tool .... potentially_insecure_data_handling="true">
which forces Galaxy (if configured) to use containers
- no clue how this would work with destination configurations
- no clue if this would be secure enough, but at least we could indicate this to the Galaxy admin in a structured way
We certainly need this, it came up before, and that is something we can do without much headaches. In a sense ITs work like that, right ?
Yes, ITs have much more magic: they are annotated and have a container explicitly defined. But in a way it is comparable.
Any suggestion for such an option to make it more generic? Maybe containers="recommended" instead of true. Or really something about a security level?
> We could consider a solution like this and then feel safer about the security of those objects, since we could prove that the dataset has not been modified since it came from one of the 'safe' tools. You couldn't upload one of these datasets and process it, but I think the tradeoffs are acceptable?
I think the last time we discussed encoding, or rather signing, datasets, we didn't like that you cannot share the history afterwards with other Galaxy instances. Or you might have problems if you export the history? Signing datasets might be interesting, but it involves a huge subsystem of its own (which we might need anyway for the sensitive-data use case).
> you can not share the history afterwards with other Galaxy instances.
You could, you'd just have to re-run the relevant portions on that server. And exporting would include all the datasets, so it should be fine? Just re-run everything and you're good to go.
I think @sveinugu will implement it pretty simply without a huge subsystem, just by signing a hash of the dataset, and then checking that that dataset hash still verifies before execution, which seems pretty lightweight?
But yeah, fair point: if you really want signed datasets for sensitive use cases, then it'd definitely need a bigger system.
> I think @sveinugu will implement it pretty simply without a huge subsystem, just by signing a hash of the dataset, and then checking that that dataset hash still verifies before execution, which seems pretty lightweight?
Are we talking about the dataset hash sums? Those are only calculated for a small subset of datasets (I think uploads only at this point?), but that would be something to build on if it fits the bill. Reaching into app is a pretty good sign that we should not be doing or recommending something like that.
No, not the built-in hashes. (But we could build on those in the future.)
His plan was:
> Reaching into app is a pretty good sign
agreed! I'd love a more generic mechanism for proving that a dataset has been untampered with since coming from another tool.
This just popped up in an IWC meeting again.
I think in an ideal world it would be nice if we could avoid RData objects.
In my opinion tools should use RDS, which is a serialization of a single R object that is assigned to a single variable when loaded (in contrast to RData, which loads a complete environment).
For tools using RData, containerization might be an idea/solution.
> it's not guaranteed that you can read this data between different R/Python versions
Both datatypes are versioned and the version is stored in the dataset's metadata. So it is possible to load only those RDS/RData files that can be read by the R version used in a tool.
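A minimal sketch of reading that version out of the file header in Python (an assumption-laden illustration: it only handles RData files, whose header magic is `RDX2`/`RDX3` for serialization versions 2/3, and only gzip compression, R's default for `save()`; bzip2/xz and plain RDS files are not covered):

```python
import gzip

def rdata_version(path: str) -> int:
    """Return the RData serialization version (2 or 3) from the file's magic."""
    # Peek at the first two bytes to detect gzip compression (\x1f\x8b).
    with open(path, "rb") as fh:
        compressed = fh.read(2) == b"\x1f\x8b"
    opener = gzip.open if compressed else open
    with opener(path, "rb") as fh:
        head = fh.read(5)
    if head.startswith(b"RDX2"):
        return 2
    if head.startswith(b"RDX3"):
        return 3
    raise ValueError("not a recognized RData file")
```

A datatype sniffer along these lines could refuse files whose version is newer than what the tool's R version can read.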
Hi IUC. Could you please have a look at https://github.com/galaxyproject/tools-iuc/pull/3859
It's the old discussion about the security of RData objects: do we allow RData as input datasets in IUC tools? In the worst case this implies that a hand-crafted RData object could be uploaded into Galaxy and do some (no clue what kind of) harm.
related: https://github.com/galaxyproject/galaxy/pull/12336 xref: https://github.com/usegalaxy-eu/usegalaxy-eu-tools/pull/411#issuecomment-912474989
Side note: we have the same problem with Python objects, for example with Machine Learning models. We put a lot of effort into allow-listing functions and objects and skipping others to make it secure.
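That kind of allow-listing can be sketched with the `find_class` override pattern from the standard `pickle` documentation (the allow-list contents here are hypothetical, not the actual Galaxy ML allow-list):

```python
import builtins
import io
import pickle

# Hypothetical allow-list: only these globals may be resolved during load.
SAFE_GLOBALS = {("builtins", "list"), ("builtins", "range")}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle stream references a global;
        # anything not explicitly allowed is rejected.
        if (module, name) in SAFE_GLOBALS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(
            f"global '{module}.{name}' is not on the allow-list")

def restricted_loads(data: bytes):
    """Like pickle.loads, but refuses any global not explicitly allowed."""
    return RestrictedUnpickler(io.BytesIO(data)).load()
```

Plain containers of literals load fine, while a payload that tries to resolve something like `os.system` is rejected before it can run.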
A few thoughts from my side.
<tool .... potentially_insecure_data_handling="true">
which forces Galaxy (if configured) to use containers