Add function to upload file

ymarcon commented 2 years ago

There is a need for sending large data chunks to the server (ML): https://datashield.discourse.group/t/how-to-pass-models-between-data-scientists-and-nodes-hospitals/440

Instead of working around the current API by serializing the data chunk and passing it as function parameter, the proposal is to add the ability to upload a file from the R/DS client to the R/DS server. It is the responsibility of the server-side function loading analyzing the uploaded file to ensure it is not malicious.

Function definition

The function signature would be:

datashield.upload(conns, source, destination = NULL)

Where:

conns is the list of DS connections
source is the path to the local (client) file to upload
destination is the path to the destination, relative to the R server session's work directory. It omitted, the destination is the source's file name.

Examples:

datashield.upload(conns, 'someData.rds')
# subfolders are automatically created when needed
datashield.upload(conns, 'someData.rds', 'myfolder/data.rds')

The following would fail because destination is not a subfolder of the R server session's workdir:

datashield.upload(conns, 'someData.rds', '/myfolder/data.rds')
datashield.upload(conns, 'someData.rds', '../myfolder/data.rds')

A subsequent function call will make use of the uploaded file:

datashield.assign.expr(conns, symbol = "someData", expr = quote(loadMyData('someData.rds')))

DataSHIELD node settings

The data manager should be able to disable this file upload feature at the DataSHIELD profile level.

ymarcon commented 2 years ago

@StuartWheater @davraam @tombisho Do you see any privacy preserving reasons for not having this feature?

tombisho commented 2 years ago

I've been following the conversation on the forum with interest because it is also relevant to dsSwissKnife. They actually encode all function arguments as a long string passed to the serverside function, and then decode them again. On one side this is an easy way of passing a bunch of complicated arguments and not getting caught in the parser.... but then it partly defeats the point of the parser. I guess it is bypassing in a known context and the parser still operates on general aggregate or assign calls. If I had time I was going to see if I could hide something malicious in a call to one of their functions and raise this as an issue....

On the actual question, how would the server side function make sure the contents is not malicious? Run some kind of virus scan?

Could your .rds object in fact hold a function that does something you don't want it to do?

ymarcon commented 2 years ago

Opal can't analyse the content of this file. It must be the responsibility of the server-side function.

StuartWheater commented 2 years ago

Concern has been expressed about malevolent client being able to upload variable with contains which aid disclosure, for example, vectors containing a few '1' and many '0'.

Were you thinking of having a new type of methods list, in addition to "assign" and "aggregate", for "load" methods.

pb51 commented 2 years ago

Hi all I think we should discuss this as part of the forthcoming discussion on disclosure - which is now being organised as everyone is back from vacation. As you know I have a recently written direct function that can transfer long blocks of code if they are in the proper format of a matrix data frame or tibble as a long argument of a client side function and I like Yannick’s load method as an alternative in some cases. As yannick says In BOTH cases we need the server side function to scrutinise but I hadn’t seen the uploading of judicious 0/1 variables as the main threat here. These are only a substantive problem if you already know enough to know where to put the 1s. Given current flexibility in the server side transformation functions it is too easy to do this in some situations and this is one of the main drivers of our current discussions about disclosure and how to combat it. Until we have tightened up on the server side someone coming in with a dangerous 0/1 vector via load or client server approach would either already have to know a lot about a given dataset (more than they should if the sensitive dataset has been properly managed) or to have undertaken the sort of analysis on the server side which we know we need to identify and block anyway.

To address this issue we need to develop systems on the server side which can do this work. We have a range of proposed approaches which we will discuss at the forthcoming disclosure meeting and can prioritise our response. But Stuart is already working 25 hours a day and not only do we need to ensure his funding but we need to seek additional funding to carry out some of the new work properly. So let’s see what comes out of the disclosure meeting.

P

On Mon, 20 Sep 2021 at 22:09, Stuart Wheater @.***> wrote:

Concern has been expressed about malevolent client being able to upload variable with contains which aid disclosure, for example, vectors containing a few '1' and many '0'.

Were you thinking of having a new type of methods list, in addition to "assign" and "aggregate", for "load" methods.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/datashield/DSI/issues/12#issuecomment-923312216, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABAW4OON3L73LHTS5N774W3UC6PHDANCNFSM5EL4GKOA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

--

Paul Burton, Professor of Data Science for Health

D2K Research Group, Institute of Health and Society, Newcastle University,

Baddiley-Clark Building, Richardson Road, Newcastle upon Tyne, NE2 4AX, United Kingdom

Telephone (for IHS): (0)191 208 7045

Email: @.***

pb51 commented 2 years ago

Thanks Tom. It would certainly be useful to try seeing what malicious things can be got through. If you want to test it on a real function you could try my new c2sdmt function as a convenient vehicle for trying to hide malicious code. As soon as we identify a class of dangerous hidden elements we need to develop code to block them. Fortunately because the server side function MUST interpret the code before it can do anything I am fairly confident that we are going to be able to develop blocking code (but not for the 0/1 vector approach because we have to allow 0/1 vectors and combatting them requires the more generic approaches - possibly based on AI

which interrogate the function calls and nature of new vectors on the server side as well as the log files.

On Mon, 20 Sep 2021 at 17:22, tombisho @.***> wrote:

I've been following the conversation on the forum with interest because it is also relevant to dsSwissKnife. They actually encode all function arguments https://github.com/sib-swiss/dsSwissKnifeClient/blob/master/R/dssCoxph.R#L41 as a long string passed to the serverside function, and then decode them again. On one side this is an easy way of passing a bunch of complicated arguments and not getting caught in the parser.... but then it partly defeats the point of the parser. I guess it is bypassing in a known context and the parser still operates on general aggregate or assign calls. If I had time I was going to see if I could hide something malicious in a call to one of their functions and raise this as an issue....

On the actual question, how would the server side function make sure the contents is not malicious? Run some kind of virus scan?

Could your .rds object in fact hold a function that does something you don't want it to do?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/datashield/DSI/issues/12#issuecomment-923079452, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABAW4OLETHBQIEB2XMOLXBDUC5NU5ANCNFSM5EL4GKOA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

--

Paul Burton, Professor of Data Science for Health

D2K Research Group, Institute of Health and Society, Newcastle University,

Baddiley-Clark Building, Richardson Road, Newcastle upon Tyne, NE2 4AX, United Kingdom

Telephone (for IHS): (0)191 208 7045

Email: @.***

ymarcon commented 2 years ago

@StuartWheater the idea is not to have an additional method type. Opal will only take care of the file transfer operation; the file usage (whatever it means) is to be performed by a dedicated assign or an aggregate method.

tombisho commented 2 years ago

You can save a function into an .rds file. So the server side function will need to block that, perhaps? Assuming that if you have a function defined in the Global environment, you could then call it with datashield.aggregate?

On my comment about hiding malicious code in function arguments, I've had a think about it and it seems quite challenging to actually do any harm.

ymarcon commented 2 years ago

What the resourcer package does when reading a file of format:

RDS : read the R object using readRDS() and verifies it has the expected class (function would be rejected for instance)
RData: load() the data in a private environment and inspect for the objects in this environment to find the one with the expected class

A server side package receiving a file to read could do the same (or could use the resourcer package to read an object from a local file!).

datashield / DSI

Add function to upload file #12

Function definition

DataSHIELD node settings