Prevent Upload of dangerous file types like .exe, .bat

amitupreti commented 1 year ago

Problem: Our current setup allows authors to upload any types of files in the project, and these files can be downloaded without any warnings by other users like reviewers/editors/authors which can be dangerous in case the project author has submitted some kind of executable file that can make changes to system.

Solution/Discussion: So it seems we might have to think of a way to modify data upload process to reduce risks, or at least show warnings to users when they download certain types of files.

Some points from discussion with Tom

on the main PhysioNet deployment there are some files with the .exe, .bat extensions which is not a good idea, so we should restrict certain files and then add a comment explaining why. We certainly don’t want to publish executable files.
A side note when working on this, we could also think about adding a functionality that let's us do certain kinds of checks such as if the uploaded files.
- For example if the uploaded file is doc, pdf, etc, trigger functions that can check if there is some kind of PHI information(protected health information (e.g. a name or date in a file))
- Or if the uploaded file is PDF make sure its a pdf(The file's header should be checked to ensure it matches a PDF file's signature.) - This comes from the TRA healthdatanexus did.(L01)

Implementation/Ideas: 1, We could update the view generate_signed_url and create a separate function say analyze_file, that does all kinds of checks on files. So in future we could just keep adding checks/logic on the separate function, which is used by the upload view generate_signed_url and probably use django messages to throw errors

For showing erros maybe we can have a static JSON files with error codes, and use that with the output of analyze_file to throw appropriate errors to the uploader

bemoody commented 1 year ago

To the contrary, we publish a lot of software which is executable.

There is no reason you should be less frightened by an executable Python script than by a DOS batch file.

We do not want to publish compiled code because it cannot be inspected or modified and it bit-rots. That's a matter of policy, though. It's also quite impossible to reliably recognize compiled code in all the many forms it may take.

Editors are responsible for carefully handling untrusted files. That includes not downloading random untrusted .exe files and running them. It also includes not downloading random untrusted .py files and running them.

The platform cannot prevent editors from doing dumb things, and if we try to pretend it can, we create a dangerous false sense of security.

alistairewj commented 1 year ago

Yeah, this was raised by a security audit and I'm not sure how best to address it. Allowing executable files to be uploaded carries risk (obviously), but it seems to be a part of the game when running a publishing platform. Their immediate suggestion was to mitigate it by blocking certain file types, which doesn't fit well with the use of the platform.

Is it possible to reduce the risk in other ways? My thoughts are:

Warn the reviewer when downloading certain file types. Probably stuff that automatically executes when double clicked on windows (.exe., .bat). We could simply add this to the text which displays on the file preview when they click a file. However if we did that, we would probably also have to give a modal pop-up if they clicked the direct download icon next to the file.
Run a malicious code scanner, e.g. ClamAV, on the files after upload. For the HDN deployment, we can run this on GCP independent of the platform. For PhysioNet, you could run it on the server within docker, or just not run it at all since it may not be a priority for you.

bemoody commented 1 year ago

The days of people downloading exe files and having them run automatically are long past. As far as I'm aware, every Windows browser sets the appropriate magic flag to ensure that Windows will warn you before letting you execute the file. This feature is so old that it's hard to find search results about it - I think the feature is called "zone identifiers" and might have been added as recently as Windows XP SP2 (2004).

Running clamav isn't necessarily a bad idea. It doesn't really mean anything, though, in terms of protection from malicious uploaders, and I'd also be a bit concerned about the time it would take to scan a multi-TB database.

That said, I've long wanted a framework for automatically running checks on file content. Checking known file formats, detecting obvious PHI, checking that admins didn't break the permissions... the list goes on. Virus scanning could be part of that too.

What I would imagine would be to have additional states for unpublished projects, e.g.:

unsubmitted
submitted, pending checks
awaiting editor assignment
awaiting decision
awaiting copyedit
copyedited, pending checks
awaiting author approval
awaiting publication

There would be some automated process (using background_tasks, maybe) that would run the checks and advance the project to the next stage, or else report errors and kick it back to the previous stage.

MIT-LCP / physionet-build

Prevent Upload of dangerous file types like .exe, .bat #1730