Tech4Tracing / CartridgeOCR

https://tech4Tracing.org
MIT License

implement malware scanning for all files uploaded by users #54

Open monkeypants opened 2 years ago

monkeypants commented 2 years ago

we are processing files that get uploaded to our system by people.

We should be cautious. Uploaded files should land in a quarantine, then get scanned for malware before being allowed to interact with any other process (thumbnail generation per #50, metadata extraction per #48, etc.).
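
Something like the sketch below is the kind of flow I mean, assuming a ClamAV daemon reachable through the clamd Python package; the directory paths and the function itself are just placeholders.

import shutil
from pathlib import Path

import clamd  # assumes a running ClamAV daemon and the clamd package

QUARANTINE_DIR = Path("/srv/uploads/quarantine")  # placeholder paths
CLEAN_DIR = Path("/srv/uploads/clean")

def quarantine_and_scan(tmp_path: Path) -> Path:
    # Move an upload into quarantine, scan it, and only release it if clean.
    quarantined = QUARANTINE_DIR / tmp_path.name
    shutil.move(tmp_path, quarantined)

    scanner = clamd.ClamdUnixSocket()
    with open(quarantined, "rb") as f:
        status, signature = scanner.instream(f)["stream"]

    if status != "OK":
        quarantined.unlink()  # or keep it around for forensics
        raise ValueError(f"malware detected: {signature}")

    released = CLEAN_DIR / quarantined.name
    shutil.move(quarantined, released)
    return released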

koriaf commented 2 years ago

As a quick solution we could validate that the incoming file is a valid image from a supported list of formats (jpeg, png, webm). I haven't seen any dangerous jpegs recently; full-scale antivirus protection may be overkill at the moment.
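
A rough sketch of that check with Pillow; the format whitelist here is just an example, and verify() only guards against malformed files, not malicious-but-valid ones.

from PIL import Image

ALLOWED_FORMATS = {"JPEG", "PNG"}  # example whitelist, extend as needed

def is_supported_image(path):
    # Returns True only if Pillow can parse the file and it is a whitelisted format.
    try:
        with Image.open(path) as im:
            im.verify()  # walks the file and raises if it is truncated or corrupt
            return im.format in ALLOWED_FORMATS
    except (OSError, SyntaxError):  # Pillow raises these for files it cannot parse
        return False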

monkeypants commented 2 years ago

we don't need to do it now - not until we want to go to public BETA and start inviting strangers to use it. Initial users will be friends or people under professional conduct obligations.

simra commented 1 year ago

https://stackabuse.com/step-by-step-guide-to-file-upload-with-flask/

monkeypants commented 1 year ago

Avoiding devious file paths is a good idea:

from werkzeug.utils import secure_filename
filename = secure_filename(file.filename)  # strips path separators and other unsafe characters

and setting a maximum file size prevents a kind of DoS.
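
In Flask that size limit is a one-line config; the 16 MB cap below is just an example value.

from flask import Flask

app = Flask(__name__)
# Requests with a body larger than this are rejected with a 413 before our handlers run.
app.config["MAX_CONTENT_LENGTH"] = 16 * 1024 * 1024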

Whitelisting file extensions is not a bad idea either:

ALLOWED_EXTENSIONS = {'png', 'jpg', 'jpeg', 'gif'}

def allowed_file(filename):
    # only accept names whose extension is in the whitelist (case-insensitive)
    return '.' in filename and \
           filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS
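
Pulled together in an upload view (reusing allowed_file from above), it might look roughly like this; the route, field name, and quarantine directory are made-up placeholders.

import os
from flask import Flask, request, abort
from werkzeug.utils import secure_filename

app = Flask(__name__)
UPLOAD_DIR = "/srv/uploads/quarantine"  # placeholder path

@app.route("/upload", methods=["POST"])
def upload():
    file = request.files.get("file")
    if file is None or file.filename == "":
        abort(400, "no file supplied")
    if not allowed_file(file.filename):
        abort(400, "unsupported file type")
    filename = secure_filename(file.filename)  # defang the client-supplied name
    file.save(os.path.join(UPLOAD_DIR, filename))
    return "ok", 201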

However, any file can have a valid name (e.g. somefile.jpeg), even if it's a wolf in sheep's clothing.

So we might also want to test that the file actually is the type its name says it is. PIL could be a cheap way to do that, e.g.

from PIL import Image

try:
    im = Image.open(filename)  # parses the header and identifies the format
    im.verify()                # walks the rest of the file; raises if it is truncated or corrupt
    is_image = True
except (OSError, SyntaxError): # Pillow raises these for files it cannot parse
    is_image = False

somewhere around here might be a good place.

It's still not as defensive as a quarantine / virus scanner pattern, but we could add that later.

eelebrun commented 1 year ago

Metadata from photos uploaded to the platform should also be scrubbed of any data that can uniquely identify the contributor. Obviously we need the useful metadata, but we also need GDPR-style compliance and to keep ahead of the new AI rules coming into effect about personal and biometric data.

monkeypants commented 1 year ago

If the image content itself contains personal information, scrubbing that out is a difficult job. But we can process the EXIF metadata in JPEGs. For example (with upload_path standing in for wherever the quarantined file lives):

from exif import Image

with open(upload_path, 'rb') as f:
    image = Image(f)

if image.has_exif:
    image.delete_all()  # wipe every EXIF tag to make it clean
    with open(upload_path, 'wb') as f:
        f.write(image.get_file())  # write the stripped bytes back out

I think PIL also has a way to scrub out the metadata. But maybe we don't want to scrub out all of it? Info about camera and lens type might be useful. I don't suppose GPS coordinates + date/time would ever be used (insert spy-movie plot twist about measuring shadow angles, etc.).
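
For the PIL/Pillow route, one blunt way is to copy the pixel data into a fresh image so none of the original metadata comes along; a sketch (it re-encodes the image, which may not be acceptable for originals):

from PIL import Image

with Image.open(upload_path) as im:       # upload_path as above
    clean = Image.new(im.mode, im.size)
    clean.putdata(list(im.getdata()))     # copy pixel data only; EXIF and other info stay behind
    clean.save(upload_path)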

Perhaps we should be redacting the EXIF data from the images, so we know they are not leaking PII that way, but also putting it into a separate file and processing it with less of a "shotgun surgery" approach: stripping out PII while keeping the metadata that's useful for building better models in the future.
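
A rough sketch of that split using the same exif package; which tags count as PII here is just a guess and would need a proper review, and the sidecar file naming is made up.

import json
from exif import Image

PII_TAGS = {"gps_latitude", "gps_longitude", "gps_altitude", "datetime_original"}  # guess at PII-ish tags

with open(upload_path, 'rb') as f:
    image = Image(f)

if image.has_exif:
    # Sidecar file keeps everything except the tags we consider PII.
    sidecar = {tag: str(image.get(tag)) for tag in image.list_all() if tag not in PII_TAGS}
    with open(upload_path + '.exif.json', 'w') as f:
        json.dump(sidecar, f, indent=2)

    # The image itself gets stripped completely, as above.
    image.delete_all()
    with open(upload_path, 'wb') as f:
        f.write(image.get_file())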