Open monkeypants opened 2 years ago
As a quick solution, we could validate that the incoming file is a valid image from a supported list of formats (jpeg, png, webm). I haven't seen any dangerous jpegs recently; full-scale antivirus protection may be overkill at the moment.
We don't need to do it now - not until we want to go to a public BETA and start inviting strangers to use it. Initial users will be friends or people under professional conduct obligations.
https://stackabuse.com/step-by-step-guide-to-file-upload-with-flask/
Avoiding devious file paths is a good idea:

```python
from werkzeug.utils import secure_filename

# bind the sanitized result to a new name; don't shadow the imported function
filename = secure_filename(file.filename)
```
And enforcing a maximum file size prevents a kind of DoS.
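In Flask that size cap can be set globally; a minimal sketch (the 5 MB limit is an arbitrary placeholder, not a decision):

```python
from flask import Flask

app = Flask(__name__)
# requests with a larger body are rejected with 413 Request Entity Too Large
app.config['MAX_CONTENT_LENGTH'] = 5 * 1024 * 1024  # 5 MB
```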
The file name whitelisting is not a bad idea:

```python
ALLOWED_EXTENSIONS = {'png', 'jpg', 'jpeg', 'gif'}

def allowed_file(filename):
    return '.' in filename and \
        filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS
```
However, any file can have a valid name (e.g. somefile.jpeg), even if it's a wolf in sheep's clothing.
But we might also want to test that it is actually a file of the type its name says it is. Maybe PIL could be a cheap way to do it, e.g.

```python
from PIL import Image

try:
    im = Image.open(filename)
    im.verify()  # Image.open is lazy; verify() actually inspects the file data
    is_image = True
except (IOError, SyntaxError):
    is_image = False
```
Somewhere around here might be a good place.
It's still not as defensive as a quarantine / virus scanner pattern, but we could add that later.
Metadata from photos uploaded to the platform should also be scrubbed of any data that can uniquely identify the contributor. Obviously we need the good metadata, but somehow we also need GDPR-type compliance, and to avoid falling foul of the new AI rules coming into effect about personal and biometric data.
On Fri, Apr 15, 2022, 3:30 PM Chris Gough @.***> wrote:
we are processing files that get uploaded to our system by people.
We should be cautious. Uploaded files should land in a quarantine, then get scanned for malware before being allowed to interact with any other process (generate thumbnails per #50 https://github.com/Tech4Tracing/CartridgeOCR/issues/50, metadata extraction per #48 https://github.com/Tech4Tracing/CartridgeOCR/issues/48, etc.).
If the image contains personal information, it's a difficult job to scrub that out. But we can process the EXIF metadata in jpegs. For example:

```python
from exif import Image

with open('photo.jpg', 'rb') as f:
    image = Image(f)

if image.has_exif:
    # we could do this to make it clean
    image.delete_all()
```
I think PIL also has a way of scrubbing out the metadata. But maybe we don't want to scrub out all the metadata? Info about camera and lens type might be useful. I don't suppose GPS coordinates + date/time would ever be used (insert spy movie plot twist about measuring shadow angles etc.).
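If Pillow is on hand, one blunt way to scrub everything is to rebuild the image from raw pixel data, which drops EXIF and every other info block. A sketch (`strip_metadata` is a hypothetical helper, and note that re-encoding a jpeg this way is lossy):

```python
from PIL import Image

def strip_metadata(src: str, dst: str) -> None:
    """Re-create the image from raw pixels so no EXIF or other info survives."""
    with Image.open(src) as im:
        clean = Image.new(im.mode, im.size)
        clean.putdata(list(im.getdata()))
        clean.save(dst)
```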
Perhaps we should be redacting the EXIF data from the images, so we know they are not leaking PII in EXIF data, but putting it into a separate file and processing it with less of a "shotgun surgery" approach: stripping out PII while keeping metadata that's useful for building more effective models in the future.
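That separation could start from a simple allow/deny split over tag names. A sketch where both tag sets are assumptions to be refined (names follow the `exif` package's snake_case convention):

```python
# tags that could identify a person, place, or individual device (assumed PII)
PII_TAGS = {'gps_latitude', 'gps_longitude', 'gps_altitude',
            'datetime_original', 'body_serial_number', 'artist'}
# tags that seem useful for future model building
USEFUL_TAGS = {'make', 'model', 'lens_model', 'focal_length',
               'f_number', 'exposure_time', 'photographic_sensitivity'}

def split_exif(tags):
    """Split raw EXIF tags into (pii_to_redact, metadata_to_keep)."""
    pii = {k: v for k, v in tags.items() if k in PII_TAGS}
    kept = {k: v for k, v in tags.items() if k in USEFUL_TAGS}
    return pii, kept
```

The PII side could then be discarded (or encrypted), while the kept side goes into the separate sidecar file.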