getzola / zola

A fast static site generator in a single binary with everything built-in. https://www.getzola.org
MIT License
13.46k stars, 944 forks

Processing images for publishing #838

Open mkroman opened 4 years ago

mkroman commented 4 years ago

Much like how Zola has the ability to resize images, it would be nice if it could also process images and wipe any sensitive EXIF data, such as geo-coordinates.

This shouldn't be too hard to implement, as it could be done much like this example, just by using a process_image function.

There's one caveat to this, though! Stripping the EXIF data can mean losing orientation information as a lot of modern smartphones simply save the orientation as metadata rather than rotate the image.

Another non-trivial concern is that people like me who publish the blog source might check in an image with the same sensitive data.

One solution to this could be recommending a default .gitignore with a directory specifically for images like this, and process_image could work even when the source image isn't found, and it will skip the processing if the determined output image already exists in the repo. It's not a perfect solution, though.

Keats commented 4 years ago

> There's one caveat to this, though! Stripping the EXIF data can mean losing orientation information as a lot of modern smartphones simply save the orientation as metadata rather than rotate the image.

That would have been too simple :(

There are already two image functions (resize_image and get_image_metadata), so I'm not against adding one more, but it's becoming a bit hard to understand how they work together: can I combine them? Since they would both create an image, that seems like it would inflate the size of the repo/site. A function to read EXIF data could be super useful too, at least to get lat/long/date.

> One solution to this could be recommending a default .gitignore with a directory specifically for images like this, and process_image could work even when the source image isn't found, and it will skip the processing if the determined output image already exists in the repo. It's not a perfect solution, though.

I think in that case it would be better for the users to trim exif data themselves, otherwise they get a broken site if they lose their HDD...

cc @vojtechkral

vojtechkral commented 4 years ago

Hm. Filtering geo coords might be useful, but it seems like a fairly specialized thing...

mkroman commented 4 years ago

> I think in that case it would be better for the users to trim exif data themselves, otherwise they get a broken site if they lose their HDD...

Not necessarily. If the generated EXIF-less image uses a reproducible output path that gets checked in, it's possible to check whether a file already exists at this generated output path before doing any work. So even if the original image is lost, the generated image should still exist and be valid.
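The idea above could be sketched roughly as follows. This is a minimal illustration, not Zola's implementation: the function names, the `Action` enum, and the choice of an FNV-1a hash of the source's relative path are all assumptions made for the example. The path (rather than the file contents) is hashed so that the output name can still be computed when the original image is gone.

```rust
use std::path::{Path, PathBuf};

/// FNV-1a (64-bit). Used here only because it is tiny and gives the same
/// value on every run and platform; any stable hash would do.
pub fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    for byte in bytes {
        hash ^= u64::from(*byte);
        hash = hash.wrapping_mul(0x0100_0000_01b3);
    }
    hash
}

/// Hypothetical naming scheme: the output file name depends only on the
/// source's relative path, so it is reproducible without the source file.
pub fn output_path(out_dir: &Path, source_rel: &str) -> PathBuf {
    let ext = Path::new(source_rel)
        .extension()
        .and_then(|e| e.to_str())
        .unwrap_or("jpg");
    out_dir.join(format!("{:016x}.{}", fnv1a(source_rel.as_bytes()), ext))
}

/// What the build step should do for one image.
pub enum Action {
    /// The checked-in output already exists; no work, source not needed.
    Reuse(PathBuf),
    /// The output is missing but the private source is available.
    Process { from: PathBuf, to: PathBuf },
    /// Neither exists; the caller should report an error.
    Missing,
}

pub fn plan(source_root: &Path, out_dir: &Path, source_rel: &str) -> Action {
    let to = output_path(out_dir, source_rel);
    if to.exists() {
        return Action::Reuse(to);
    }
    let from = source_root.join(source_rel);
    if from.exists() {
        Action::Process { from, to }
    } else {
        Action::Missing
    }
}
```

One trade-off of this sketch: because only the path is hashed, replacing the source image under the same name would not trigger reprocessing unless the old output is deleted first.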

On another project, I'm already using *nix tools to accomplish this with a basic execution pipeline: I use exiftran -i -a <file> to rotate the image according to its EXIF metadata, and then exiv2 rm <file> to remove the EXIF metadata afterwards. This could be provided as an example solution and it would solve the problem, but it's very platform-specific. I'd also argue that there's practically no good reason for leaving these possibly very private details in any published images on the internet in general, unless the user explicitly requests it.
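As a rough sketch, the two-step pipeline described above could be driven from Rust like this. The two command lines are the ones quoted in the comment; the function names and the `argv` helper are made up for the example, and the code assumes exiftran and exiv2 are installed and on the PATH. The order matters: the rotation step needs the orientation tag, so it has to run before the metadata is removed.

```rust
use std::io;
use std::path::Path;
use std::process::Command;

/// `exiftran -i -a <file>`: rotate in place according to the EXIF metadata.
pub fn rotate_command(file: &Path) -> Command {
    let mut cmd = Command::new("exiftran");
    cmd.arg("-i").arg("-a").arg(file);
    cmd
}

/// `exiv2 rm <file>`: remove the metadata afterwards.
pub fn strip_command(file: &Path) -> Command {
    let mut cmd = Command::new("exiv2");
    cmd.arg("rm").arg(file);
    cmd
}

/// Hypothetical helper: render a command as strings, e.g. for a dry run.
pub fn argv(cmd: &Command) -> Vec<String> {
    let mut parts = vec![cmd.get_program().to_string_lossy().into_owned()];
    parts.extend(cmd.get_args().map(|a| a.to_string_lossy().into_owned()));
    parts
}

/// Runs both steps in order. Returns Ok(false) if either tool exits with a
/// non-zero status, and Err if a tool could not be started at all.
pub fn sanitize_in_place(file: &Path) -> io::Result<bool> {
    if !rotate_command(file).status()?.success() {
        return Ok(false);
    }
    Ok(strip_command(file).status()?.success())
}
```

Since both tools modify the file in place, it would be safer to run this on a copy of the original rather than on the only existing version.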

kellpossible commented 4 years ago

To solve the EXIF rotation issue, perhaps we could have an option to apply EXIF rotations to the image data. I've already run into some issues where images from smartphones don't display correctly in the browser because the rotation is only recorded in the EXIF data.
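To make "applying the EXIF rotation to the image data" concrete, here is a small sketch over a plain 2D grid of pixels. The mapping follows the eight values of the EXIF Orientation tag (1 = already upright, 3 = rotated 180°, 6 = needs a 90° clockwise turn, 8 = needs a 90° counter-clockwise turn, and 2/4/5/7 are the mirrored variants). The function name and the `Vec<Vec<T>>` representation are assumptions for illustration only; a real implementation would operate on a decoded image buffer.

```rust
/// Returns the pixels as they should be displayed, given the stored pixels
/// and the EXIF Orientation value (1..=8). Unknown values are treated as 1.
/// Assumes every row has the same length.
pub fn apply_orientation<T: Copy>(pixels: &[Vec<T>], orientation: u8) -> Vec<Vec<T>> {
    let h = pixels.len();
    let w = pixels.first().map_or(0, |row| row.len());
    if h == 0 || w == 0 || !(2..=8).contains(&orientation) {
        return pixels.to_vec();
    }
    // Orientations 5..=8 swap width and height.
    let (out_h, out_w) = if orientation >= 5 { (w, h) } else { (h, w) };
    let mut out = vec![vec![pixels[0][0]; out_w]; out_h];
    for r in 0..h {
        for c in 0..w {
            // (row, column) of stored pixel (r, c) in the displayed image.
            let (vr, vc) = match orientation {
                2 => (r, w - 1 - c),         // mirror horizontally
                3 => (h - 1 - r, w - 1 - c), // rotate 180 degrees
                4 => (h - 1 - r, c),         // mirror vertically
                5 => (c, r),                 // transpose
                6 => (c, h - 1 - r),         // rotate 90 degrees clockwise
                7 => (w - 1 - c, h - 1 - r), // transverse
                _ => (w - 1 - c, r),         // 8: rotate 90 degrees counter-clockwise
            };
            out[vr][vc] = pixels[r][c];
        }
    }
    out
}
```

After baking the rotation into the pixels like this, the Orientation tag has to be removed or reset to 1, otherwise viewers that honor it would rotate the image a second time.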

mkroman commented 4 years ago

I now have a different point of view to this problem; instead of messing around with .gitignore, pre-processing, repo hooks, or dynamic zola functions - why not make zola a complete command-line tool akin to how cargo or similar CLIs are used today?

I propose that zola either adds dynamic binary execution (i.e. zola something executes zola-something if found in $PATH, akin to how git or cargo do it; details unclear) in order to expand its functionality, or, preferably, embeds more CLI features that may be feature gated, such that each release includes two binaries: zola-serve (functionally equivalent to how it is now, used in production) and zola, which includes zola-serve and all additional user CLI features.
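For the first option, the git/cargo-style lookup could look something like the sketch below. Nothing here exists in Zola: the `Dispatch` enum and both functions are hypothetical, and the list of built-in commands is only illustrative. The search path is passed in explicitly (instead of reading the PATH environment variable inside the function) to keep the lookup easy to test.

```rust
use std::env;
use std::ffi::OsStr;
use std::path::PathBuf;

/// Illustrative list of built-in subcommands; assumed, not exhaustive.
const BUILT_IN: [&str; 4] = ["init", "build", "serve", "check"];

#[derive(Debug, PartialEq)]
pub enum Dispatch {
    BuiltIn(String),
    External(PathBuf),
    Unknown(String),
}

/// Looks for an executable named `zola-<name>` in the given PATH-style
/// value. Only checks that the file exists, not that it is executable.
pub fn find_external_subcommand(name: &str, path_var: &OsStr) -> Option<PathBuf> {
    let exe = format!("zola-{}{}", name, env::consts::EXE_SUFFIX);
    env::split_paths(path_var)
        .map(|dir| dir.join(&exe))
        .find(|candidate| candidate.is_file())
}

/// Built-ins win over external commands, so a stray `zola-build` on the
/// PATH cannot shadow the real `build`.
pub fn dispatch(name: &str, path_var: &OsStr) -> Dispatch {
    if BUILT_IN.contains(&name) {
        return Dispatch::BuiltIn(name.to_string());
    }
    match find_external_subcommand(name, path_var) {
        Some(path) => Dispatch::External(path),
        None => Dispatch::Unknown(name.to_string()),
    }
}
```

A caller would typically pass `env::var_os("PATH")` as `path_var` and, for `Dispatch::External`, spawn the found binary with the remaining command-line arguments.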

On top of this CLI we could add a command à la zola image add ~/Pictures/IMG_20191006_221543.jpg which would, assuming we're in the project root directory, do the following:

With this solution the end image is written to the (possibly public) project without leaking private metadata in a (relatively) reproducible manner that is CDN-friendly.

mkroman commented 4 years ago

A naive look into the currently related crates yields img-parts, which at first glance seems to offer an API to set a nullifying exif writer (i.e. jpeg.set_exif(None) in this example). Mayhaps exif metadata could become a part of the image specs?
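To show what removing the EXIF block amounts to at the byte level, here is a hand-rolled, std-only sketch. It deliberately does not use img-parts, and it is not how that crate is implemented; `strip_exif` is a name invented for this example. It walks the JPEG segments that precede the image data and drops every APP1 segment whose payload starts with the `Exif\0\0` identifier, copying everything else through unchanged.

```rust
/// Returns a copy of `jpeg` without its Exif APP1 segments, or None if the
/// input does not look like a well-formed JPEG header. Other metadata
/// (e.g. XMP, which also lives in APP1 but with a different identifier)
/// is left untouched.
pub fn strip_exif(jpeg: &[u8]) -> Option<Vec<u8>> {
    // Every JPEG starts with the SOI marker FF D8.
    if !jpeg.starts_with(&[0xFF, 0xD8]) {
        return None;
    }
    let mut out = vec![0xFF, 0xD8];
    let mut pos = 2;
    while pos + 4 <= jpeg.len() {
        if jpeg[pos] != 0xFF {
            return None;
        }
        let marker = jpeg[pos + 1];
        // SOS (FF DA): compressed image data follows, copy the rest as-is.
        if marker == 0xDA {
            out.extend_from_slice(&jpeg[pos..]);
            return Some(out);
        }
        // Segment length is big-endian and includes its own two bytes.
        let len = u16::from_be_bytes([jpeg[pos + 2], jpeg[pos + 3]]) as usize;
        let end = pos + 2 + len;
        if len < 2 || end > jpeg.len() {
            return None;
        }
        let payload = &jpeg[pos + 4..end];
        let is_exif = marker == 0xE1 && payload.starts_with(b"Exif\0\0");
        if !is_exif {
            out.extend_from_slice(&jpeg[pos..end]);
        }
        pos = end;
    }
    // Ran out of bytes before reaching the image data.
    None
}
```

As noted earlier in the thread, the orientation tag lives in the same EXIF block, so stripping it this way also discards the rotation hint unless the rotation has been applied to the pixels first.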

Keats commented 4 years ago

> I propose that zola either adds dynamic binary execution (i.e. zola something executes zola-something if found in $PATH, akin to how git or cargo do it; details unclear) in order to expand its functionality, or, preferably, embeds more CLI features that may be feature gated, such that each release includes two binaries: zola-serve (functionally equivalent to how it is now, used in production) and zola, which includes zola-serve and all additional user CLI features.

That's kind of the opposite of the goal of the project though, "everything you need in one binary".

mkroman commented 4 years ago

> That's kind of the opposite of the goal of the project though, "everything you need in one binary".

That's true, and it was also just a recommendation to avoid a big change for a niche feature, and to keep the binary size small-ish, but it probably doesn't add up to that much anyway.