wp_get_attachment_metadata() strips what it thinks are html tags in Exif metadata

creativecommons / wp-plugin-creativecommons

Official Creative Commons plugin for licensing your content. With Creative Commons licenses, keep your copyright AND share your creativity.

https://wordpress.org/plugins/creative-commons/

GNU General Public License v2.0

wp_get_attachment_metadata() strips what it thinks are html tags in Exif metadata #14

Open rheaplex opened 8 years ago

rheaplex commented 8 years ago

If we have a JPEG whose Exif Copyright field looks like:

<no> tags, tags are <stripped>

then when we upload the file to WordPress and fetch the Exif metadata using:

wp_get_attachment_metadata($att_id[0], true);

then the string we get for Copyright is:

 tags, tags are

I assume this is due to WordPress taking the sensible precaution of stripping HTML tags from outside input, but it does mean that the format we are using for license URLs falls foul of this.
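For anyone who wants to reproduce the effect outside WordPress, here is a minimal sketch. It uses PHP's built-in strip_tags() as a stand-in for WordPress's own filtering, which is an assumption: WordPress's sanitiser is more selective about which tags it allows, but for unknown "tags" like these the visible result is the same. The function name is made up for illustration.

```php
<?php
// Hypothetical helper, not part of the plugin or of WordPress.
// strip_tags() is only an approximation of WordPress's sanitising.
function cc_sketch_strip_like_wordpress(string $value): string
{
    // Anything that looks like <...> is treated as an HTML tag and removed.
    return strip_tags($value);
}

$copyright = '<no> tags, tags are <stripped>';
$sanitised = cc_sketch_strip_like_wordpress($copyright);

// Both bracketed words are gone; only the text between them survives.
var_dump(trim($sanitised));
```

Any value that wraps a URL in angle brackets would be removed in the same way, which is why the license URL format is affected.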

I've chased this some way down the call stack and I can't find anywhere to change it. I'd rather not have to use PHP's Exif parsing, although I've just tested it and it doesn't have the same problem.

Investigating further, but if anyone knows of a quick fix for this please let me know.

rheaplex commented 8 years ago

In wp-admin/includes/image.php:

wp_read_image_metadata()

calls:

wp_kses_post_deep()

which strips tags.

So we have to use PHP's Exif parsing. I'll look at hooking this in for the Media editor, pulling the values for the fields if they are not otherwise populated.
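A rough sketch of what reading the field directly could look like, assuming PHP's exif extension is available. The helper name is hypothetical and this is not the plugin's actual code. exif_read_data() is the standard PHP call; in WordPress the file path would typically come from get_attached_file().

```php
<?php
/**
 * Hypothetical helper: return the raw Exif Copyright string for a file,
 * or null if it cannot be read. The value is NOT sanitised here, so it
 * must be escaped before being output anywhere.
 */
function cc_sketch_read_exif_copyright(string $path): ?string
{
    // The exif extension is optional in PHP, so check for it first.
    if (!function_exists('exif_read_data') || !is_file($path) || !is_readable($path)) {
        return null;
    }

    // Suppress warnings for files with missing or malformed Exif data.
    $exif = @exif_read_data($path);
    if (!is_array($exif) || !isset($exif['Copyright']) || !is_string($exif['Copyright'])) {
        return null;
    }

    $value = trim($exif['Copyright']);
    return $value === '' ? null : $value;
}

// Usage inside WordPress (assumes $attachment_id is a valid attachment ID):
// $copyright = cc_sketch_read_exif_copyright(get_attached_file($attachment_id));
```

Because this bypasses WordPress's sanitising, the returned string should be treated as untrusted input and escaped (for example with esc_html() or esc_url()) at the point of display.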

@mattl does this indicate a more general problem with the Exif tag format we are proposing? I don't believe so, but worth considering. Also maybe we should consider adding source and CC+ if we haven't already.

rheaplex commented 8 years ago

Making progress with the PHP Exif parsing; just trying to make it efficient in the code and logical for the user.

rheaplex commented 8 years ago

The code now extracts the license and attribution URLs when you view the media. I'm looking to see if I can hook this into the image upload process, but if not this will be Good Enough, I think.

rheaplex commented 8 years ago

Metadata now extracted on image upload.

This won't get metadata for existing images if the plugin is installed and we have (e.g.) 20,000 images with Exif already in the system.

@mattl we can run the extract code when you view the image in the Media editor, or is this something we might want to give the user the option of running manually from the settings for the plugin (a button [Scan Existing Images for License Metadata And Apply It] ) if that's possible?

mattl commented 8 years ago

Won't existing images have been previously stripped by WordPress?

rheaplex commented 8 years ago

I don't believe so. The strings are stripped after reading from the file, rather than the file itself being sanitised.

mattl commented 8 years ago

[image: screenshot from 2016-07-29 15-56-15]

Maybe something like this? We could pull all the existing images from the CC website as a test, but @ericsteuer also has good insight into how this works on a big site like Wired.com, which probably has a few hundred thousand images.

rheaplex commented 8 years ago

I had in mind more a global "Extract CC License metadata where present but don't overwrite anything" option.
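To make the "don't overwrite anything" rule concrete, here is a minimal sketch of the merge step on its own, separate from any WordPress storage. The function name and the field keys (license_url, attribution_url) are assumptions for illustration, not the plugin's actual names.

```php
<?php
/**
 * Hypothetical helper: copy extracted values into the existing fields,
 * but only where the existing field is missing or blank.
 */
function cc_sketch_fill_missing(array $existing, array $extracted): array
{
    foreach ($extracted as $field => $value) {
        $current = $existing[$field] ?? '';

        // Skip fields the user has already populated, and skip empty extractions.
        if (trim((string) $current) === '' && trim((string) $value) !== '') {
            $existing[$field] = $value;
        }
    }

    return $existing;
}

$existing  = ['license_url' => 'https://example.org/already-set', 'attribution_url' => ''];
$extracted = ['license_url' => 'https://example.org/from-exif', 'attribution_url' => 'https://example.org/author'];

// license_url keeps the user's value; attribution_url is filled from Exif.
print_r(cc_sketch_fill_missing($existing, $extracted));
```

Keeping the rule in a pure function like this would let the same logic serve both a site-wide scan and a per-image button.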

We could also add a button to the media manager to do this for individual images.

So the former would support hundreds of thousands, the latter just a few if you only want to use a few.

mattl commented 8 years ago

The worry I have there is that we'd wind up adding extra captions to existing images all over the place.

rheaplex commented 8 years ago

Sure. It's the sort of thing where the user will want the plugin to do the right thing, for a value of "the right thing" that will differ from case to case. And they'll really want an Undo button.

So if this is too difficult to do usefully we shouldn't make something that will just frustrate people. :-)

BjornW commented 8 years ago

Why not use the "regenerate thumbnails" approach, in which a plugin runs once over all existing images? This could be a separate add-on plugin that can be removed after it has run, since it's likely to be run only once.