cc-archive / cccatalog

[PROJECT TRANSFERRED] Mapping the commons towards an open ledger and cc search.
https://github.com/WordPress/openverse-catalog
MIT License

[Bug] Non-image content is being stored in the image table #438

Closed: aldenstpage closed this 3 years ago

aldenstpage commented 4 years ago

Bug Description

It seems like the Wikimedia source doesn't exclude non-image content. Here is a small subsample of some audio files I've encountered:

2a01b0fa-73b6-4911-889e-b02d41cf125d    https://upload.wikimedia.org/wikipedia/commons/6/62/Nl-mbo-studenten.ogg
9cd47e61-9389-46e1-b7a6-c3dd0e9efbdb    https://upload.wikimedia.org/wikipedia/commons/8/82/Nl-medaillekans.ogg
ec982a83-f036-4286-aedb-c2a45e540cb7    https://upload.wikimedia.org/wikipedia/commons/7/78/Nl-mea_culpa.ogg
136a5a87-a196-43d4-a506-ea3db5eea1d8    https://upload.wikimedia.org/wikipedia/commons/a/a8/Nl-meanderende.ogg
8ed0e5d4-0b6a-439e-9577-58496b2a10a1    https://upload.wikimedia.org/wikipedia/commons/2/2b/Nl-meao.ogg
a79cbff6-5026-47fd-9479-4cc00c24d307    https://upload.wikimedia.org/wikipedia/commons/e/e7/Nl-meccano%27s.ogg
a4c81ad0-0dee-4e62-8ed0-b50b553ab77c    https://upload.wikimedia.org/wikipedia/commons/b/b6/Nl-meccano.ogg
1e22bf99-fc4f-48a0-85a8-a6c705f4e3aa    https://upload.wikimedia.org/wikipedia/commons/8/8f/Nl-meccanodoos.ogg
7c8140b8-7929-4649-9f41-74d37f5f87b1    https://upload.wikimedia.org/wikipedia/commons/f/f0/Nl-meccanodozen.ogg
8530ab64-41ed-439e-b160-912352624502    https://upload.wikimedia.org/wikipedia/commons/5/56/Nl-mecenaat.ogg

This will cause trouble, since invalid content will make its way into our thumbnail proxy, our image crawler, etc.

Of course, it won't be possible to completely validate the content type of a work without checking headers, but I think it would be sensible to have a file extension allowlist: parse the file extension from the URL when one is available, and exclude works that are clearly not images. If the problem isn't especially widespread, maybe it can be solved by making some tweaks to the Wikimedia API task.
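
A minimal sketch of the allowlist idea, for illustration only. The names `IMAGE_EXTENSIONS`, `get_extension`, and `is_probably_image` are hypothetical and not part of the catalog codebase, and the set of extensions is an assumption. URLs with no extension are kept, since only "clearly non-image" works should be excluded:

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse

# Hypothetical allowlist; the issue does not say which extensions to accept.
IMAGE_EXTENSIONS = {"jpg", "jpeg", "png", "gif", "tif", "tiff", "bmp", "webp"}


def get_extension(url):
    """Return the lowercased file extension of a URL path, or None if absent."""
    path = unquote(urlparse(url).path)
    suffix = PurePosixPath(path).suffix  # e.g. ".ogg", or "" when there is none
    return suffix[1:].lower() or None


def is_probably_image(url):
    """Reject only URLs whose extension is present and not on the allowlist."""
    extension = get_extension(url)
    return extension is None or extension in IMAGE_EXTENSIONS


if __name__ == "__main__":
    audio = "https://upload.wikimedia.org/wikipedia/commons/6/62/Nl-mbo-studenten.ogg"
    print(get_extension(audio), is_probably_image(audio))
```

This check is only a heuristic: an extension does not guarantee the content type, which is why header checks would still be needed for full validation.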

mathemancer commented 4 years ago

After taking a peek at the API responses, it seems like a large number of files returned by an 'allimages' query are non-image file types. I'm currently querying our DB to see how large a problem this is on our end.

I suspect we'll have to modify our Wikimedia Commons Provider API Script to avoid saving some non-image files. This will end up being a bit ad-hoc, though.

mathemancer commented 4 years ago

Further info:

It seems like the name of that endpoint is outdated somehow. However, we can get a 'mediatype' field in the result. One of its values, BITMAP, seems to cover the following MIME types:

            'image/x-bmp',
            'image/x-ms-bmp',
            'image/bmp',
            'image/gif',
            'image/jpeg',
            'image/png',
            'image/tiff',
            'image/vnd.djvu',
            'image/x-xcf',
            'image/x-portable-pixmap',
            'image/gif',
            'image/png',
            'image/x-png',
            'image/ief',
            'image/jpeg',
            'image/pjpeg',
            'image/jp2',
            'image/xbm',
            'image/tiff',
            'image/x-icon',
            'image/x-ico',
            'image/vnd.microsoft.icon',
            'image/x-rgb',
            'image/x-portable-pixmap',
            'image/x-portable-graymap',
            'image/x-portable-greymap',
            'image/x-bmp',
            'image/x-ms-bmp',
            'image/bmp',
            'application/x-bmp',
            'application/bmp',
            'image/x-photoshop',
            'image/psd',
            'image/x-psd',
            'image/photoshop',
            'image/vnd.adobe.photoshop',
            'image/webp',

Unfortunately, the MIME type doesn't appear to be directly available. However, restricting to just the BITMAP media type should avoid the audio files at least.

mathemancer commented 4 years ago

So, the implementer should add mediatype to the iiprop query param list in the requests, and only save an 'image' if it has mediatype: BITMAP.
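
A rough sketch of that change, not the actual provider script. The function names, the default `iiprop` values, and the sample response are hypothetical. The response shape (`query` -> `pages` -> `imageinfo`) is assumed to match what a `generator=allimages` query with `prop=imageinfo` returns as JSON:

```python
def build_query_params(iiprop=("url", "extmetadata")):
    """Build request params, making sure 'mediatype' is among the iiprop values."""
    props = list(iiprop)
    if "mediatype" not in props:
        props.append("mediatype")
    return {
        "action": "query",
        "generator": "allimages",
        "prop": "imageinfo",
        "iiprop": "|".join(props),
        "format": "json",
    }


def is_bitmap(page):
    """True only when the first imageinfo entry reports mediatype BITMAP."""
    info = page.get("imageinfo") or [{}]
    return info[0].get("mediatype") == "BITMAP"


def filter_bitmap_pages(response_json):
    """Keep only the pages that should be saved as images."""
    pages = response_json.get("query", {}).get("pages", {})
    return [page for page in pages.values() if is_bitmap(page)]


if __name__ == "__main__":
    # Hand-written sample in the assumed response shape, not real API output.
    sample = {
        "query": {
            "pages": {
                "1": {"title": "File:Nl-meao.ogg", "imageinfo": [{"mediatype": "AUDIO"}]},
                "2": {"title": "File:Example.png", "imageinfo": [{"mediatype": "BITMAP"}]},
            }
        }
    }
    print(build_query_params()["iiprop"])
    print([page["title"] for page in filter_bitmap_pages(sample)])
```

Pages with no `imageinfo` or no `mediatype` are dropped here, which errs on the side of not storing unverified files.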

mathemancer commented 4 years ago

Once this is done, we can clean up the DB.