aldenstpage closed 3 years ago
After taking a peek at the API responses, it seems like a large number of files returned by an 'allimages' query are non-image file types. I'm currently querying our DB to see how large a problem this is on our end.
I suspect we'll have to modify our Wikimedia Commons Provider API Script to avoid saving some non-image files. This will end up being a bit ad-hoc, though.
Further info:
It seems like the naming of that endpoint is outdated somehow. However, we can get a 'mediatype' field in the result. One of the media types, BITMAP, seems to cover the following MIME types:
'image/x-bmp',
'image/x-ms-bmp',
'image/bmp',
'image/gif',
'image/jpeg',
'image/png',
'image/tiff',
'image/vnd.djvu',
'image/x-xcf',
'image/x-portable-pixmap',
'image/gif',
'image/png',
'image/x-png',
'image/ief',
'image/jpeg',
'image/pjpeg',
'image/jp2',
'image/xbm',
'image/tiff',
'image/x-icon',
'image/x-ico',
'image/vnd.microsoft.icon',
'image/x-rgb',
'image/x-portable-pixmap',
'image/x-portable-graymap',
'image/x-portable-greymap',
'image/x-bmp',
'image/x-ms-bmp',
'image/bmp',
'application/x-bmp',
'application/bmp',
'image/x-photoshop',
'image/psd',
'image/x-psd',
'image/photoshop',
'image/vnd.adobe.photoshop',
'image/webp',
Unfortunately, the MIME type doesn't appear to be directly available. However, restricting to just the BITMAP media type should avoid the audio files at least.
So, the implementer should add mediatype to the iiprop query param list in the requests, and only save an 'image' if it has mediatype: BITMAP.
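A rough sketch of that change, not the actual provider script: the helper names with_mediatype and is_bitmap are made up for illustration, and I'm assuming each image info record comes back as a dict with a 'mediatype' key once the prop is requested.

```python
def with_mediatype(params):
    """Return a copy of the query params with 'mediatype' added to iiprop.

    Hypothetical helper; assumes iiprop is a pipe-separated string.
    """
    props = [p for p in params.get("iiprop", "").split("|") if p]
    if "mediatype" not in props:
        props.append("mediatype")
    return {**params, "iiprop": "|".join(props)}


def is_bitmap(image_info):
    """True only when the record's mediatype is exactly BITMAP.

    Assumes image_info is the per-file dict from the API response.
    Records with no mediatype at all are rejected as well.
    """
    return image_info.get("mediatype") == "BITMAP"
```

The script would then skip any record where is_bitmap returns False instead of saving it as an image.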
Once this is done, we can clean up the DB.
Bug Description
It seems like the Wikimedia source doesn't exclude non-image content. Here is a small subsample of some audio files I've encountered:
This will cause some trouble with invalid content making its way into our thumbnail proxy, our image crawler, etc.
Of course, it won't be possible to completely validate the content type of a work without checking headers, but I think it would be sensible to have a file extension allowlist: parse out the file extension when it is available and exclude works that are clearly not images. If this isn't especially widespread, maybe it can be solved by making some tweaks to the Wikimedia API task.
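To make the allowlist idea concrete, here is a minimal sketch. The IMAGE_EXTENSIONS set and the has_allowed_extension name are placeholders I picked for illustration, not an agreed list, and URLs with no extension are let through since they can't be judged without checking headers.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical allowlist; the real set would need to be agreed on.
IMAGE_EXTENSIONS = {"jpg", "jpeg", "png", "gif", "tif", "tiff", "bmp", "webp"}


def has_allowed_extension(url):
    """Reject a URL only when it has an extension outside the allowlist."""
    suffix = PurePosixPath(urlparse(url).path).suffix
    if not suffix:
        # No extension available: cannot tell, so do not exclude.
        return True
    return suffix[1:].lower() in IMAGE_EXTENSIONS
```

For example, a URL ending in .ogg would be excluded, while one ending in .JPG would pass because the comparison is case-insensitive.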