@WordPress/openverse-api Is anyone available to reach out to the Jetpack folks to get clarification on the response content type from Site Accelerator?
Description
When an image extension is not known (e.g., it is not in `filetype` and cannot be pulled from the URL extension or `HEAD` content-type), we assume it is unsupported and do not try to make a thumbnail request.
In practice this mainly affects a subset of providers whose services work in such a way that we cannot know the file type ahead of the upstream thumbnail request. The example I've found is Smithsonian:
https://api.openverse.org/v1/images/ebdbe147-bceb-4c84-9736-d9d06a37a6a9/
The `url` has no extension, the record has no `filetype`, and the HEAD response has no content-type header.
To fix this, rather than assume the unknown file type is unsupported, it would be great if we could still try sending the request upstream to Site Accelerator.
In the case of a successful response from Site Accel, we can cache that fact in Redis and bypass the extension check for that media in the future.
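A minimal sketch of that flow, to make the proposal concrete. Every name here (`KnownGoodCache`, `request_thumbnail`, the `thumbnail_ok:` key prefix, the `supported` set) is hypothetical and not taken from the Openverse codebase. The cache is an in-memory stand-in for Redis, and the extension lookup and upstream call are passed in as plain callables so the example stays self-contained:

```python
from typing import Callable, Optional


class KnownGoodCache:
    """In-memory stand-in for Redis (assumption: the real version would use a Redis key or set)."""

    def __init__(self) -> None:
        self._keys = set()

    def mark(self, media_id: str) -> None:
        # Record that Site Accelerator returned a successful response for this media.
        self._keys.add(f"thumbnail_ok:{media_id}")

    def is_marked(self, media_id: str) -> bool:
        return f"thumbnail_ok:{media_id}" in self._keys


def request_thumbnail(
    media_id: str,
    lookup_extension: Callable[[str], Optional[str]],
    fetch_upstream: Callable[[str], int],
    cache: KnownGoodCache,
    supported: set,
) -> Optional[int]:
    """Return the upstream status code, or None if the request was skipped."""
    if not cache.is_marked(media_id):
        # Only pay for the extension lookup (which may involve a HEAD request)
        # when we have no record of a previous successful thumbnail.
        extension = lookup_extension(media_id)
        if extension is not None and extension.lower() not in supported:
            # Known and unsupported: skip, as today.
            return None
        # Unknown extension (None) falls through: try upstream anyway.
    status = fetch_upstream(media_id)
    if 200 <= status < 300:
        cache.mark(media_id)
    return status
```

Once a media item has produced a successful thumbnail, later calls skip the extension lookup entirely. A failed upstream response leaves the cache untouched, so the check runs again next time.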
I'm not 100% sure what content-type header Site Accel returns. If it accurately reflects the media type of the upstream image, we should cache it in such a way that we can ETL it back into the catalogue data for that work, as with https://github.com/WordPress/openverse/issues/3585.
Site Accelerator claims that it will return webp for clients that support it, but I sent `Accept: */*` in my httpie request and got a jpeg. When I try it in the browser, even with compression and resizing enabled, I still get a jpeg back. The upstream image is a jpeg, but I don't know whether that's why Site Accel returns a jpeg or whether it would also convert a PNG to a jpeg. I've looked at the Site Accelerator image processing code (formerly known as Photon), and I don't really see what would cause it to return a different file type than the upstream image.
It would be worth reaching out to the Jetpack folks to see if they can clarify this for us. If we can reliably retrieve the file type after the request for works for which we don't have that information, it would be great to store it and eventually ETL it back into the catalogue!
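If the response content-type does turn out to be trustworthy, deriving a file type from it could look roughly like the sketch below. The function name, the mapping, and the `trust_webp` flag are all assumptions for illustration. In particular, webp is treated as ambiguous by default because, per the above, it may reflect a conversion rather than the upstream format:

```python
from typing import Optional

# Illustrative mapping only; not the catalogue's actual list of file types.
IMAGE_MEDIA_TYPES = {
    "image/jpeg": "jpg",
    "image/png": "png",
    "image/gif": "gif",
    "image/webp": "webp",
}


def filetype_from_content_type(
    header: Optional[str], trust_webp: bool = False
) -> Optional[str]:
    """Map a Content-Type header value to a file type, or None if unusable."""
    if not header:
        return None
    # Drop parameters such as "; charset=..." and normalise case.
    media_type = header.split(";", 1)[0].strip().lower()
    if media_type == "image/webp" and not trust_webp:
        # Possibly a converted response, so not safe to store as the upstream type.
        return None
    return IMAGE_MEDIA_TYPES.get(media_type)
```

Returning `None` for anything unrecognised keeps bad guesses out of the cache, so nothing wrong would later be ETL'd into the catalogue.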
We might also check and see whether Smithsonian in general has this issue, and implement special handling for them instead of needing to check at all.
Additional context
Provider-specific special handling for thumbnail requests has precedent in #4736.