WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org
MIT License

Some images are treated as unsupported file types in thumbnails, despite being supported file types #4852

Open sarayourfriend opened 2 months ago

sarayourfriend commented 2 months ago

Description

When an image's extension is not known (i.e., the record has no filetype, and the type cannot be inferred from the URL extension or the content-type of a HEAD response), we assume it is unsupported and do not try to make a thumbnail request.
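
For illustration, the lookup order described above could be sketched roughly as follows. This is not the actual Openverse implementation; `guess_extension` and its parameters are hypothetical names used only to show the three sources being consulted in turn.

```python
from os.path import splitext
from typing import Optional
from urllib.parse import urlparse


def guess_extension(
    filetype: Optional[str],
    url: str,
    head_content_type: Optional[str],
) -> Optional[str]:
    """Return a lowercase extension without the dot, or None if unknown.

    Hypothetical helper: the order (record filetype, URL extension,
    HEAD content-type) follows the description in this issue.
    """
    # 1. Trust the filetype stored on the record when present.
    if filetype:
        return filetype.lower()

    # 2. Fall back to the extension in the URL path, if any.
    _, ext = splitext(urlparse(url).path)
    if ext:
        return ext[1:].lower()

    # 3. Fall back to the content-type header of a HEAD response,
    #    e.g. "image/png; charset=binary" -> "png".
    if head_content_type and "/" in head_content_type:
        subtype = head_content_type.split(";", 1)[0].split("/", 1)[1]
        return subtype.strip().lower() or None

    # Nothing to go on: today this is treated as "unsupported".
    return None
```

With the Smithsonian example below, all three sources come up empty, so a function like this returns `None`.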

In practice, this affects a subset of providers whose services work in such a way that we cannot know the file type ahead of the upstream thumbnail request. The example I've found is Smithsonian:

https://api.openverse.org/v1/images/ebdbe147-bceb-4c84-9736-d9d06a37a6a9/

The URL has no extension, the record has no filetype, and the HEAD response has no content-type header:

13:41:45 ~ https HEAD https://ids.si.edu/ids/deliveryService/id/ark:/65665/m3e18cd9704c324b81aee222344ff03401 -p Hhm
HEAD /ids/deliveryService/id/ark:/65665/m3e18cd9704c324b81aee222344ff03401 HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: ids.si.edu
User-Agent: HTTPie/3.2.2

HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
Connection: Keep-Alive
Date: Mon, 02 Sep 2024 03:41:49 GMT
Keep-Alive: timeout=2, max=1000
Set-Cookie: ROUTEID=.05; Path=/ids
Set-Cookie: <REDACTED>; Path=/; Domain=.si.edu; Secure; HttpOnly; 
Set-Cookie: <REDACTED>; path=/ids; HttpOnly; Secure
Transfer-Encoding: chunked

Elapsed time: 0.904084275s

To fix this, rather than assuming an unknown file type is unsupported, it would be great if we could still try sending the request upstream to Site Accel:

13:41:49 ~ https https://i0.wp.com/ids.si.edu/ids/deliveryService/id/ark:/65665/m3e18cd9704c324b81aee222344ff03401 -p Hhm
GET /ids.si.edu/ids/deliveryService/id/ark:/65665/m3e18cd9704c324b81aee222344ff03401 HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: i0.wp.com
User-Agent: HTTPie/3.2.2

HTTP/1.1 200 OK
Access-Control-Allow-Methods: GET, HEAD
Access-Control-Allow-Origin: *
Alt-Svc: h3=":443"; ma=86400
Cache-Control: public, max-age=63115200
Connection: keep-alive
Content-Length: 8939411
Content-Type: image/jpeg
Date: Mon, 02 Sep 2024 03:43:29 GMT
ETag: "c45c220f14577f21"
Expires: Wed, 02 Sep 2026 15:43:29 GMT
Last-Modified: Mon, 02 Sep 2024 03:43:29 GMT
Link: <http://ids.si.edu/ids/deliveryService/id/ark:/65665/m3e18cd9704c324b81aee222344ff03401>; rel="canonical"
Server: nginx
Timing-Allow-Origin: *
Vary: Accept
X-Bytes-Saved: 285486
X-Content-Type-Options: nosniff
X-nc: MISS syd 4

Elapsed time: 3.246793642s

In the case of a successful response from Site Accel, we can cache that fact in Redis and bypass the extension check for that media in the future.
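
A minimal sketch of that flow, assuming a cache client exposing `get` and `set` (as Redis clients do). `InMemoryCache`, `fetch_thumbnail`, the key format, and the set of supported extensions are all hypothetical and only stand in for whatever the API actually uses.

```python
from typing import Callable, Optional, Tuple

# Assumed set of supported extensions, for illustration only.
SUPPORTED_EXTENSIONS = frozenset({"jpg", "jpeg", "png", "gif", "webp"})


class InMemoryCache:
    """Stand-in for a Redis client; only get/set are used here."""

    def __init__(self) -> None:
        self._data = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


def fetch_thumbnail(
    media_id: str,
    lookup_extension: Callable[[], Optional[str]],
    request_upstream: Callable[[], Tuple[int, Optional[str]]],
    cache,
) -> Optional[int]:
    """Return the upstream status code, or None if the request was skipped."""
    key = f"thumbnail:upstream_ok:{media_id}"  # hypothetical key format

    # Only run the (potentially slow) extension check when we have not
    # already seen a successful upstream response for this media.
    if cache.get(key) is None:
        extension = lookup_extension()
        # A *known* unsupported type is still rejected; an unknown one
        # (None) is now allowed through to the upstream request.
        if extension is not None and extension not in SUPPORTED_EXTENSIONS:
            return None

    status, content_type = request_upstream()
    if status == 200:
        # Remember the success so future requests bypass the check.
        cache.set(key, content_type or "unknown")
    return status
```

Storing the returned content-type as the cached value, rather than a bare flag, keeps the door open for using it later if it turns out to be reliable.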

I'm not 100% sure how reliable the content-type header that Site Accel returns is. If it accurately reflects the media type of the upstream image, we should cache it in such a way that we can ETL it back into the catalogue data for that work, as with https://github.com/WordPress/openverse/issues/3585.

Site Accelerator claims that it will return webp for clients that support it, but I sent Accept: */* in my httpie request and got a jpeg. When I try it in the browser, even with compression and resizing enabled, I still get a jpeg back. The upstream image is a jpeg, but I don't know whether that's the reason Site Accel returns a jpeg or whether it would convert a PNG to a jpeg. I've looked at the Site Accelerator image processing code (formerly known as Photon), and I don't really see what would cause it to return a different file type than the upstream image.

It would be worth reaching out to the Jetpack folks to see if they can clarify this for us. If we can reliably retrieve the file type after the request for works for which we don't have that information, it would be great to store it and eventually ETL it back into the catalogue!
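
Until that is clarified, one cautious option is to record a file type only for content types that are unlikely to be the result of a conversion. The sketch below is an assumption, not a description of Site Accelerator's behaviour: the mapping and the `filetype_from_response` name are hypothetical, and webp is deliberately left out because, per the claim above, a webp response might not match the upstream file.

```python
from typing import Optional

# Assumed mapping from response content type to catalogue filetype.
# "image/webp" is intentionally absent: a webp response may be a
# conversion rather than the upstream image's real type.
CONTENT_TYPE_TO_FILETYPE = {
    "image/jpeg": "jpg",
    "image/png": "png",
    "image/gif": "gif",
}


def filetype_from_response(content_type: Optional[str]) -> Optional[str]:
    """Map a Content-Type header value to a filetype, or None if unsure."""
    if not content_type:
        return None
    # Drop parameters such as "; charset=binary" and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    return CONTENT_TYPE_TO_FILETYPE.get(mime)
```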

We might also check whether Smithsonian has this issue in general and, if so, implement special handling for that provider instead of needing to check each image at all.
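
If that turns out to be the case, the special handling could be as small as a provider allowlist consulted before the file type check. The provider identifier and function name below are hypothetical and would need to be confirmed against the catalogue data.

```python
# Hypothetical provider identifiers whose file types cannot be known
# ahead of the upstream thumbnail request.
SKIP_FILETYPE_CHECK_PROVIDERS = frozenset({"smithsonian"})


def needs_filetype_check(provider: str) -> bool:
    """Return False for providers that should skip the file type check."""
    return provider.lower() not in SKIP_FILETYPE_CHECK_PROVIDERS
```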

Additional context

Provider-specific special handling for thumbnail requests has precedent in #4736.

sarayourfriend commented 2 months ago

@WordPress/openverse-api Is anyone available to reach out to the Jetpack folks to get clarification on the response content type from Site Accelerator?