Open foolip opened 7 months ago
cc @eeeps
Those URLs are returned with the following HTTP header:
Content-Type: application/octet-stream
That fails the Regex test here, so we fall back to looking at the file extension, at the place you identified.
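To make that fallback concrete, here is a minimal Python sketch of the two-step logic described above. It is not the actual pithyType() code: the function name, the regex, and both lookup tables are assumptions made up for illustration.

```python
import re
from urllib.parse import urlparse

# Hypothetical pattern: only image/* media types count as a usable header.
IMAGE_MIME_RE = re.compile(r"^image/([a-z0-9.+-]+)", re.IGNORECASE)

# Hypothetical alias and extension tables, not the ones used in the Almanac query.
MIME_ALIASES = {"jpeg": "jpg"}
EXTENSION_TO_FORMAT = {
    "jpg": "jpg", "jpeg": "jpg", "png": "png", "gif": "gif",
    "webp": "webp", "avif": "avif", "heic": "heif", "heif": "heif",
}


def classify_by_header_then_url(content_type, url):
    """Return (format, source), preferring Content-Type over the URL extension."""
    match = IMAGE_MIME_RE.match(content_type or "")
    if match:
        subtype = match.group(1).lower()
        return MIME_ALIASES.get(subtype, subtype), "content-type"
    # The header was missing or not image/*, so fall back to the URL path.
    path = urlparse(url).path
    ext = path.rsplit(".", 1)[-1].lower() if "." in path else ""
    return EXTENSION_TO_FORMAT.get(ext, "unknown"), "extension"
```

With a Content-Type of application/octet-stream, a URL ending in .heic comes back as ("heif", "extension"), whatever bytes the server actually sent.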
I agree that the crawler knows more than we can, by looking at URLs and HTTP headers, and it would be nice to have the actual decoded type exposed to catch cases like this (or, failing that, at least to get a sense of how common such cases are). It might actually already be, because of the work Pat Meenan did in 2022 to get the actual image resources run through ImageMagick and a bunch of things reported (see the note in the README https://github.com/HTTPArchive/almanac.httparchive.org/tree/ff9fd22f0489469ebf3254de6072f63cf086407a/sql/2022/media#notes-for-2023). I'll try to dig in later today to see why we didn't use that here.
That was fast! We didn't get to use any of the ImageMagick data here because this query is working from <img>s found in the markup, rather than from HTTP requests. See my note in the readme about my failure to join requests up to loaded <img> resources, and how that was my number one TODO going forward.
@eeeps is any of the code using ImageMagick running in the current crawls? I've been thinking about exactly that these past few days: whether we could run identify -format "%Q\n" for JPEG files in particular to understand the quality in a different way. I assumed that none of the resources are on disk so this would be a big lift, but it sounds like some of the work has already been done?
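For reference, a minimal Python sketch of wrapping that command for a file on local disk. It assumes ImageMagick's identify is on the PATH; the function names are hypothetical, and this is not how the crawler itself collects the data.

```python
import subprocess


def parse_quality_output(stdout):
    """Parse one integer per line, as printed by identify -format "%Q\\n"."""
    return [int(line) for line in stdout.splitlines() if line.strip()]


def jpeg_quality(path):
    """Run identify on a local file and return its first quality estimate."""
    result = subprocess.run(
        ["identify", "-format", "%Q\n", path],
        capture_output=True,
        text=True,
        check=True,  # raise if identify cannot read the file
    )
    return parse_quality_output(result.stdout)[0]
```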
Is the $._image_details data being written to anything in BigQuery yet? If not, is there a sample of that from the raw crawl data that I could look at? I'm interested to know what kind of stuff is in there and if it would help.
Yeah here's a way to cheaply (355.91 MB) query a sample of the $._image_details object:
SELECT
  url,
  JSON_QUERY(payload, '$._image_details') AS image_details
FROM
  `httparchive.all.requests` TABLESAMPLE SYSTEM (0.001 PERCENT)
WHERE
  date = '2024-01-01' AND
  client = 'mobile' AND
  is_root_page AND
  type = 'image'
LIMIT
  10
As per usual, Rick beat me to it. Different (older?) flavor:
SELECT
  url,
  JSON_QUERY(payload, '$._image_details') AS image_details
FROM `httparchive.requests.2023_12_01_mobile` TABLESAMPLE SYSTEM (0.001 PERCENT)
WHERE JSON_QUERY(payload, '$._image_details') IS NOT NULL
Thank you @rviscomi and @eeeps, my joy is boundless! I will play around with this.
After some terrible queries and intermediate tables I have a first result:
Is this the right repo to ask questions like "why is _image_details sometimes missing?" and other things I'll need to figure out to refine this?
Yeah I think here is fine
cc @pmeenan
@foolip not sure about venue (if I have a discussion that might require a chattier exploration, I generally start it in the HTTP Archive Slack), but @pmeenan is the person to ask about missing _image_details.
Interesting chart! I do worry though... the "quality" reported by ImageMagick's identify for JPEGs, like most 0-100 quality scales used by encoders, is arbitrary and IM- and JPEG-specific. It's based on the quantization tables IM finds in the file, which will mostly correlate with what people think "quality" means (a subjective evaluation of "how good" the output looks when compared with the input), but not at all exactly. Worse, this value doesn't line up with other formats or other tools. People expect "quality 80" to mean the same thing everywhere. It does not, even for tools that are only dealing with JPEGs, and once you're talking other formats, you're in another universe.
That said... the number of quality: 100 JPEGs here -- wow. Antipattern!
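To illustrate how a quality number relates to quantization tables, here is a minimal Python sketch of the forward mapping used by the IJG libjpeg encoder, applied to the standard luminance table from Annex K of the JPEG specification. This is one encoder's convention, not ImageMagick's estimator, which has to work backwards from whatever tables it finds in the file. The function name is made up for this example.

```python
# Standard luminance quantization table (JPEG spec, Annex K), row-major order.
BASE_LUMA = [
    16, 11, 10, 16, 24, 40, 51, 61,
    12, 12, 14, 19, 26, 58, 60, 55,
    14, 13, 16, 24, 40, 57, 69, 56,
    14, 17, 22, 29, 51, 87, 80, 62,
    18, 22, 37, 56, 68, 109, 103, 77,
    24, 35, 55, 64, 81, 104, 113, 92,
    49, 64, 78, 87, 103, 121, 120, 101,
    72, 92, 95, 98, 112, 100, 103, 99,
]


def ijg_scale_table(quality):
    """Scale BASE_LUMA the way libjpeg's quality setting does (baseline clamp)."""
    quality = max(1, min(100, quality))
    # libjpeg maps quality to a percentage scale factor.
    scale = 5000 // quality if quality < 50 else 200 - 2 * quality
    # Each entry is scaled, rounded, and clamped to 1..255 for baseline JPEG.
    return [max(1, min(255, (q * scale + 50) // 100)) for q in BASE_LUMA]
```

Under this mapping, quality 50 reproduces the base table unchanged and quality 100 collapses every entry to 1, the finest quantization step available. A different encoder or a custom table can produce the same visual result with tables that do not fit this scale at all, which is part of why the reported number is hard to compare across tools.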
If you have examples for where it is missing I can take a look. It could happen if for some reason the image response body isn't available or the code that detects the image type by looking at the header bytes doesn't recognize it.
heif is definitely not detected but the others should be reasonably up to date.
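As an illustration of header-byte detection in general, here is a minimal Python sketch. It is not the crawler's code: the function name and the set of formats covered are assumptions, and the ISO BMFF branch only looks at the major brand, which is a simplification.

```python
def sniff_image_type(data):
    """Guess an image format from its leading bytes; 'unknown' if unrecognized."""
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "gif"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    # ISO BMFF containers (AVIF, HEIF) start with an 'ftyp' box whose
    # major brand sits at bytes 8..12. Compatible brands are ignored here.
    if data[4:8] == b"ftyp":
        brand = data[8:12]
        if brand in (b"avif", b"avis"):
            return "avif"
        if brand in (b"heic", b"heix"):
            return "heif"
    return "unknown"
```

A response body that begins with FF D8 FF is reported as "jpeg" here no matter what the URL extension or the Content-Type header says.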
I have been poking at https://github.com/HTTPArchive/almanac.httparchive.org/blob/main/sql/2022/media/bytes_and_dimensions_by_format.sql to get an updated view of quality distributions in the wild.
I happened to look for 'heif' images and was surprised how many I found. It turns out that for example https://gaijincph.dk/ serves https://ftstorageprod.blob.core.windows.net/images/restaurant/5ad12db8/images/1e59c431-1198-4859-bebd-769d37d1a975_m.heic
That gets classified by pithyType() as 'heif' because the URL ends with '.heic'. However, it's actually a JPEG. It seems like only the URL is used in fact, because of this call here:
https://github.com/HTTPArchive/almanac.httparchive.org/blob/ff9fd22f0489469ebf3254de6072f63cf086407a/sql/2022/media/bytes_and_dimensions_by_format.sql#L112
There is no mimeType in the data, at least not in the httparchive.pages.2024_01_01_desktop data. Here's what I unpacked from payload and a few nested JSON objects for https://gaijincph.dk/:
Since the number of bytes and the decoded width and height are known, the decoder that was actually used should in principle be knowable.