HTTPArchive / almanac.httparchive.org

HTTP Archive's annual "State of the Web" report made by the web community
https://almanac.httparchive.org
Apache License 2.0

Images are classified only by URL? #3572

Open foolip opened 7 months ago

foolip commented 7 months ago

I have been poking at https://github.com/HTTPArchive/almanac.httparchive.org/blob/main/sql/2022/media/bytes_and_dimensions_by_format.sql to get an updated view of quality distributions in the wild.

I happened to look for 'heif' images and was surprised by how many I found. It turns out that, for example, https://gaijincph.dk/ serves https://ftstorageprod.blob.core.windows.net/images/restaurant/5ad12db8/images/1e59c431-1198-4859-bebd-769d37d1a975_m.heic

That gets classified by pithyType() as 'heif' because the URL ends with '.heic'. However, it's actually a JPEG.

It seems that in fact only the URL is used, because of this call:

https://github.com/HTTPArchive/almanac.httparchive.org/blob/ff9fd22f0489469ebf3254de6072f63cf086407a/sql/2022/media/bytes_and_dimensions_by_format.sql#L112

There is no mimeType in the data, at least not in the httparchive.pages.2024_01_01_desktop data. Here's what I unpacked from payload and a few nested JSON objects for https://gaijincph.dk/:

[
  {
    "hasSrc": true,
    "hasAlt": true,
    "isInPicture": false,
    "hasCustomDataAttributes": false,
    "hasWidth": false,
    "hasHeight": false,
    "url": "https://ftstorageprod.blob.core.windows.net/images/restaurant/5ad12db8/logo.png?v58288",
    "totalCandidates": 1,
    "altAttribute": "GAIJIN logo",
    "clientWidth": 150,
    "clientHeight": 134,
    "naturalWidth": 2097,
    "naturalHeight": 1598,
    "hasSrcset": false,
    "hasSizes": false,
    "currentSrcDensity": 1,
    "approximateResourceWidth": 2097,
    "approximateResourceHeight": 1598,
    "byteSize": 125672,
    "bitsPerPixel": 0.3000221426043403,
    "computedSizingStyles": {
      "width": "auto",
      "height": "auto",
      "maxWidth": "150px",
      "maxHeight": "none",
      "minWidth": "auto",
      "minHeight": "auto"
    },
    "intrinsicOrExtrinsicSizing": {
      "width": "both",
      "height": "intrinsic"
    },
    "reservedLayoutDimensions": false
  },
  {
    "hasSrc": true,
    "hasAlt": true,
    "isInPicture": false,
    "hasCustomDataAttributes": false,
    "hasWidth": true,
    "hasHeight": true,
    "url": "https://ftstorageprod.blob.core.windows.net/images/restaurant/5ad12db8/images/e6680b7e-a494-49d3-b8dd-027338d28566_m.jpg",
    "totalCandidates": 1,
    "heightAttribute": "100%",
    "widthAttribute": "100%",
    "altAttribute": "Picture of The Full Gaijin Experience",
    "clientWidth": 411,
    "clientHeight": 411,
    "naturalWidth": 720,
    "naturalHeight": 826,
    "hasSrcset": false,
    "hasSizes": false,
    "currentSrcDensity": 1,
    "approximateResourceWidth": 720,
    "approximateResourceHeight": 826,
    "byteSize": 283244,
    "bitsPerPixel": 3.8101156846919557,
    "computedSizingStyles": {
      "width": "100%",
      "height": "100%",
      "maxWidth": "none",
      "maxHeight": "none",
      "minWidth": "100%",
      "minHeight": "100%"
    },
    "intrinsicOrExtrinsicSizing": {
      "width": "extrinsic",
      "height": "extrinsic"
    },
    "reservedLayoutDimensions": false
  },
  {
    "hasSrc": true,
    "hasAlt": true,
    "isInPicture": false,
    "hasCustomDataAttributes": false,
    "hasWidth": true,
    "hasHeight": true,
    "url": "https://ftstorageprod.blob.core.windows.net/images/restaurant/5ad12db8/images/1e59c431-1198-4859-bebd-769d37d1a975_m.heic",
    "totalCandidates": 1,
    "heightAttribute": "100%",
    "widthAttribute": "100%",
    "altAttribute": "Picture of Tasting menu",
    "clientWidth": 411,
    "clientHeight": 411,
    "naturalWidth": 720,
    "naturalHeight": 960,
    "hasSrcset": false,
    "hasSizes": false,
    "currentSrcDensity": 1,
    "approximateResourceWidth": 720,
    "approximateResourceHeight": 960,
    "byteSize": 214586,
    "bitsPerPixel": 2.4836342592592593,
    "computedSizingStyles": {
      "width": "100%",
      "height": "100%",
      "maxWidth": "none",
      "maxHeight": "none",
      "minWidth": "100%",
      "minHeight": "100%"
    },
    "intrinsicOrExtrinsicSizing": {
      "width": "extrinsic",
      "height": "extrinsic"
    },
    "reservedLayoutDimensions": false
  },
  {
    "hasSrc": true,
    "hasAlt": true,
    "isInPicture": false,
    "hasCustomDataAttributes": false,
    "hasWidth": true,
    "hasHeight": true,
    "url": "https://ftstorageprod.blob.core.windows.net/images/restaurant/5ad12db8/images/73c94994-f80a-4f1c-b524-44d3e80e28ee_m.heic",
    "totalCandidates": 1,
    "heightAttribute": "100%",
    "widthAttribute": "100%",
    "altAttribute": "Picture of A la carte",
    "clientWidth": 411,
    "clientHeight": 411,
    "naturalWidth": 720,
    "naturalHeight": 900,
    "hasSrcset": false,
    "hasSizes": false,
    "currentSrcDensity": 1,
    "approximateResourceWidth": 720,
    "approximateResourceHeight": 900,
    "byteSize": 248070,
    "bitsPerPixel": 3.0625925925925928,
    "computedSizingStyles": {
      "width": "100%",
      "height": "100%",
      "maxWidth": "none",
      "maxHeight": "none",
      "minWidth": "100%",
      "minHeight": "100%"
    },
    "intrinsicOrExtrinsicSizing": {
      "width": "extrinsic",
      "height": "extrinsic"
    },
    "reservedLayoutDimensions": false
  },
  {
    "hasSrc": true,
    "hasAlt": true,
    "isInPicture": false,
    "hasCustomDataAttributes": false,
    "hasWidth": true,
    "hasHeight": true,
    "url": "https://ftstorageprod.blob.core.windows.net/images/restaurant/5ad12db8/images/29fd156a-2b46-4cb6-a5a3-0e481f66aaba_m.png",
    "totalCandidates": 1,
    "heightAttribute": "100%",
    "widthAttribute": "100%",
    "altAttribute": "Picture of Private Dining",
    "clientWidth": 411,
    "clientHeight": 411,
    "naturalWidth": 709,
    "naturalHeight": 540,
    "hasSrcset": false,
    "hasSizes": false,
    "currentSrcDensity": 1,
    "approximateResourceWidth": 709,
    "approximateResourceHeight": 540,
    "byteSize": 22605,
    "bitsPerPixel": 0.47233975865851746,
    "computedSizingStyles": {
      "width": "100%",
      "height": "100%",
      "maxWidth": "none",
      "maxHeight": "none",
      "minWidth": "100%",
      "minHeight": "100%"
    },
    "intrinsicOrExtrinsicSizing": {
      "width": "extrinsic",
      "height": "extrinsic"
    },
    "reservedLayoutDimensions": false
  },
  {
    "hasSrc": true,
    "hasAlt": true,
    "isInPicture": false,
    "hasCustomDataAttributes": false,
    "hasWidth": true,
    "hasHeight": false,
    "url": "https://ftstorageprod.blob.core.windows.net/images/restaurant/5ad12db8/coverimage_l.jpeg",
    "totalCandidates": 1,
    "widthAttribute": "100%",
    "altAttribute": "",
    "clientWidth": 918,
    "clientHeight": 918,
    "naturalWidth": 1200,
    "naturalHeight": 1200,
    "hasSrcset": false,
    "hasSizes": false,
    "currentSrcDensity": 1,
    "approximateResourceWidth": 1200,
    "approximateResourceHeight": 1200,
    "byteSize": 61206,
    "bitsPerPixel": 0.34003333333333335,
    "computedSizingStyles": {
      "width": "100%",
      "height": "auto",
      "maxWidth": "none",
      "maxHeight": "none",
      "minWidth": "auto",
      "minHeight": "auto"
    },
    "intrinsicOrExtrinsicSizing": {
      "width": "extrinsic",
      "height": "intrinsic"
    },
    "reservedLayoutDimensions": false
  }
]

Since the number of bytes and the decoded width and height are known, the decoder that was actually used should in principle be knowable.
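
For reference, the bitsPerPixel values in the payload above are consistent with byteSize * 8 divided by naturalWidth * naturalHeight. A minimal Python sketch that reproduces them; the helper name is made up for illustration, and the numbers are copied from the payload:

```python
def bits_per_pixel(byte_size: int, natural_width: int, natural_height: int) -> float:
    """Encoded bits spent per decoded pixel."""
    return byte_size * 8 / (natural_width * natural_height)


# (byteSize, naturalWidth, naturalHeight) for three of the images listed above
images = {
    "logo.png": (125672, 2097, 1598),
    "e6680b7e-a494-49d3-b8dd-027338d28566_m.jpg": (283244, 720, 826),
    "1e59c431-1198-4859-bebd-769d37d1a975_m.heic": (214586, 720, 960),
}

for name, (size, width, height) in images.items():
    print(f"{name}: {bits_per_pixel(size, width, height):.4f} bits per pixel")
```

The '.heic' file comes out at about 2.48 bits per pixel, the same order of magnitude as the 3.81 of the '.jpg' next to it. That is suggestive, but on its own it does not prove which codec was used.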

rviscomi commented 7 months ago

cc @eeeps

eeeps commented 7 months ago

Those URLs are returned with the following HTTP header:

Content-Type: application/octet-stream

That fails the Regex test here, so we fall back to looking at the file extension, at the place you identified.
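
To make that fallback concrete, here is a rough Python sketch of the logic as described in this thread: trust Content-Type when it looks like image/*, otherwise fall back to the URL's file extension. This is not the actual pithyType() code. The function name, the regex, the alias tables and every label except 'heif' are assumptions for illustration.

```python
import posixpath
import re
from typing import Optional
from urllib.parse import urlparse

# Assumed tables, for illustration only (not taken from the Almanac queries).
MIME_ALIASES = {"jpeg": "jpg", "svg+xml": "svg", "heic": "heif"}
EXTENSION_TO_FORMAT = {
    "jpg": "jpg", "jpeg": "jpg", "png": "png", "gif": "gif",
    "webp": "webp", "avif": "avif", "svg": "svg",
    "heic": "heif", "heif": "heif",
}
IMAGE_MIME = re.compile(r"^image/([a-z0-9.+-]+)", re.IGNORECASE)


def classify(url: str, content_type: Optional[str]) -> str:
    """Classify an image by Content-Type, falling back to the URL extension."""
    if content_type:
        match = IMAGE_MIME.match(content_type.strip())
        if match:
            subtype = match.group(1).lower()
            return MIME_ALIASES.get(subtype, subtype)
    # Content-Type is missing or is not image/*: only the URL is left to go on.
    extension = posixpath.splitext(urlparse(url).path)[1].lstrip(".").lower()
    return EXTENSION_TO_FORMAT.get(extension, "unknown")


url = ("https://ftstorageprod.blob.core.windows.net/images/restaurant/"
       "5ad12db8/images/1e59c431-1198-4859-bebd-769d37d1a975_m.heic")
print(classify(url, "application/octet-stream"))
```

With application/octet-stream the header carries no usable information, so the '.heic' suffix decides the label even though the bytes are a JPEG.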

I agree that the crawler knows more than we can infer from URLs and HTTP headers, and it would be nice to have the actual decoded type exposed to catch cases like this (or, failing that, at least to get a sense of how common such cases are). It might already be exposed, thanks to the work Pat Meenan did in 2022 to run the actual image resources through ImageMagick and report a number of properties (see the note in the README https://github.com/HTTPArchive/almanac.httparchive.org/tree/ff9fd22f0489469ebf3254de6072f63cf086407a/sql/2022/media#notes-for-2023). I'll try to dig in later today to see why we didn't use that here.

eeeps commented 7 months ago

That was fast! We didn't get to use any of the ImageMagick data here because this query is working from <img>s found in the markup, rather than from HTTP requests. See my note in the readme about my failure to join requests up to loaded <img> resources, and how that was my number one TODO going forward.

foolip commented 7 months ago

@eeeps is any of the code using ImageMagick running in the current crawls? I've been thinking about exactly that these past few days: whether we could run identify -format "%Q\n" on JPEG files in particular, to understand quality in a different way. I assumed that none of the resources are on disk, so this would be a big lift, but it sounds like some of the work has already been done?

foolip commented 7 months ago

Is the $._image_details data being written to anything in BigQuery yet? If not, is there a sample of that from the raw crawl data that I could look at? I'm interested to know what kind of stuff is in there and if it would help.

rviscomi commented 7 months ago

Yeah here's a way to cheaply (355.91 MB) query a sample of the $._image_details object:

SELECT
  url,
  JSON_QUERY(payload, '$._image_details') AS image_details
FROM
  `httparchive.all.requests` TABLESAMPLE SYSTEM (0.001 PERCENT)
WHERE
  date = '2024-01-01' AND
  client = 'mobile' AND
  is_root_page AND
  type = 'image'
LIMIT
  10

Sample result:

```json
{
  "detected_type": "jpeg",
  "metadata": {
    "ExifTool": { "ExifToolVersion": 12.52 },
    "File": {
      "FileSize": "137 kB",
      "FileType": "JPEG",
      "FileTypeExtension": "jpg",
      "MIMEType": "image/jpeg",
      "Comment": "CREATOR: gd-jpeg v1.0 (using IJG JPEG v80), quality = 80\n",
      "ImageWidth": 800,
      "ImageHeight": 800,
      "EncodingProcess": "Baseline DCT, Huffman coding",
      "BitsPerSample": 8,
      "ColorComponents": 3,
      "YCbCrSubSampling": "YCbCr4:2:0 (2 2)"
    },
    "JFIF": { "JFIFVersion": 1.01, "ResolutionUnit": "inches", "XResolution": 96, "YResolution": 96 },
    "Composite": { "ImageSize": "800x800", "Megapixels": 0.64 }
  },
  "magick": {
    "baseName": "10710.94",
    "format": "JPEG",
    "formatDescription": "JPEG",
    "mimeType": "image/jpeg",
    "class": "DirectClass",
    "geometry": { "width": 800, "height": 800, "x": 0, "y": 0 },
    "resolution": { "x": 96, "y": 96 },
    "printSize": { "x": 8.33333, "y": 8.33333 },
    "units": "PixelsPerInch",
    "type": "TrueColor",
    "baseType": "Undefined",
    "endianness": "Undefined",
    "colorspace": "sRGB",
    "depth": 8,
    "baseDepth": 8,
    "channelDepth": { "red": 8, "green": 8, "blue": 1 },
    "pixels": 1920000,
    "imageStatistics": {
      "Overall": { "min": 0, "max": 255, "mean": 65.7495, "median": 35.6667, "standardDeviation": 79.6716, "kurtosis": 0.0952899, "skewness": 1.18423, "entropy": 0.835339 }
    },
    "channelStatistics": {
      "red": { "min": 0, "max": 255, "mean": 58.5843, "median": 15, "standardDeviation": 82.5814, "kurtosis": 0.438709, "skewness": 1.4063, "entropy": 0.805027 },
      "green": { "min": 0, "max": 255, "mean": 60.4429, "median": 30, "standardDeviation": 76.1642, "kurtosis": 0.756071, "skewness": 1.40876, "entropy": 0.838816 },
      "blue": { "min": 0, "max": 255, "mean": 78.2214, "median": 62, "standardDeviation": 80.2692, "kurtosis": -0.494016, "skewness": 0.80254, "entropy": 0.862173 }
    },
    "renderingIntent": "Perceptual",
    "gamma": 0.454545,
    "chromaticity": {
      "redPrimary": { "x": 0.64, "y": 0.33 },
      "greenPrimary": { "x": 0.3, "y": 0.6 },
      "bluePrimary": { "x": 0.15, "y": 0.06 },
      "whitePrimary": { "x": 0.3127, "y": 0.329 }
    },
    "matteColor": "#BDBDBD",
    "backgroundColor": "#FFFFFF",
    "borderColor": "#DFDFDF",
    "transparentColor": "#00000000",
    "interlace": "None",
    "intensity": "Undefined",
    "compose": "Over",
    "pageGeometry": { "width": 800, "height": 800, "x": 0, "y": 0 },
    "dispose": "Undefined",
    "iterations": 0,
    "compression": "JPEG",
    "quality": 80,
    "orientation": "Undefined",
    "properties": {
      "comment": "CREATOR: gd-jpeg v1.0 (using IJG JPEG v80), quality = 80\n",
      "date:create": "2024-01-13T04:33:02+00:00",
      "date:modify": "2024-01-13T04:33:02+00:00",
      "date:timestamp": "2024-01-13T04:34:18+00:00",
      "jpeg:colorspace": "2",
      "jpeg:sampling-factor": "2x2,1x1,1x1",
      "signature": "0d0e8995e2aae98c15e1e2bc69c8f988423e022cf4055d72e9752a457a53a440"
    },
    "tainted": false,
    "filesize": "136972B",
    "numberPixels": "640000",
    "pixelsPerSecond": "40.1272MB",
    "userTime": "0.020u",
    "elapsedTime": "0:01.015"
  }
}
```

eeeps commented 7 months ago

As per usual, Rick beat me to it. Different (older?) flavor:

SELECT
  url,
  JSON_QUERY(payload, '$._image_details') as image_details
FROM `httparchive.requests.2023_12_01_mobile` TABLESAMPLE SYSTEM (0.001 PERCENT)
WHERE JSON_QUERY(payload, '$._image_details') IS NOT NULL

results
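
One thing such rows could be used for is comparing what the crawler detected with what the URL suggests. A minimal Python sketch, assuming each row gives a url plus the _image_details JSON as a string. The example URL and the helper name are made up, and the JSON is a trimmed copy of the sample result posted earlier in this thread; only detected_type, magick.format and magick.quality are read.

```python
import json
import posixpath
from urllib.parse import urlparse

# Trimmed copy of the sample result posted earlier in the thread.
sample_details = json.loads("""
{
  "detected_type": "jpeg",
  "magick": { "format": "JPEG", "mimeType": "image/jpeg", "quality": 80 }
}
""")

# Assumed alias so that a ".jpg" URL counts as agreeing with "jpeg".
EXTENSION_ALIASES = {"jpg": "jpeg"}


def summarize(url: str, image_details: dict) -> dict:
    """Put the URL extension next to what the crawler found in the bytes."""
    extension = posixpath.splitext(urlparse(url).path)[1].lstrip(".").lower()
    detected = image_details.get("detected_type")
    magick = image_details.get("magick") or {}
    return {
        "url_extension": extension,
        "detected_type": detected,
        "magick_format": magick.get("format"),
        "quality": magick.get("quality"),
        # True when the bytes disagree with the name, as in the .heic case.
        "mismatch": detected is not None
        and EXTENSION_ALIASES.get(extension, extension) != detected,
    }


row = summarize("https://example.com/photo_m.heic", sample_details)
print(row)
```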

foolip commented 7 months ago

Thank you @rviscomi and @eeeps, my joy is boundless! I will play around with this.

foolip commented 7 months ago

After some terrible queries and intermediate tables I have a first result:

[Chart: JPEG quality]

Is this the right repo to ask questions like "why is _image_details sometimes missing?" and other things I'll need to figure out to refine this?

rviscomi commented 7 months ago

Yeah I think here is fine

cc @pmeenan

eeeps commented 7 months ago

@foolip not sure about venue (if I have a discussion that might require a chattier exploration, I generally start it in the HTTP Archive Slack), but @pmeenan is the person to ask about missing _image_details.

Interesting chart! I do worry though... the "quality" reported by ImageMagick's identify for JPEGs, like most 0-100 quality scales used by encoders, is arbitrary and IM- and JPEG-specific. It's based on the quantization tables IM finds in the file, which will mostly correlate with what people think "quality" means (a subjective evaluation of "how good" the output looks when compared with the input), but not at all exactly. Worse, this value doesn't line up with other formats or other tools. People expect "quality 80" to mean the same thing everywhere. It does not, even for tools that are only dealing with JPEGs, and once you're talking other formats, you're in another universe.
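
To illustrate why the number is tied to quantization tables rather than to perceived quality, here is a sketch of the encoder-side convention used by libjpeg (IJG), where a 1-100 setting only scales a base table. This is not ImageMagick's estimation code, which has to work backwards from whatever tables it finds in the file. Only the standard luminance table is shown, and the function names are made up.

```python
# Example luminance quantization table from the JPEG standard (Annex K),
# which libjpeg uses as its base table.
BASE_LUMINANCE = [
    16, 11, 10, 16, 24, 40, 51, 61,
    12, 12, 14, 19, 26, 58, 60, 55,
    14, 13, 16, 24, 40, 57, 69, 56,
    14, 17, 22, 29, 51, 87, 80, 62,
    18, 22, 37, 56, 68, 109, 103, 77,
    24, 35, 55, 64, 81, 104, 113, 92,
    49, 64, 78, 87, 103, 121, 120, 101,
    72, 92, 95, 98, 112, 100, 103, 99,
]


def scaling_factor(quality: int) -> int:
    """Map a 1-100 quality setting to a percentage scale (IJG convention)."""
    quality = max(1, min(100, quality))
    return 5000 // quality if quality < 50 else 200 - 2 * quality


def scaled_table(quality: int) -> list:
    """Scale the base table and clamp each divisor to the baseline range 1..255."""
    scale = scaling_factor(quality)
    return [min(255, max(1, (value * scale + 50) // 100)) for value in BASE_LUMINANCE]


for q in (50, 80, 100):
    print(q, scaled_table(q)[:8])
```

Under this convention quality 50 leaves the base table unchanged and quality 100 turns every divisor into 1, so almost nothing is discarded. An encoder that starts from different base tables can report the same number for quite different output.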

That said... the number of quality: 100 JPEGs here -- wow. Antipattern!

pmeenan commented 7 months ago

If you have examples for where it is missing I can take a look. It could happen if for some reason the image response body isn't available or the code that detects the image type by looking at the header bytes doesn't recognize it.

heif is definitely not detected but the others should be reasonably up to date.
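
For anyone following along, detecting the type "by looking at the header bytes" generally means matching well-known file signatures. A minimal Python sketch of the idea follows. It is not the crawler's code: the function name is made up, only the major brand of an ISO BMFF file is checked, and the HEIF/AVIF brand lists are a small illustrative subset.

```python
from typing import Optional


def sniff_image_type(head: bytes) -> Optional[str]:
    """Guess an image type from the first few bytes of a response body."""
    if head.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if head.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if head[:6] in (b"GIF87a", b"GIF89a"):
        return "gif"
    if head[:4] == b"RIFF" and head[8:12] == b"WEBP":
        return "webp"
    # ISO BMFF containers: 4-byte box size, then "ftyp", then the major brand.
    if head[4:8] == b"ftyp":
        brand = head[8:12]
        if brand in (b"avif", b"avis"):
            return "avif"
        if brand in (b"heic", b"heix", b"hevc", b"hevx"):
            return "heif"
    return None


# A file named "*.heic" whose bytes start with the JPEG marker is still a JPEG.
print(sniff_image_type(b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"))
```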