fix: Consistent image ID hashes across machines

AdrianGonz97 commented 5 months ago

This PR fixes an issue that we've been experiencing in https://github.com/huntabyte/shadcn-svelte/pull/978 where the hash for the generated image ID is different when it's computed on different machines. The goal of that PR was to cache the images so that they can be utilized across our GH runners.

The source of the issue is the mtime stat from the image file, which seemingly differs from machine to machine, causing the hash to generate a different id, resulting in a cache MISS. Instead, mtime has been replaced for the size stat, which (in conjunction with the image config and file url) should provide sufficient uniqueness for the hash.

Quick Checklist
[x] I have read the contributing guidelines
[x] I have written new tests, as applicable (for bug fixes / features)
[ ] Docs have been added / updated (for bug fixes / features)
[x] I have added a changeset, if applicable
What kind of change does this PR introduce? (Bug fix, feature, docs update, ...) Bug fix
What is the new behavior (if this is a feature change)? Implements consistent image ID hashing across machines
Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?) No
Other information: I wasn't sure the best way to express this change in a test, so I've hardcoded the expected id string. I'm sure there's a better way to do it, so please feel free to modify it!

changeset-bot[bot] commented 5 months ago

🦋 Changeset detected

Latest commit: e43d13f31e1f98e95db02af1f6f045d678117d77

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package

| Name | Type | | --------------- | ----- | | vite-imagetools | Patch |

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

benmccann commented 5 months ago

Hmm, that's really weird. I'm surprised the mtime would be different across machines. Are they different OSes? Or is the mtime set based upon when you check the repo out from git. I'd love to better understand the issue before we switch away from it

codecov[bot] commented 5 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 95.48%. Comparing base (8790098) to head (a4aec62). Report is 2 commits behind head on main.

:exclamation: Current head a4aec62 differs from pull request most recent head e43d13f. Consider uploading reports for the commit e43d13f to get more accurate results

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #711 +/- ## ========================================== - Coverage 95.49% 95.48% -0.01% ========================================== Files 33 33 Lines 1288 1286 -2 Branches 226 226 ========================================== - Hits 1230 1228 -2 Misses 58 58 ``` | [Flag](https://app.codecov.io/gh/JonasKruckenberg/imagetools/pull/711/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jonas+Kruckenberg) | Coverage Δ | | |---|---|---| | [imagetools-core](https://app.codecov.io/gh/JonasKruckenberg/imagetools/pull/711/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jonas+Kruckenberg) | `95.48% <100.00%> (-0.01%)` | :arrow_down: | | [vite-imagetools](https://app.codecov.io/gh/JonasKruckenberg/imagetools/pull/711/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jonas+Kruckenberg) | `95.48% <100.00%> (-0.01%)` | :arrow_down: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jonas+Kruckenberg#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

AdrianGonz97 commented 5 months ago

Hmm, that's really weird. I'm surprised the mtime would be different across machines. Are they different OSes?

It's definitely strange 😄. The OSes are set to be the same as they're running from the same workflow, so I don't think it's that. Perhaps it's more related to timezones? 🤔

Or is the mtime set based upon when you check the repo out from git

I'd be happy to dig into this question when I get a chance!

AdrianGonz97 commented 5 months ago

Or is the mtime set based upon when you check the repo out from git

It seems like this assumption was correct, @benmccann! After checking out the repo on 2 separate instances (on the same machine), I'm ending up with two different mtime stats.

File stats 1:

```bash { stats: Stats { dev: 2080, mode: 33188, nlink: 1, uid: 1000, gid: 1000, rdev: 0, blksize: 4096, ino: 768440, size: 159643, blocks: 312, atimeMs: 1712710759640.3115, mtimeMs: 1712710733650.3157, ctimeMs: 1712710733650.3157, birthtimeMs: 1712710733650.3157, atime: 2024-04-10T00:59:19.640Z, mtime: 2024-04-10T00:58:53.650Z, ctime: 2024-04-10T00:58:53.650Z, birthtime: 2024-04-10T00:58:53.650Z } } ```

File stats 2:

```bash { stats: Stats { dev: 2080, mode: 33188, nlink: 1, uid: 1000, gid: 1000, rdev: 0, blksize: 4096, ino: 1598355, size: 159643, blocks: 312, atimeMs: 1712711152420.249, mtimeMs: 1712711150750.2493, ctimeMs: 1712711150750.2493, birthtimeMs: 1712711150750.2493, atime: 2024-04-10T01:05:52.420Z, mtime: 2024-04-10T01:05:50.750Z, ctime: 2024-04-10T01:05:50.750Z, birthtime: 2024-04-10T01:05:50.750Z } } ```

benmccann commented 4 months ago

Sorry for the long delay. I'm afraid I'd forgotten about this PR.

I'm nervous about using the file size because you could hit false positives with it. I'm not sure how expensive it would be to compute a hash for every file, but it seems safer.

I had an idea that we could use mtime first and then fallback to the hash if it's different, but that probably wouldn't work because it would require us storing two hashes.

AdrianGonz97 commented 4 months ago

I'm nervous about using the file size because you could hit false positives with it.

Fair enough!

I'm not sure how expensive it would be to compute a hash for every file, but it seems safer.

I agree. Testing locally, I modified it to use imageBuffer instead, which had no perceivable performance impact relative to the rest of build.

I also did some microbenchmarking for it (just in case), testing different image sizes to see their impact:

┌─────────┬───────────────────────────────────┬───────────┬────────────────────┬──────────┬─────────┐
│ (index) │ Task Name                         │ ops/sec   │ Average Time (ns)  │ Margin   │ Samples │
├─────────┼───────────────────────────────────┼───────────┼────────────────────┼──────────┼─────────┤
│ 0       │ 'hash with size - base (500kb)'   │ '231,842' │ 4313.26988522398   │ '±0.63%' │ 231843  │
│ 1       │ 'hash with buffer - base (500kb)' │ '5,979'   │ 167226.18963211094 │ '±0.23%' │ 5980    │
│ 2       │ 'hash with buffer - 11mb'         │ '165'     │ 6036291.090361429  │ '±0.43%' │ 166     │
│ 3       │ 'hash with buffer - 5mb'          │ '311'     │ 3215181.397435922  │ '±0.31%' │ 312     │
│ 4       │ 'hash with buffer - 2.5mb'        │ '1,099'   │ 909751.2799999929  │ '±0.20%' │ 1100    │
│ 5       │ 'hash with buffer - 1mb'          │ '2,599'   │ 384681.17346155015 │ '±0.40%' │ 2600    │
│ 6       │ 'hash with buffer - 500kb'        │ '5,875'   │ 170197.13359428805 │ '±0.28%' │ 5876    │
│ 7       │ 'hash with buffer - 350kb'        │ '6,425'   │ 155626.61889200658 │ '±0.25%' │ 6426    │
│ 8       │ 'hash with buffer - 100kb'        │ '16,379'  │ 61051.895238088415 │ '±0.44%' │ 16380   │
└─────────┴───────────────────────────────────┴───────────┴────────────────────┴──────────┴─────────┘

Here's the repo for it too: https://github.com/AdrianGonz97/imagetools-hashing-benchmarks

Hashing with the size (or mtime) stat is significantly faster, however, I would consider hashing with the image to be sufficiently fast enough for our needs. Especially since this operation doesn't execute hundreds of thousands of times per build.

So in practice, I think it's completely fine to hash with the image itself (even without the mtime optimization you described) as the performance impact is miniscule when compared to the rest of the build.

Side note: While writing the benchmark, I noticed that generateImageID is async when it doesn't need to be. ~~Do you have any idea why that's the case?~~ Scratch that, looking back at the history, it seems like it used to await a promise before the caching PR was introduced. Not a problem, I can fix that.

JonasKruckenberg / imagetools

fix: Consistent image ID hashes across machines #711

🦋 Changeset detected

Codecov Report