aptly-dev / aptly

aptly - Debian repository management tool
https://www.aptly.info/
MIT License

API allows concurrent repo publishes, resulting in incorrect by-hash files / Hash Sum Mismatch errors. #1163

Closed jmunson closed 1 year ago

jmunson commented 1 year ago

Detailed Description

We've been running into "Hash Sum Mismatch" errors when fetching the by-hash files. Looking at the logs, it appears that multiple publish jobs are running simultaneously and behaving inconsistently.

In some cases we'll see errors with package files already being linked, where the second publish job returns an error, but the repo continues to work fine:
{code}
Mar 10 18:48:49 host aptly[1025558]: Added: redactedpkg_2023.3.1_arm64 added
Mar 10 18:48:49 host aptly[1025558]: [GIN] 2023/03/10 - 18:48:49 | 200 | 1.278240474s | 1.2.3.4 | POST "/api/repos/v2_bullseye_stable_common/file/1678474127862-b'HG5YGLPCQ7Y6UJJN'/redactedpkg_2023.3.1_arm64.deb"
Mar 10 18:48:49 host aptly[1025558]: 2023/03/10 18:48:49 Executing task synchronously
Mar 10 18:48:49 host aptly[1025558]: Loading packages...
Mar 10 18:48:49 host aptly[1025558]: 2023/03/10 18:48:49 Executing task synchronously
Mar 10 18:48:49 host aptly[1025558]: Loading packages...
Mar 10 18:48:50 host aptly[1025558]: Generating metadata files and linking package files...
Mar 10 18:48:50 host aptly[1025558]: Generating metadata files and linking package files...
Mar 10 18:48:52 host aptly[1025558]: [GIN] 2023/03/10 - 18:48:52 | 500 | 2.501879938s | 1.2.3.4 | PUT "/api/publish/v2_bullseye/stable"
Mar 10 18:48:52 host aptly[1025558]: Error #01: unable to update: unable to process packages: link/redacted/pool/93/41/1e528dbc42b80bbcd0a34a23ce66_redactedpkg_2023.3.1_arm64.deb/redacted/public/v2/bullseye/pool/common/n/redactedpkg/redactedpkg_2023.3.1_arm64.deb: file exists
Mar 10 18:48:53 host aptly[1025558]: Finalizing metadata files...
Mar 10 18:48:59 host aptly[1025558]: Signing file 'Release' with gpg, please enter your passphrase when prompted:
Mar 10 18:48:59 host aptly[1025558]: Clearsigning file 'Release' with gpg, please enter your passphrase when prompted:
Mar 10 18:48:59 host aptly[1025558]: Cleaning up prefix "v2/bullseye" components secret, common...
Mar 10 18:49:01 host aptly[1025558]: [GIN] 2023/03/10 - 18:49:01 | 200 | 11.554931014s | 1.2.3.4 | PUT "/api/publish/v2_bullseye/stable"
{code}

Other times we've seen the .old files end up with incorrect checksums:
{code}
$ sha512sum *.old
5803824573ed6fc0d7b612b52243a68cc64a62936f7b0c8cd6081ea4702de9c4e4d5c15a2be368017994335037f872a55a407ee2fb7c881d6581d6f2b0ce7e2f Packages.bz2.old
1c0a2bccbb7611ccdcf3170bf968d39abc4d687af57e90373ef6d4d6286e2ba33f71b5abcc0b2e03712d745e2a12f7a308a2a16726b7a04c9fd35ba9e9a12e64 Packages.gz.old
7385092a8ac88326e8ae6ea25d385a45b738000fb83c4eab1b54f7f68c9bd4aaa4326dd68c2e4c76fe6c50311c9f57b31fab9bd3953ae96e22e95636c2be0e26 Packages.old
7d8a4ec1544783b8fa03b95bb6ed3d0d36cec15405244c74e5ce727c717e9937c7c7cfbd6291842fc8df17e7baf5ae2b90747a35789373f685962d4b2eb83c04 Release.old
$ ls -l *.old
lrwxrwxrwx 1 aptly aptly 211 Mar 10 13:58 Packages.bz2.old -> /redacted/common/binary-amd64/by-hash/SHA512/b8b70c64b71c36829770cfe46b956bd2ac0bbe8b45d898fff0296eaadeb17cecbc41d61a754ae3de8f86754eafe39795ad359eb28575959a83ea8ddcf1734a05
lrwxrwxrwx 1 aptly aptly 211 Mar 10 13:58 Packages.gz.old -> /redacted/common/binary-amd64/by-hash/SHA512/dff9c6a92741668876c63cd731facd44508257a881653e93166140790756de9898a5d238dbdcbc7e07d8ca07e9f9472fe958c64d7653d9e00fb80a55921f0aef
lrwxrwxrwx 1 aptly aptly 211 Mar 10 13:58 Packages.old -> /redacted/common/binary-amd64/by-hash/SHA512/7385092a8ac88326e8ae6ea25d385a45b738000fb83c4eab1b54f7f68c9bd4aaa4326dd68c2e4c76fe6c50311c9f57b31fab9bd3953ae96e22e95636c2be0e26
lrwxrwxrwx 1 aptly root 211 Aug 3 2021 Release.old -> /redacted/common/binary-amd64/by-hash/SHA512/7d8a4ec1544783b8fa03b95bb6ed3d0d36cec15405244c74e5ce727c717e9937c7c7cfbd6291842fc8df17e7baf5ae2b90747a35789373f685962d4b2eb83c04

$ sha512sum 7d8a4ec1544783b8fa03b95bb6ed3d0d36cec15405244c74e5ce727c717e9937c7c7cfbd6291842fc8df17e7baf5ae2b90747a35789373f685962d4b2eb83c04 7385092a8ac88326e8ae6ea25d385a45b738000fb83c4eab1b54f7f68c9bd4aaa4326dd68c2e4c76fe6c50311c9f57b31fab9bd3953ae96e22e95636c2be0e26 dff9c6a92741668876c63cd731facd44508257a881653e93166140790756de9898a5d238dbdcbc7e07d8ca07e9f9472fe958c64d7653d9e00fb80a55921f0aef b8b70c64b71c36829770cfe46b956bd2ac0bbe8b45d898fff0296eaadeb17cecbc41d61a754ae3de8f86754eafe39795ad359eb28575959a83ea8ddcf1734a05
7d8a4ec1544783b8fa03b95bb6ed3d0d36cec15405244c74e5ce727c717e9937c7c7cfbd6291842fc8df17e7baf5ae2b90747a35789373f685962d4b2eb83c04 7d8a4ec1544783b8fa03b95bb6ed3d0d36cec15405244c74e5ce727c717e9937c7c7cfbd6291842fc8df17e7baf5ae2b90747a35789373f685962d4b2eb83c04
7385092a8ac88326e8ae6ea25d385a45b738000fb83c4eab1b54f7f68c9bd4aaa4326dd68c2e4c76fe6c50311c9f57b31fab9bd3953ae96e22e95636c2be0e26 7385092a8ac88326e8ae6ea25d385a45b738000fb83c4eab1b54f7f68c9bd4aaa4326dd68c2e4c76fe6c50311c9f57b31fab9bd3953ae96e22e95636c2be0e26
1c0a2bccbb7611ccdcf3170bf968d39abc4d687af57e90373ef6d4d6286e2ba33f71b5abcc0b2e03712d745e2a12f7a308a2a16726b7a04c9fd35ba9e9a12e64 dff9c6a92741668876c63cd731facd44508257a881653e93166140790756de9898a5d238dbdcbc7e07d8ca07e9f9472fe958c64d7653d9e00fb80a55921f0aef
5803824573ed6fc0d7b612b52243a68cc64a62936f7b0c8cd6081ea4702de9c4e4d5c15a2be368017994335037f872a55a407ee2fb7c881d6581d6f2b0ce7e2f b8b70c64b71c36829770cfe46b956bd2ac0bbe8b45d898fff0296eaadeb17cecbc41d61a754ae3de8f86754eafe39795ad359eb28575959a83ea8ddcf1734a05
{code}
Note that Packages.old and Release.old point to files whose contents match their SHA512 filenames, but Packages.gz.old and Packages.bz2.old point to files whose contents do not match their filenames.

In one case we've had a correct Packages.old file, but Packages.gz.old was the same as Packages.gz, and Packages.bz2.old was the same as Packages.bz2.

We've also seen this error in logs:
{code}
Mar 09 23:35:09 host aptly[1025558]: [GIN] 2023/03/09 - 23:35:09 | 200 | 11.530434424s | 1.2.3.4 | PUT "/api/publish/v2_bullseye/stable"
Mar 09 23:35:10 host aptly[1025558]: Signing file 'Release' with gpg, please enter your passphrase when prompted:
Mar 09 23:35:10 host aptly[1025558]: Clearsigning file 'Release' with gpg, please enter your passphrase when prompted:
Mar 09 23:35:10 host aptly[1025558]: [GIN] 2023/03/09 - 23:35:10 | 500 | 9.917119908s | 1.2.3.4 | PUT "/api/publish/v2_bullseye/stable"
Mar 09 23:35:10 host aptly[1025558]: Error #01: unable to update: unable to rename: rename /redacted/public/v2/bullseye/dists/stable/secret/binary-arm64/Release.tmp /redacted/public/v2/bullseye/dists/stable/secret/binary-arm64/Release: no such file or directory
Mar 09 23:35:11 host aptly[1025558]: Signing file 'Release' with gpg, please enter your passphrase when prompted:
Mar 09 23:35:11 host aptly[1025558]: Clearsigning file 'Release' with gpg, please enter your passphrase when prompted:
Mar 09 23:35:11 host aptly[1025558]: [GIN] 2023/03/09 - 23:35:11 | 500 | 9.682278226s | 1.2.3.4 | PUT "/api/publish/v2_bullseye/stable"
Mar 09 23:35:11 host aptly[1025558]: Error #01: unable to update: unable to rename: rename /redacted/public/v2/bullseye/dists/stable/common/binary-amd64/Packages.tmp /redacted/public/v2/bullseye/dists/stable/common/binary-amd64/Packages: no such file or directory
{code}

Context

This matters to me because, although the latest data should still work and retrying apt-get update should therefore fix it, many systems expect apt-get to Just Work and do not retry on failure. That makes this rather impactful, especially in CI environments, where a lot of work done before the failed apt-get update may get thrown away.

Caching proxies can also increase the window in which this situation is impactful.

Ultimately, acquire-by-hash was introduced precisely to make it so that fetching repos that have been updated works smoothly, with the strong expectation that by-hash file contents never change. This issue breaks that assumption.

Possible Implementation

The easiest way is likely to ensure there are appropriate locks, so that subsequent API calls block until they can run, even if that means timing out and failing. Ideally there would be some system to queue them up properly so they all go out, or potentially a way to cleanly interrupt a publish and replace it with a newer publish that includes the previous changes.

Potentially some refactoring of how Packages/by-hash files are handled to ensure we never update a by-hash file once it's been written, and maybe some additional configuration on how many are retained.

Your Environment

Debian bullseye, aptly 1.5.0

We've seen Hash Sum Mismatch errors sporadically on v1.4.0 that likely had the same cause. It did seem to happen more often after we upgraded to v1.5.0, but I do not have good historical metrics to show this, and it could just be due to different access patterns and the growth of the underlying repositories.

randombenj commented 1 year ago

Probably related to: https://github.com/aptly-dev/aptly/issues/1125

jmunson commented 1 year ago

That definitely looks like the same problem, so I'll go ahead and close this as a dupe.