aptly-dev / aptly

aptly - Debian repository management tool
https://www.aptly.info/
MIT License

Can no longer find snapshots/mirrors/published endpoints after Ctrl+C during snapshot drop #818

Open esko997 opened 5 years ago

esko997 commented 5 years ago

After fat-fingering Ctrl+C on a running snapshot prune process (aptly snapshot drop), I am unable to view anything in the relevant aptly instance. The snapshot in question was not being used by a published endpoint.

I got the following error when first running a script: ERROR: unable to load list of repos: snapshot with uuid e87d3b31-a1d9-4e3d-aef6-849ac840f7c1 not found

Afterwards, I ran the following commands and got these results:

user@hostname:/srv/wwws/packages/mirror# aptly publish list -config=/etc/aptly/aptlyMirror.conf
No snapshots/local repos have been published. Publish a snapshot by running aptly publish snapshot ...

user@hostname:/srv/wwws/packages/mirror# aptly snapshot list -config=/etc/aptly/aptlyMirror.conf
No snapshots found, create one with aptly snapshot create...

user@hostname:/srv/wwws/packages/mirror# aptly mirror list -config=/etc/aptly/aptlyMirror.conf
No mirrors found, create one with aptly mirror create ...

I have already run aptly db recover which ran successfully (I backed up the database before this). Additionally, there are still 10262 files in the db/ directory.

This aptly server is currently running on Ubuntu 16.04 and was recently upgraded from aptly 0.9.7 to aptly 1.3.0.

Any assistance is greatly appreciated.

esko997 commented 5 years ago

I solved this issue by restoring a backup of the db, but figured maybe this information is valuable anyways.

esko997 commented 5 years ago

After continuing to run into various instances of this issue, I seem to have found a fix (maybe better classified as a workaround). As it turns out, the fact that I Ctrl+C'd the previously running process had nothing to do with the outcome.

I've included a sleep 1 between aptly snapshot drop calls and the current running job appears to be getting further along than previous tests. I will confirm/update again after the current pruning job completes.

esko997 commented 5 years ago

After doing some more testing, this seems to have been a LevelDB ulimit issue. Will update again with final confirmation, but it looks like raising the ulimit on the server in question has resolved the issue.
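For anyone wanting to check whether their own server is close to the same situation, here is a minimal sketch that reads the open-file limit of the current process. It assumes a Unix-like system (Python's `resource` module); the helper names and the 4096 threshold are illustrative assumptions, not values recommended by aptly.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def is_limit_low(soft, minimum=4096):
    """Return True if the soft limit is below `minimum`.

    `minimum` is an arbitrary illustrative threshold, not an aptly value.
    An unlimited soft limit is never considered low.
    """
    if soft == resource.RLIM_INFINITY:
        return False
    return soft < minimum
```

Calling `is_limit_low(nofile_limits()[0])` from the same account and environment that runs aptly (for example at the top of a cron wrapper) gives a quick yes/no before a large prune job starts.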

karras commented 4 years ago

We've just encountered the same issue on our production Aptly server after 1-2 years in operation. Several daily and weekly snapshots are created, including a full Ubuntu mirror and a few repos. The DB is currently 3 GB in size.

The problem first occurred when our monitoring check script reported that the MANIFEST was corrupted. After running aptly db recover we could no longer see any resources at all; mirrors, repos, etc. were gone.

Aptly didn't complain about any ulimit limitations etc., but strace gave some hints in that direction. Based on this GH issue we decided to increase the max open files limit from 1024 to 32768, which fixed everything:

repo       soft    nofile    32768
repo       hard    nofile    32768
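If editing the system-wide limits is not an option, a wrapper script can raise its own soft limit up to the hard limit before launching aptly, since child processes inherit the limits of their parent. The following is a sketch under that assumption; the function names are hypothetical, and the default of 32768 is simply the value quoted above. An unprivileged process cannot exceed the hard limit, so the result may be lower than requested.

```python
import resource


def desired_soft_limit(soft, hard, wanted=32768):
    """Pick a new soft limit: never lower than the current one,
    never higher than the hard limit."""
    inf = resource.RLIM_INFINITY
    if soft == inf:
        return soft  # already unlimited, nothing to do
    target = max(soft, wanted)
    if hard != inf:
        target = min(target, hard)
    return target


def raise_nofile_limit(wanted=32768):
    """Raise this process's soft open-file limit towards `wanted`.

    Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    new_soft = desired_soft_limit(soft, hard, wanted)
    if new_soft != soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return new_soft, hard
```

A wrapper would call `raise_nofile_limit()` once and then start aptly with `subprocess.run(...)`; the child then runs with the raised soft limit.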

Thanks a lot for the hint @esko997 !

As a feature request, could Aptly be more verbose about such issues and not just continue to "work"? Or did I miss some log entries or similar?

esko997 commented 4 years ago

@karras it sounds like we have very similar aptly deployments.

For the sake of posterity, the final iteration of the snapshot prune script that started this thread sleeps after every n snapshots dropped (where n is something like total snapshots / 10), to give the system time to close open file handles. We've seen lower overall prune job times with this addition to the pruner script; hopefully this information helps.
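The actual prune script is not shown in this thread, so the following is only a sketch of the batching idea described above. The function names, the one-second pause, and the `drop`/`sleep` parameters (which are there to make the loop testable without aptly) are assumptions; the only aptly command used is `aptly snapshot drop <name>`.

```python
import subprocess
import time


def drop_snapshot(name):
    """Drop one snapshot via the aptly CLI; raises if aptly exits non-zero."""
    subprocess.run(["aptly", "snapshot", "drop", name], check=True)


def prune(names, drop=drop_snapshot, sleep=time.sleep,
          pause_seconds=1.0, batches=10):
    """Drop every snapshot in `names`, pausing after each batch.

    The batch size is roughly len(names) / batches, as described above.
    No pause is taken after the final snapshot.
    """
    names = list(names)
    batch_size = max(1, len(names) // batches)
    for count, name in enumerate(names, start=1):
        drop(name)
        if count % batch_size == 0 and count < len(names):
            sleep(pause_seconds)
```

Because `check=True` aborts the loop on the first failed drop, a broken database state stops the job early instead of being hit thousands of times in a row.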

JanReimerD commented 4 years ago

Hi all, I ran into this issue as well. My script was deleting thousands of snapshots in a loop. Suddenly, everything was gone. I'm using aptly 1.4.0. I have a backup, and @esko997 and @karras helped a lot with their comments. I will try their workarounds. However, I think that this issue is worth investigating and solving, since data loss is critical. Are there any plans in that direction? Best, Jan

rzippert commented 3 years ago

Using aptly 1.4.0 here in a CI/CD environment. Started to get these issues today, after about 2 years of constant use. We don't keep old snapshots and run constant db cleanups, but still have reached this limit. I'm unsure if this should be considered an aptly issue, unless this actually is happening due to some kind of "leak" in the DB. A bigger DB should be expected to require more open files... Perhaps a note on the readme with advice on the open file limits?

esko997 commented 3 years ago

@rzippert I agree this is not /really/ an aptly bug. With that said, it can be scary from a user perspective when it looks like your entire deployment disappeared. I think some kind of communication around it would be useful.

That might come in the form of either a note in the documentation about awareness of open file limits, or a warning/error if aptly can't open a file. I'm not familiar with the codebase or golang, but I'll take a look and, if I feel up to it, submit a PR along the above lines.
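To illustrate what such a warning could look like, here is a language-neutral sketch written in Python. It is not aptly's code (aptly is written in Go), and every name in it is hypothetical; it only shows the general idea of recognising the "too many open files" error codes and turning them into an explicit message instead of carrying on.

```python
import errno


def explain_open_error(exc):
    """Return a hint if `exc` indicates file-descriptor exhaustion, else None."""
    if isinstance(exc, OSError) and exc.errno in (errno.EMFILE, errno.ENFILE):
        return ("too many open files: the open-file limit (ulimit -n) "
                "is probably too low for this database")
    return None


def open_or_explain(path, mode="rb"):
    """Open `path`; on descriptor exhaustion, fail loudly with a hint."""
    try:
        return open(path, mode)
    except OSError as exc:
        hint = explain_open_error(exc)
        if hint:
            raise RuntimeError(hint) from exc
        raise
```

The point of the design is that descriptor exhaustion is reported as its own error rather than being treated like a missing or empty database.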