Open esko997 opened 5 years ago
I solved this issue by restoring a backup of the db, but figured maybe this information is valuable anyways.
After continuing to run into various instances of this issue, I seem to have found a fix (maybe better classified as a workaround). As it turns out, the fact that I Ctrl+C'd the previously running process had nothing to do with the outcome.
I've included a `sleep 1` between `aptly snapshot drop` calls and the current running job appears to be getting further along than previous tests. I will confirm/update again after the current pruning job completes.
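For illustration, the pause between drops could look roughly like the sketch below. This is not the actual prune script from this thread. The function name and the `APTLY` and `PAUSE_SECONDS` variables are made up for the example; only `aptly snapshot drop` and `sleep 1` come from the comment above.

```shell
# Sketch only: drop snapshots one at a time, pausing after each call.
# APTLY and PAUSE_SECONDS are illustrative knobs, not aptly settings.
APTLY="${APTLY:-aptly}"
PAUSE_SECONDS="${PAUSE_SECONDS:-1}"

drop_snapshots_slowly() {
    for name in "$@"; do
        # Stop at the first failure instead of continuing blindly.
        "$APTLY" snapshot drop "$name" || return 1
        sleep "$PAUSE_SECONDS"
    done
}
```

It would be called with the snapshot names to remove, e.g. `drop_snapshots_slowly old-snap-1 old-snap-2` (names hypothetical).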
After doing some more testing, this seems to have been a LevelDB ulimit issue. Will update again with final confirmation, but it looks like raising the ulimit on the server in question has resolved the issue.
We've just encountered the same issue on our production Aptly server after 1-2 years in operation. Several daily and weekly snapshots are created, including a full Ubuntu mirror and a few repos. The DB is currently 3 GB in size.
The problem first occurred when our monitoring check script reported that the MANIFEST was corrupted. After running `aptly db recover` we could no longer see any resources at all; mirrors, repos, etc. were gone.
Aptly didn't complain about any ulimit limitations etc., but `strace` gave some hints in that direction. Based on this GH issue we decided to increase the max open files limit from `1024` to `32768`, which fixed everything:
repo soft nofile 32768
repo hard nofile 32768
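For anyone wanting to confirm the new limit is actually in effect, here is a minimal sketch, assuming a POSIX shell started as the user that runs aptly (`repo` in the lines above looks like that user). `nofile_at_least` is a hypothetical helper name; `ulimit -n` reports the soft limit of the current shell.

```shell
# Sketch only: succeed if the soft open-files limit of this shell
# is at least the number given as the first argument.
nofile_at_least() {
    wanted="$1"
    current=$(ulimit -n)
    # Some systems report "unlimited" instead of a number.
    if [ "$current" = "unlimited" ]; then
        return 0
    fi
    [ "$current" -ge "$wanted" ]
}
```

For example, `nofile_at_least 32768 || echo "open files limit still too low"`. Note that `limits.conf` entries are normally applied to PAM login sessions, so a fresh login is needed, and a service started by systemd would take its limit from `LimitNOFILE=` in the unit instead.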
Thanks a lot for the hint @esko997 !
As a feature request could Aptly be more verbose about such issues and not just continue to "work"? Or did I miss some log entries or similar?
@karras it sounds like we have very similar aptly deployments.
For the sake of posterity, the final iteration of the snapshot prune script that started this thread sleeps after every n snapshots dropped (where n is something like total snapshots / 10), to give the system time to close open file handles. We've seen lower overall prune job times with this addition to the pruner script. Hopefully this information helps.
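A rough sketch of that batching idea, for anyone who wants a starting point. This is not the real pruner script. The function name, `APTLY`, and `PAUSE_SECONDS` (including its default of 5 seconds) are assumptions for the example; n is computed as total / 10 with a floor of 1.

```shell
# Sketch only: drop snapshots, sleeping after every n drops,
# where n is roughly one tenth of the total.
APTLY="${APTLY:-aptly}"
PAUSE_SECONDS="${PAUSE_SECONDS:-5}"

drop_snapshots_in_batches() {
    total=$#
    batch=$(( total / 10 ))
    # With fewer than 10 snapshots, pause after every drop.
    [ "$batch" -ge 1 ] || batch=1
    count=0
    for name in "$@"; do
        "$APTLY" snapshot drop "$name" || return 1
        count=$(( count + 1 ))
        if [ $(( count % batch )) -eq 0 ]; then
            sleep "$PAUSE_SECONDS"
        fi
    done
}
```

The pause length that works will depend on the system, so it is worth treating the value as something to tune rather than a recommendation.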
Hi all, I ran into this issue as well. My script was deleting thousands of snapshots in a loop. Suddenly, everything was gone. I'm using aptly 1.4.0. I have a backup, and @esko997 and @karras helped a lot with their comments. I will try their workarounds. However, I think that this issue is worth investigating and solving, since data loss is critical. Are there any plans in that direction? Best, Jan
Using aptly 1.4.0 here in a CI/CD environment. Started to get these issues today, after about 2 years of constant use. We don't keep old snapshots and run constant db cleanups, but still have reached this limit. I'm unsure if this should be considered an aptly issue, unless this actually is happening due to some kind of "leak" in the DB. A bigger DB should be expected to require more open files... Perhaps a note on the readme with advice on the open file limits?
@rzippert I agree this is not /really/ an aptly bug. With that said, it can be scary from a user perspective when it looks like your entire deployment disappeared. I think some kind of communication around it would be useful.
That might come in the form of either a note in the documentation about awareness of open file limits, or a warning/error if aptly can't open a file. I'm not familiar with the codebase or golang but I'll take a look and if I feel up to it submit a PR along the above lines.
After fat-fingering a running snapshot prune process (`aptly snapshot drop`), I am unable to view anything in the relevant aptly instance. The snapshot in question was not being used by a published endpoint. I got the following error when first running a script:
ERROR: unable to load list of repos: snapshot with uuid e87d3b31-a1d9-4e3d-aef6-849ac840f7c1 not found
After, I tried the following and got the below results:
I have already run aptly db recover which ran successfully (I backed up the database before this). Additionally, there are still 10262 files in the db/ directory.
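Given where this thread ended up (open file limits), a quick way to put the file count next to the limit is sketched below. `count_db_files` is a hypothetical helper, and the database path is an assumption that has to be adjusted to the instance's root directory. The comparison is only a rough indicator, since the database does not necessarily keep every file open at once.

```shell
# Sketch only: count regular files under a directory (e.g. aptly's db/).
count_db_files() {
    find "$1" -type f | wc -l | tr -d ' '
}
```

For example: `echo "db files: $(count_db_files "$HOME/.aptly/db"), soft nofile limit: $(ulimit -n)"`, where `$HOME/.aptly/db` assumes the default root directory.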
This aptly server is currently running on Ubuntu 16.04 and was recently upgraded from aptly 0.9.7 to aptly 1.3.0.
Any assistance is greatly appreciated.