Open jeffwillette opened 1 year ago
@jeffwillette thanks for raising the issue. @mihran113 @alberttorosyan please take a look at this whenever you can.
It appears to be a problem with sqlite. The run_metadata.sqlite database reports that it is locked even though no process should be writing to it. This must somehow be a result of the crash.
I let the hang run for a while, and it eventually ended in this error, which triggered many more exceptions that all end in the same error:
Traceback (most recent call last):
  File "/c2/jeff/anaconda3/envs/set-ssl/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1910, in _execute_context
    self.dialect.do_execute(
  File "/c2/jeff/anaconda3/envs/set-ssl/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 736, in do_execute
    cursor.execute(statement, parameters)
sqlite3.OperationalError: database is locked
I tried manually unlocking, backing up, and dumping the database, which I read should unlock it, but none of that worked. I moved the database to another filesystem and was able to read from it using sqlite3, but once I moved it back to its location in the .aim folder, it was locked again.
This means it must have something to do with the filesystem, which is NFS. The thing is, no process is actively connected to the database, so something must have persisted from the crash to keep it locked, but I cannot find what it is. Any ideas?
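For anyone who wants to check the lock state without starting aim, here is a minimal sketch using only Python's built-in sqlite3 module. The function name `is_locked` and the one-second timeout are my own choices for illustration, not anything aim provides. It tries to take a write lock and gives up quickly instead of hanging.

```python
import sqlite3


def is_locked(db_path, timeout=1.0):
    """Return True if a write lock on the SQLite file cannot be obtained.

    Hypothetical helper for diagnosis only. Note that sqlite3.connect
    creates the file if it does not exist yet.
    """
    # isolation_level=None puts the connection in autocommit mode, so the
    # BEGIN/ROLLBACK statements below are the only transaction control.
    conn = sqlite3.connect(db_path, timeout=timeout, isolation_level=None)
    try:
        # BEGIN IMMEDIATE requests the write lock right away and fails
        # with "database is locked" once the timeout expires.
        conn.execute("BEGIN IMMEDIATE")
        conn.execute("ROLLBACK")
        return False
    except sqlite3.OperationalError as exc:
        if "locked" in str(exc):
            return True
        raise
    finally:
        conn.close()
```

Call it with the path to the run_metadata.sqlite file inside your .aim folder. If it returns True while no aim process is running, the lock is presumably being held at the filesystem level rather than by a live connection.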
Seems related to #1865.
Also sqlite recommends not running a db on an NFS (https://www.sqlite.org/faq.html, https://www.sqlite.org/howtocorrupt.html). If you google around for this topic, it comes up with "don't do it" almost everywhere.
Hey @jeffwillette! Thanks for the additional input. Looking into this issue now.
Is there any additional output when you run aim up --log-level DEBUG?
On a separate note, do you recall when the crash happened? It could be a separate issue or somehow related to this one.
@alberttorosyan, I think it was only the trace I posted above. I am almost certain this issue comes down to NFS and sqlite clashing with each other, but I wasn't sure how to proceed so I just had to start over and delete the old repo (lucky there was nothing crucial in there).
The crash happened right before the problem came up. The servers unexpectedly lost power during an outage, and when I got back and tried to fire up aim again, I was confronted with this error.
I think this might be quite dangerous for those running on an NFS. If anyone runs into this problem in the future, the only way I was able to get the sqlite database to unlock was to copy the file to a non-NFS drive, after which I was able to open the db manually and inspect the tables. So if any important information were in there, I guess the whole aim repo could be copied to that drive and it should theoretically work again.
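The copy-and-inspect workaround can be sketched roughly as follows, using only the Python standard library. The function name `inspect_copy` and the scratch directory are hypothetical, and the sketch assumes the default temporary directory is on a local (non-NFS) disk, which may not be true on every machine. It copies only the main database file, so if the database has sidecar files such as a -wal file next to it, those would need to be copied as well.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def inspect_copy(db_path, scratch_dir=None):
    """Copy a SQLite file off NFS and list the tables in the copy.

    Hypothetical helper: scratch_dir should point at a local disk;
    by default a fresh temporary directory is created.
    """
    src = Path(db_path)
    scratch = Path(scratch_dir or tempfile.mkdtemp(prefix="aim-db-copy-"))
    dst = (scratch / src.name).resolve()
    shutil.copy2(src, dst)

    # Open the copy read-only so the inspection cannot modify it.
    conn = sqlite3.connect(dst.as_uri() + "?mode=ro", uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return dst, [name for (name,) in rows]
```

The original file is left untouched, so this is safe to try before deciding whether to move the whole repo.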
Anyway, if this is determined to solely be a sqlite/NFS issue, then feel free to close the issue.
🐛 Bug
aim up hangs forever without starting the server. This happened after a server crash while aim was running. I think nothing was being written during the crash, so it was most likely an idling aim instance.
To reproduce
I don't know if this can be reliably reproduced.
Expected behavior
I would expect aim to start up normally after a server crash.
Environment
Additional context
aim --verbose up --log-level INFO
output: