aimhubio / aim

Aim 💫 — An easy-to-use & supercharged open-source experiment tracker.
https://aimstack.io
Apache License 2.0

`aim up` hangs forever after a server crash #3056

Open jeffwillette opened 8 months ago

jeffwillette commented 8 months ago

🐛 Bug

aim up hangs forever without starting the server. This happened after a server crash while aim was running. I think nothing was being written during the crash, so it was most likely an idling aim instance.

To reproduce

I don't know if this can be reliably reproduced

Expected behavior

I would expect aim to start up normally after a server crash.

Environment

Additional context

aim --verbose up --log-level INFO output:

Verbose mode is on
INFO  [sqlalchemy.engine.Engine] PRAGMA main.table_info("alembic_version")
INFO  [sqlalchemy.engine.Engine] [raw sql] ()
### forever hang

SGevorg commented 8 months ago

@jeffwillette thanks for raising the issue. @mihran113 @alberttorosyan please take a look at this whenever you can.

jeffwillette commented 7 months ago

It appears to be a problem with sqlite. The run_metadata.sqlite database reports that it is locked even though no process should be writing to it. This must somehow be a result of the crash.

I let the hang run for a while and it eventually ended in the error below, which triggered many more exceptions that all end in the same error:

Traceback (most recent call last):
  File "/c2/jeff/anaconda3/envs/set-ssl/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1910, in _execute_context
    self.dialect.do_execute(
  File "/c2/jeff/anaconda3/envs/set-ssl/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 736, in do_execute
    cursor.execute(statement, parameters)
sqlite3.OperationalError: database is locked

I tried manually unlocking, backing up, and dumping the database, which I read should unlock it, but it doesn't. I moved the database to another filesystem and was able to read from it using sqlite3, but once I move it back to its location in the .aim folder, it is locked again.

This means it must have something to do with the filesystem, which is an NFS. The thing is, no process is actively connected to the database, so something must have persisted from the crash to keep it locked, but I cannot find what it is. Any ideas?
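
For anyone trying to confirm the same symptom without waiting for `aim up` to time out, here is a minimal sketch using only Python's standard `sqlite3` module. The function name `probe_sqlite_lock` and its return strings are hypothetical and are not part of aim; the path you pass in (for example the run_metadata.sqlite file inside the .aim folder) is up to you. It attempts a single read with a short timeout and reports whether SQLite answers with "database is locked".

```python
import sqlite3


def probe_sqlite_lock(db_path: str, timeout_s: float = 1.0) -> str:
    """Try one read on the SQLite file and return 'ok', 'locked', or 'error: ...'.

    Hypothetical helper for diagnosis only; it is not part of aim.
    Note that sqlite3.connect() creates the file if it does not exist.
    """
    try:
        # timeout is how long SQLite keeps retrying before raising "database is locked"
        conn = sqlite3.connect(db_path, timeout=timeout_s)
    except sqlite3.Error as exc:
        return f"error: {exc}"
    try:
        # Reading the schema table needs a SHARED lock, like the PRAGMA that hung above
        conn.execute("SELECT count(*) FROM sqlite_master").fetchone()
        return "ok"
    except sqlite3.OperationalError as exc:
        if "locked" in str(exc).lower():
            return "locked"
        return f"error: {exc}"
    finally:
        conn.close()
```

Running the probe against the file in place and again against a copy on a local disk should show whether the lock follows the file or the filesystem.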

jeffwillette commented 7 months ago

Seems related to #1865.

Also sqlite recommends not running a db on an NFS (https://www.sqlite.org/faq.html, https://www.sqlite.org/howtocorrupt.html). If you google around for this topic, it comes up with "don't do it" almost everywhere.

alberttorosyan commented 7 months ago

Hey @jeffwillette! Thanks for the additional input. Looking into this issue now. Is there any additional output when you run aim up --log-level DEBUG? On a separate note, do you recall when the crash happened? It could be a separate issue or somehow related to this one.

jeffwillette commented 7 months ago

@alberttorosyan, I think it was only the trace I posted above. I am almost certain this issue comes down to NFS and sqlite clashing with each other, but I wasn't sure how to proceed, so I just had to start over and delete the old repo (luckily there was nothing crucial in there).

The crash happened right before the problem came up. The servers unexpectedly lost power in a power outage, and when I got back and tried to fire up aim again, I was confronted with this error.

I think this might be quite dangerous for those running on an NFS. If anyone runs into this problem in the future, the only way I was able to get the sqlite database to unlock was to copy the file to a non-NFS drive, after which I was able to open the db manually and inspect the tables. So if any important information were in there, I guess the whole aim repo could be copied to that drive and it should theoretically work again.
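
The copy-and-inspect workaround can be sketched roughly as follows, using only the Python standard library. The helper name `list_tables_from_local_copy` is hypothetical, and the sketch assumes that the default temporary directory is on a local (non-NFS) disk; pass `local_dir` explicitly if that is not the case. It copies only the main database file, so any `-journal` or `-wal` sidecar files next to it are not carried over.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path
from typing import List, Optional


def list_tables_from_local_copy(db_path: str, local_dir: Optional[str] = None) -> List[str]:
    """Copy a SQLite file off its current filesystem and list the tables in the copy.

    Hypothetical helper, not part of aim. Assumes the destination directory
    is on a local disk rather than NFS.
    """
    if local_dir is not None:
        target_dir = Path(local_dir)
    else:
        # Assumption: the system temp directory is not NFS-mounted
        target_dir = Path(tempfile.mkdtemp(prefix="aim-db-copy-"))
    local_copy = target_dir / Path(db_path).name
    shutil.copy2(db_path, local_copy)  # plain file copy; takes no SQLite lock

    conn = sqlite3.connect(local_copy)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

If the table list comes back from the copy, the data itself is readable and the lock is tied to where the file lives, which matches what I saw.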

Anyway, if this is determined to solely be a sqlite/NFS issue, then feel free to close the issue.