aimhubio / aim

Aim 💫 — An easy-to-use & supercharged open-source experiment tracker.
https://aimstack.io
Apache License 2.0
5.16k stars, 316 forks

runs stuck in progress #2995

Open Laiaborrell opened 1 year ago

Laiaborrell commented 1 year ago

When I run aim up and check the runs in the UI, 4 out of 5 runs appear stuck "in progress" (green dots) even though they have already finished training; only the last run trained is tagged as "finished": [image] A couple of days ago, when I last checked the state of the trainings, those runs were already tagged as finished, but somehow they have been reactivated since. Because of this, when I open these runs to check the metrics and figures, a pop-up appears with the message "Error. Run not found": [image] Note that no error is printed in the terminal where the aim up command is running.

I would really appreciate any help, thanks!

mihran113 commented 1 year ago

Hey @Laiaborrell! Thanks a lot for the report. That seems kinda strange, as there's no scenario in which runs can reactivate by themselves. My only guess is that someone tried to delete the runs and something went wrong during deletion, which would explain why the runs are reported as not found. Aim stores run data in two databases (SQLite and RocksDB), so I think the RocksDB portion of the data was removed while the SQLite data is still there. You can verify this by checking whether the ./aim/meta/chunks/{run_hash} directory still exists.
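For anyone who wants to run the check described above over several runs at once, here is a minimal sketch using only the Python standard library. The function name find_missing_chunk_dirs is hypothetical, and the repository path and run hashes are placeholders to replace with your own; the meta/chunks/{run_hash} layout is taken from the comment above.

```python
from pathlib import Path
from typing import Iterable, List


def find_missing_chunk_dirs(repo_dir: str, run_hashes: Iterable[str]) -> List[str]:
    """Return the run hashes that have no directory under <repo_dir>/meta/chunks."""
    chunks_dir = Path(repo_dir) / "meta" / "chunks"
    # A hash counts as missing when its chunk directory does not exist.
    return [h for h in run_hashes if not (chunks_dir / h).is_dir()]


# Example usage (placeholders; substitute your repository path and the
# hashes of the runs that the UI reports as "not found"):
#   missing = find_missing_chunk_dirs("./aim", ["<run_hash_1>", "<run_hash_2>"])
#   print(missing)
```

An empty result means every listed run still has its chunk directory, which would rule out the partial-deletion guess.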

Laiaborrell commented 1 year ago

Hey @mihran113, thanks for your reply! I checked, and the hashes for the runs are still in the chunks folder: [image] I did not try to delete any of the files either :/ It is weird to me because the runs appear as active and the run time keeps increasing (8 days now), but the GPU where the process was training has been stopped. Also, the files in the chunk folders were last updated three days ago, when training finished.
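The observation about modification times can be checked systematically. The following is a rough sketch, not an Aim feature: it walks each run directory under the chunks folder and reports the ones whose files have not been modified for longer than a chosen threshold. The names latest_mtime and stale_runs are hypothetical, and treating an old modification time as evidence that a run has finished is an assumption.

```python
import os
import time
from pathlib import Path
from typing import List, Optional


def latest_mtime(directory: str) -> float:
    """Return the newest modification time (epoch seconds) found in directory."""
    newest = os.path.getmtime(directory)
    for root, _dirs, files in os.walk(directory):
        for name in files:
            newest = max(newest, os.path.getmtime(os.path.join(root, name)))
    return newest


def stale_runs(chunks_dir: str, max_age_seconds: float,
               now: Optional[float] = None) -> List[str]:
    """Return names of run directories not modified within max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for entry in sorted(Path(chunks_dir).iterdir()):
        # Each subdirectory is assumed to be named after a run hash.
        if entry.is_dir() and now - latest_mtime(str(entry)) > max_age_seconds:
            stale.append(entry.name)
    return stale


# Example usage (path is a placeholder): runs untouched for more than 2 days.
#   print(stale_runs("./aim/meta/chunks", 2 * 24 * 3600))
```

Runs that show up here while the UI still marks them as "in progress" match the symptom described in this issue.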

Maximiliano-Villanueva commented 10 months ago

Hi @Laiaborrell, did you manage to solve this? I'm having the same issue using LangChain callbacks.

Laiaborrell commented 10 months ago

Hello @Maximiliano-Villanueva, I didn't manage to solve it; I had to relaunch the hyperparameter search, sorry about that. I hope someone else can help, as it would be useful for any future issues like this.

Michael-Tanzer commented 5 months ago

@mihran113 Do you know if there is any update on this? I also see the run in meta/chunks, and I am not able to delete the runs because they appear as online in the UI. It looks like restarting the server fixes the issue; I hope this is a useful piece of information for fixing it! Would it be possible to add a "force delete" button to force deletion of running runs?

ETA: when restarting the server, some runs still cannot be deleted because they are "locked".

mihran113 commented 5 months ago

@Michael-Tanzer Could you share the logs from the aim up command when the error happens (when "not found" is thrown while trying to open the run)? It would also be really helpful if you could share a scenario or example script in which this happens, so I can reproduce it on my end.

Michael-Tanzer commented 5 months ago

I have now deleted the problematic runs by deleting the lock manually and then deleting from the UI. I will share a log as soon as it happens again.
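The thread does not say where the lock lives, so the sketch below only searches for candidates and deletes nothing. It assumes, without confirmation from the Aim maintainers, that a lock file's name contains both the run hash and the word "lock"; find_lock_candidates is a hypothetical helper, and the path and hash are placeholders.

```python
from pathlib import Path
from typing import List


def find_lock_candidates(repo_dir: str, run_hash: str) -> List[Path]:
    """List files under repo_dir whose name contains the run hash and 'lock'.

    The naming pattern is an assumption, not documented Aim behaviour.
    """
    return sorted(
        p for p in Path(repo_dir).rglob("*")
        if p.is_file() and run_hash in p.name and "lock" in p.name.lower()
    )


# Example usage (placeholders): inspect the results by hand before removing
# anything, and stop any process that may still be writing to the run.
#   for path in find_lock_candidates("./aim", "<run_hash>"):
#       print(path)
```

Keeping this as a read-only listing is deliberate: removing a lock that a live training process still holds could corrupt the run's data.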

mihran113 commented 5 months ago

Let me know when that happens again, as it's pretty hard to reproduce, but the error should tell a lot about what's happening and it would help a lot.

mihran113 commented 5 months ago

Regarding the force delete, we'll consider implementing it for the next minor version, 3.20.0.