aimhubio / aim

Aim 💫 — An easy-to-use & supercharged open-source experiment tracker.
https://aimstack.io
Apache License 2.0

ui query returns aimrocks exception with some *.ldb file not found #2950

Open yh-xu opened 1 year ago

yh-xu commented 1 year ago

🐛 Bug

UI queries fail intermittently with an aimrocks exception. Once a single run returns this error, no other runs or metrics can be queried either. Logs from the aim ui container:

ERROR: Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py", line 428, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.9/site-packages/fastapi/applications.py", line 276, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.9/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.9/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/usr/local/lib/python3.9/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.9/site-packages/starlette/middleware/cors.py", line 91, in __call__
    await self.simple_response(scope, receive, send, request_headers=headers)
  File "/usr/local/lib/python3.9/site-packages/starlette/middleware/cors.py", line 146, in simple_response
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/usr/local/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/usr/local/lib/python3.9/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
    raise e
  File "/usr/local/lib/python3.9/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.9/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.9/site-packages/starlette/routing.py", line 443, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.9/site-packages/fastapi/applications.py", line 276, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.9/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.9/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/usr/local/lib/python3.9/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.9/site-packages/aim/web/api/utils.py", line 56, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.9/site-packages/starlette/middleware/gzip.py", line 24, in __call__
    await responder(scope, receive, send)
  File "/usr/local/lib/python3.9/site-packages/starlette/middleware/gzip.py", line 44, in __call__
    await self.app(scope, receive, self.send_with_gzip)
  File "/usr/local/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/usr/local/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/usr/local/lib/python3.9/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
    raise e
  File "/usr/local/lib/python3.9/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.9/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.9/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.9/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/usr/local/lib/python3.9/site-packages/fastapi/routing.py", line 237, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.9/site-packages/fastapi/routing.py", line 163, in run_endpoint_function
    return await dependant.call(**values)
  File "/usr/local/lib/python3.9/site-packages/aim/web/api/runs/views.py", line 168, in run_metric_batch_api
    traces_data = collect_requested_metric_traces(run, requested_traces)
  File "/usr/local/lib/python3.9/site-packages/aim/web/api/runs/utils.py", line 332, in collect_requested_metric_traces
    trace = run.get_metric(name=metric_name, context=context)
  File "/usr/local/lib/python3.9/site-packages/aim/sdk/run.py", line 515, in get_metric
    return self._get_sequence('metric', name, context)
  File "/usr/local/lib/python3.9/site-packages/aim/sdk/run.py", line 630, in _get_sequence
    return sequence if bool(sequence) else None
  File "/usr/local/lib/python3.9/site-packages/aim/sdk/sequence.py", line 332, in __bool__
    return bool(self.values)
  File "aim/storage/treearrayview.py", line 47, in aim.storage.treearrayview.TreeArrayView.__bool__
  File "aim/storage/treearrayview.py", line 41, in aim.storage.treearrayview.TreeArrayView.__len__
  File "aim/storage/treearrayview.py", line 117, in aim.storage.treearrayview.TreeArrayView.last_idx
  File "aim/storage/containertreeview.py", line 196, in aim.storage.containertreeview.ContainerTreeView.last_key
  File "aim/storage/container.py", line 262, in aim.storage.container.Container.prev_key
  File "aim/storage/container.py", line 266, in aim.storage.container.Container.prev_key
  File "aim/storage/prefixview.py", line 189, in aim.storage.prefixview.PrefixView.prev_item
  File "aim/storage/rockscontainer.pyx", line 523, in aim.storage.rockscontainer.RocksContainer.prev_item
  File "src/aimrocks/lib_rocksdb.pyx", line 2344, in aimrocks.lib_rocksdb.BaseIterator.seek_for_prev
  File "src/aimrocks/lib_rocksdb.pyx", line 2348, in aimrocks.lib_rocksdb.BaseIterator.seek_for_prev
  File "src/aimrocks/lib_rocksdb.pyx", line 89, in aimrocks.lib_rocksdb.check_status
aimrocks.errors.RocksIOError: b'IO error: No such file or directory: While open a file for random read: /opt/aim/.aim/seqs/chunks/3c0d640c39134ef58aaab62c/000009.ldb: No such file or directory'
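
The final line of the traceback names both the chunk directory and the table file that could not be opened. A minimal sketch for pulling those two pieces out of a log line, using only the standard library, might look like the following. The helper name parse_missing_file and the MissingTableFile tuple are hypothetical, and treating the directory name under seqs/chunks as the run hash is an assumption based on the path in this report.

```python
import re
from pathlib import PurePosixPath
from typing import NamedTuple, Optional


class MissingTableFile(NamedTuple):
    """Pieces of the path reported in a RocksIOError message (hypothetical type)."""
    path: str
    run_hash: str   # assumption: the chunk directory name identifies the run
    file_name: str


# Matches the wording seen in the traceback above; other RocksDB versions
# may phrase the message differently.
_MISSING_FILE = re.compile(
    r"While open a file for random read: (?P<path>\S+?\.(?:ldb|sst)):"
)


def parse_missing_file(message: str) -> Optional[MissingTableFile]:
    """Return the missing file's path, parent directory name, and file name."""
    match = _MISSING_FILE.search(message)
    if match is None:
        return None
    path = PurePosixPath(match.group("path"))
    return MissingTableFile(str(path), path.parent.name, path.name)
```

Feeding it the last line of the log identifies which chunk directory to inspect before touching anything else in the repo.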

To reproduce

The problem is not consistently reproducible with the same procedure. The Aim repo is initialized on a shared location. The experiment runs in a container with the repo location mounted, and aim ui runs in a separate container using the image aimstack/aim:3.17.5.

Expected behavior

The UI should return all stored runs and metrics.

Environment

Additional context

What does the missing *.ldb file refer to? The user did not manually create or edit any files in the Aim repo; the data store is managed entirely by Aim.
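
One way to narrow this question down is to look at what actually sits in the run's chunk directory. The sketch below is a hypothetical stdlib-only helper (inspect_chunk_dir is not part of Aim or aimrocks). It assumes table files carry either the LevelDB-style .ldb extension or the .sst extension that RocksDB normally writes, and it reports whether a file with the same number exists under either extension.

```python
from pathlib import Path
from typing import Dict, List, Union

# Assumption: these are the only table-file extensions worth checking.
TABLE_EXTENSIONS = (".ldb", ".sst")


def inspect_chunk_dir(
    chunk_dir: Path, missing_name: str
) -> Dict[str, Union[bool, List[str]]]:
    """Summarize which table files exist in a chunk directory.

    missing_name is the file name from the error, e.g. "000009.ldb".
    """
    if not chunk_dir.is_dir():
        return {"dir_exists": False, "table_files": [], "same_number": []}

    # Collect every table file currently present, sorted for stable output.
    table_files = sorted(
        p.name
        for p in chunk_dir.iterdir()
        if p.is_file() and p.suffix in TABLE_EXTENSIONS
    )
    # Look for the same file number under any known extension.
    stem = Path(missing_name).stem
    same_number = [
        stem + ext for ext in TABLE_EXTENSIONS if stem + ext in table_files
    ]
    return {
        "dir_exists": True,
        "table_files": table_files,
        "same_number": same_number,
    }
```

For the path in this report, the call would be inspect_chunk_dir(Path("/opt/aim/.aim/seqs/chunks/3c0d640c39134ef58aaab62c"), "000009.ldb"). An empty same_number list would suggest the file is really gone rather than merely named differently.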

lkhphuc commented 1 year ago

Is there a way to manually recover from this? My database gets corrupted frequently, so I have to create manual backups very often. When this happens and my backup is somewhat outdated, I lose all the runs recorded since that backup.

In my case this happens very frequently. It takes as little as a single run written to a new repository from a Slurm job: when I then open the web UI from my dev machine, it crashes.

P.S.: I don't mean recovering the database for other completed runs, as in https://github.com/aimhubio/aim/issues/2869#issuecomment-1676612640. As I said, I already create a new repo for every run and manually move the run to a common one afterward. I want to recover that single unfinished run.
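
The manual backups described above could be scripted as a timestamped directory copy. This is only a sketch using the standard library: backup_repo and its parameters are hypothetical names, not an Aim feature, and it assumes no run is writing to the repo while the copy is made, since copying a store that is being written to may capture an inconsistent state.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Optional


def backup_repo(
    repo_dir: Path, backup_root: Path, now: Optional[datetime] = None
) -> Path:
    """Copy repo_dir into backup_root under a timestamped name.

    Assumption: nothing is writing to repo_dir during the copy.
    """
    if not repo_dir.is_dir():
        raise FileNotFoundError(f"repo directory not found: {repo_dir}")

    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    destination = backup_root / f"{repo_dir.name}-{stamp}"

    backup_root.mkdir(parents=True, exist_ok=True)
    # copytree refuses to overwrite an existing destination, so an older
    # backup with the same timestamp is never clobbered silently.
    shutil.copytree(repo_dir, destination)
    return destination
```

Calling backup_repo(Path("/opt/aim/.aim"), Path("/backups")) right after a job finishes would keep one snapshot per run, which limits how much can be lost between backups. It does not recover an unfinished run.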