aimhubio / aim

Aim 💫 — An easy-to-use & supercharged open-source experiment tracker.
https://aimstack.io
Apache License 2.0
4.94k stars 299 forks

Bug with aim.Run when using hash to locate a run #2999

Open HoBeedzc opened 9 months ago

HoBeedzc commented 9 months ago

🐛 Bug

I ran into a bug while using Aim. When attempting to use aim.Run to locate a run by its hash, I get the following error:

>>> aim.Run(run_hash='51031438759943878c6f9808')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/me/miniconda3/lib/python3.11/site-packages/aim/ext/exception_resistant.py", line 70, in wrapper
    _SafeModeConfig.exception_callback(e, func)
  File "/home/me/miniconda3/lib/python3.11/site-packages/aim/ext/exception_resistant.py", line 47, in reraise_exception
    raise e
  File "/home/me/miniconda3/lib/python3.11/site-packages/aim/ext/exception_resistant.py", line 68, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/me/miniconda3/lib/python3.11/site-packages/aim/sdk/run.py", line 828, in __init__
    super().__init__(run_hash, repo=repo, read_only=read_only, experiment=experiment, force_resume=force_resume)
  File "/home/me/miniconda3/lib/python3.11/site-packages/aim/sdk/run.py", line 276, in __init__
    super().__init__(run_hash, repo=repo, read_only=read_only, force_resume=force_resume)
  File "/home/me/miniconda3/lib/python3.11/site-packages/aim/sdk/base_run.py", line 50, in __init__
    self._lock.lock(force=force_resume)
  File "/home/me/miniconda3/lib/python3.11/site-packages/aim/storage/lock_proxy.py", line 38, in lock
    return self._rpc_client.run_instruction(self._hash, self._handler, 'lock', (force,))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/me/miniconda3/lib/python3.11/site-packages/aim/ext/transport/client.py", line 260, in run_instruction
    return self._run_read_instructions(queue_id, resource, method, args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/me/miniconda3/lib/python3.11/site-packages/aim/ext/transport/client.py", line 285, in _run_read_instructions
    raise_exception(status_msg.header.exception)
  File "/home/me/miniconda3/lib/python3.11/site-packages/aim/ext/transport/message_utils.py", line 76, in raise_exception
    raise exception(*args) if args else exception()
                                        ^^^^^^^^^^^
TypeError: Timeout.__init__() missing 1 required positional argument: 'lock_file'
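For context on what this TypeError masks: the re-raise helper in message_utils.py rebuilds the server-side exception from serialized args, and when no args survive it calls the exception class with no arguments. filelock's Timeout requires a lock_file argument, so the re-raise itself blows up and hides the original lock timeout. A minimal sketch of the mechanism (using stand-in classes, not aim's actual code):

```python
# Stand-in for filelock.Timeout, whose __init__ requires a lock_file argument.
class Timeout(Exception):
    def __init__(self, lock_file):
        self.lock_file = lock_file
        super().__init__(lock_file)

# Stand-in for the generic re-raise in aim/ext/transport/message_utils.py:
# if the original constructor args are lost in transit, exception() is called
# with no arguments and raises TypeError instead of the real Timeout.
def raise_exception(exception, args=()):
    raise exception(*args) if args else exception()

try:
    raise_exception(Timeout, ())
except TypeError as e:
    print(type(e).__name__)  # TypeError: the original Timeout is masked
```

So the user-visible TypeError is a symptom; the underlying event is a lock acquisition timing out on the server side.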

To reproduce

It can be reproduced by running the following code (I have verified that a run with this hash already exists in the target repository).

import aim
aim.Run(run_hash='51031438759943878c6f9808')

Expected behavior

Locate the run with the target hash, as described in the documentation.

Environment

Additional context

alberttorosyan commented 9 months ago

@HoBeedzc thanks for reporting this issue. As I understand it, this happens only with the remote tracking server?

HoBeedzc commented 9 months ago

Yes, I am using the remote tracking server. For privacy considerations, I've obscured the IP address of the aim remote repository. The actual code is as follows:

import aim
aim.Run(run_hash='51031438759943878c6f9808', repo="aim:ip:port")

I've experimented with various methods, involving both local and remote repositories, and I've arrived at the following findings:

sandeep-biddala commented 5 months ago

Is there any solution for this? I have also been getting this error when I try to identify a run in a remote repository. I have tried closing and releasing the locks, but nothing seems to help.

Has this been fixed in later versions? I am using 3.17.5

inc0 commented 5 months ago

This looks to me like the lock call is timing out and the resulting error is then handled incorrectly. Having a short timeout on the lock function is definitely a bug, since the whole idea is to wait until it's safe to acquire the lock.
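The distinction being made here can be shown with a plain threading.Lock (a stand-in for illustration; aim actually uses on-disk file locks via filelock): a short timeout gives up almost immediately, while a blocking acquire waits for the holder to release.

```python
import threading

lock = threading.Lock()
lock.acquire()  # simulate another process currently holding the lock

# A short timeout gives up and reports failure instead of waiting.
got_it = lock.acquire(timeout=0.05)
print(got_it)  # False: the lock is still held

# Release from another thread shortly, then block until it is actually free.
threading.Timer(0.1, lock.release).start()
got_it = lock.acquire()  # blocking acquire waits for the holder
print(got_it)  # True
lock.release()
```

With parallel training jobs contending for the same run, a short timeout makes transient contention look like a hard failure.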

inc0 commented 5 months ago

This also happens for us when running the torch-lightning integration with parallel training jobs.

inc0 commented 5 months ago

Digging deeper into it. I actually no longer think it's an issue with grpc; it's the softlock instead. For some reason it uses a softlock, yet this function, when called on the repo location, returns False:

In [3]: FileSystemInspector.needs_soft_lock("/aim/repo/locks")
Out[3]: False

In [4]: FileSystemInspector.needs_soft_lock("/aim/")
Out[4]: False

In [5]: FileSystemInspector.needs_soft_lock("/aim")
The lock file /aim is on a filesystem of type `overlay` (device id: 207). Using soft file locks to avoid potential data corruption.
Out[5]: True

In [6]: FileSystemInspector.needs_soft_lock("/aim/repo")
Out[6]: False

In [7]:
Do you really want to exit ([y]/n)? y
root@bishop-aim-7dd75f756f-vtqnp:/aim/repo/.aim/locks# ls
0b988510edc143148283748b.softlock  6390468e64c24d68b30e9198.softlock  73f944123d904a048019f255.softlock  index
root@bishop-aim-7dd75f756f-vtqnp:/aim/repo/.aim/locks#

There might be a bug somewhere, either in the softlock mechanism itself or in detecting the correct lock type to use. Notice that /aim is overlay: this server runs on top of Kubernetes, so the root uses the container fs, and there is a volume mounted at /aim that is ext4, so directories under /aim should be able to use regular locks.
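The kind of check being done above can be reproduced by hand. This is a Linux-only sketch that reads /proc/mounts to find the filesystem type of the mount covering a path; it is not aim's FileSystemInspector, just an illustration of how a path and its subdirectories can resolve to different mounts (and hence different lock heuristics):

```python
import os

def fs_type(path):
    """Return the filesystem type of the mount covering `path` (Linux-only sketch)."""
    path = os.path.realpath(path)
    best_mount, best_type = "", "unknown"
    with open("/proc/mounts") as mounts:
        for line in mounts:
            mount_point, fstype = line.split()[1:3]
            # The longest mount point that is a prefix of the path wins.
            if (path == mount_point
                    or path.startswith(mount_point.rstrip("/") + "/")) \
                    and len(mount_point) > len(best_mount):
                best_mount, best_type = mount_point, fstype
    return best_type
```

On a setup like the one described, fs_type("/") would report overlay (the container filesystem) while a path on the mounted volume would report ext4, which is why the lock-type decision depends heavily on exactly which path gets inspected.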

sandeep-biddala commented 5 months ago

@inc0 Does the fix you linked work for all cases, or only when pytorch lightning is used?

inc0 commented 5 months ago

It'll only fix Lightning, but you can add a similar parameter to your own function call; that should fix your case.