aimhubio / aim

Aim 💫 — An easy-to-use & supercharged open-source experiment tracker.
https://aimstack.io
Apache License 2.0
4.94k stars 299 forks

Bug with aim.Run when using hash to locate a run #2999

Open HoBeedzc opened 9 months ago

HoBeedzc commented 9 months ago

🐛 Bug

I ran into a bug while using Aim. When attempting to use aim.Run to locate a run by its hash, I get the following error:

>>> aim.Run(run_hash='51031438759943878c6f9808')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/me/miniconda3/lib/python3.11/site-packages/aim/ext/exception_resistant.py", line 70, in wrapper
    _SafeModeConfig.exception_callback(e, func)
  File "/home/me/miniconda3/lib/python3.11/site-packages/aim/ext/exception_resistant.py", line 47, in reraise_exception
    raise e
  File "/home/me/miniconda3/lib/python3.11/site-packages/aim/ext/exception_resistant.py", line 68, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/me/miniconda3/lib/python3.11/site-packages/aim/sdk/run.py", line 828, in __init__
    super().__init__(run_hash, repo=repo, read_only=read_only, experiment=experiment, force_resume=force_resume)
  File "/home/me/miniconda3/lib/python3.11/site-packages/aim/sdk/run.py", line 276, in __init__
    super().__init__(run_hash, repo=repo, read_only=read_only, force_resume=force_resume)
  File "/home/me/miniconda3/lib/python3.11/site-packages/aim/sdk/base_run.py", line 50, in __init__
    self._lock.lock(force=force_resume)
  File "/home/me/miniconda3/lib/python3.11/site-packages/aim/storage/lock_proxy.py", line 38, in lock
    return self._rpc_client.run_instruction(self._hash, self._handler, 'lock', (force,))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/me/miniconda3/lib/python3.11/site-packages/aim/ext/transport/client.py", line 260, in run_instruction
    return self._run_read_instructions(queue_id, resource, method, args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/me/miniconda3/lib/python3.11/site-packages/aim/ext/transport/client.py", line 285, in _run_read_instructions
    raise_exception(status_msg.header.exception)
  File "/home/me/miniconda3/lib/python3.11/site-packages/aim/ext/transport/message_utils.py", line 76, in raise_exception
    raise exception(*args) if args else exception()
                                        ^^^^^^^^^^^
TypeError: Timeout.__init__() missing 1 required positional argument: 'lock_file'
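For context on what this TypeError masks: the re-raise helper in message_utils.py rebuilds the server-side exception from serialized args, and when no args survive it calls the exception class with no arguments. filelock's Timeout requires a lock_file argument, so the re-raise itself blows up and hides the original lock timeout. A minimal sketch of the mechanism (using stand-in classes, not aim's actual code):

```python
# Stand-in for filelock.Timeout, whose __init__ requires a lock_file argument.
class Timeout(Exception):
    def __init__(self, lock_file):
        self.lock_file = lock_file
        super().__init__(lock_file)

# Stand-in for the generic re-raise in aim/ext/transport/message_utils.py:
# if the original constructor args are lost in transit, exception() is called
# with no arguments and raises TypeError instead of the real Timeout.
def raise_exception(exception, args=()):
    raise exception(*args) if args else exception()

try:
    raise_exception(Timeout, ())
except TypeError as e:
    print(type(e).__name__)  # TypeError: the original Timeout is masked
```

So the user-visible TypeError is a symptom; the underlying event is a lock acquisition timing out on the server side.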

To reproduce

It can be reproduced by running the following code (I have verified that a run with this hash already exists in the target repository).

import aim
aim.Run(run_hash='51031438759943878c6f9808')

Expected behavior

Locate the run with the target hash, as described in the documentation.

Environment

Additional context

alberttorosyan commented 9 months ago

@HoBeedzc thanks for reporting this issue. As I understand it, this happens only with the remote tracking server?

HoBeedzc commented 9 months ago

Yes, I am using the remote tracking server. For privacy considerations, I've obscured the IP address of the aim remote repository. The actual code is as follows:

import aim
aim.Run(run_hash='51031438759943878c6f9808', repo="aim:ip:port")

I've experimented with various methods, involving both local and remote repositories, and I've arrived at the following findings:

sandeep-biddala commented 5 months ago

Is there any solution for this? I have also been getting this error when I try to identify a run in a remote repository. I have tried closing and releasing the locks, but nothing seems to help.

Has this been fixed in later versions? I am using 3.17.5

inc0 commented 5 months ago

This looks to me like the lock call is timing out and the resulting error is then handled incorrectly. Having a short timeout on the lock function is definitely a bug, since the whole idea is to wait until it's safe to acquire the lock.
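The distinction being made here can be shown with a plain threading.Lock (a stand-in for illustration; aim actually uses on-disk file locks via filelock): a short timeout gives up almost immediately, while a blocking acquire waits for the holder to release.

```python
import threading

lock = threading.Lock()
lock.acquire()  # simulate another process currently holding the lock

# A short timeout gives up and reports failure instead of waiting.
got_it = lock.acquire(timeout=0.05)
print(got_it)  # False: the lock is still held

# Release from another thread shortly, then block until it is actually free.
threading.Timer(0.1, lock.release).start()
got_it = lock.acquire()  # blocking acquire waits for the holder
print(got_it)  # True
lock.release()
```

With parallel training jobs contending for the same run, a short timeout makes transient contention look like a hard failure.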

inc0 commented 5 months ago

This also happens for us when running the torch-lightning integration with parallel training jobs.

inc0 commented 5 months ago

Digging deeper into it. I actually no longer think it's an issue with grpc; it's the softlock instead. For some reason it uses a softlock, yet this function, when called on the repo location, returns False:

In [3]: FileSystemInspector.needs_soft_lock("/aim/repo/locks")
Out[3]: False

In [4]: FileSystemInspector.needs_soft_lock("/aim/")
Out[4]: False

In [5]: FileSystemInspector.needs_soft_lock("/aim")
The lock file /aim is on a filesystem of type `overlay` (device id: 207). Using soft file locks to avoid potential data corruption.
Out[5]: True

In [6]: FileSystemInspector.needs_soft_lock("/aim/repo")
Out[6]: False

In [7]:
Do you really want to exit ([y]/n)? y
root@bishop-aim-7dd75f756f-vtqnp:/aim/repo/.aim/locks# ls
0b988510edc143148283748b.softlock  6390468e64c24d68b30e9198.softlock  73f944123d904a048019f255.softlock  index
root@bishop-aim-7dd75f756f-vtqnp:/aim/repo/.aim/locks#

There might be a bug somewhere, either in the softlock mechanism itself or in detecting the correct lock type to use. Notice that /aim is overlay: this server runs on top of Kubernetes, so the root uses the container fs, and there is a volume mounted at /aim that is ext4, so directories under /aim should be able to use regular locks.
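The kind of check being done above can be reproduced by hand. This is a Linux-only sketch that reads /proc/mounts to find the filesystem type of the mount covering a path; it is not aim's FileSystemInspector, just an illustration of how a path and its subdirectories can resolve to different mounts (and hence different lock heuristics):

```python
import os

def fs_type(path):
    """Return the filesystem type of the mount covering `path` (Linux-only sketch)."""
    path = os.path.realpath(path)
    best_mount, best_type = "", "unknown"
    with open("/proc/mounts") as mounts:
        for line in mounts:
            mount_point, fstype = line.split()[1:3]
            # The longest mount point that is a prefix of the path wins.
            if (path == mount_point
                    or path.startswith(mount_point.rstrip("/") + "/")) \
                    and len(mount_point) > len(best_mount):
                best_mount, best_type = mount_point, fstype
    return best_type
```

On a setup like the one described, fs_type("/") would report overlay (the container filesystem) while a path on the mounted volume would report ext4, which is why the lock-type decision depends heavily on exactly which path gets inspected.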

sandeep-biddala commented 5 months ago

@inc0 Does the fix you linked work for all cases, or only when pytorch lightning is used?

inc0 commented 5 months ago

It'll only fix Lightning, but you can add a similar parameter to your own function call; that should fix your case.