Open HoBeedzc opened 9 months ago
@HoBeedzc thanks for reporting this issue. As I understand, this happens only for the remote tracking server?
Yes, I am using the remote tracking server. For privacy considerations, I've obscured the IP address of the aim remote repository. The actual code is as follows:
import aim
aim.Run(run_hash='51031438759943878c6f9808', repo="aim:ip:port")
I've experimented with various methods, involving both local and remote repositories, and I've arrived at the following findings:
aim.Run.close() method) before utilizing the hash to fetch the run again.
Is there any solution for this? I have also been getting this error when I try to identify a run in a remote repository. I have tried closing and releasing the locks, but nothing seems to help.
Has this been fixed in later versions? I am using 3.17.5.
It looks to me like this function is timing out and the error is then handled incorrectly. Having a short timeout on a lock function is definitely a bug, since the whole idea is to wait until it's safe to acquire the lock.
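To illustrate the point about timeouts, here is a minimal sketch of a soft (file-existence) lock that keeps retrying until a caller-supplied timeout expires. It is not Aim's implementation; `acquire_soft_lock`, `release_soft_lock`, and `LockTimeout` are hypothetical names used only for this example, and it uses only the Python standard library.

```python
import os
import time


class LockTimeout(Exception):
    """Raised when the lock could not be acquired within the timeout."""


def acquire_soft_lock(lock_path, timeout=10.0, poll_interval=0.05):
    """Create lock_path exclusively, retrying until `timeout` seconds elapse.

    A negative timeout means "wait indefinitely".
    """
    deadline = None if timeout < 0 else time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if the file already exists,
            # which is what lets a plain file act as a lock.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return
        except FileExistsError:
            if deadline is not None and time.monotonic() >= deadline:
                raise LockTimeout(
                    f"could not acquire {lock_path} within {timeout}s"
                ) from None
            time.sleep(poll_interval)


def release_soft_lock(lock_path):
    """Release the lock by removing the lock file."""
    os.remove(lock_path)
```

With a timeout that is too short, a second caller gets `LockTimeout` even though the first holder would have released the lock a moment later. A caller that must not fail can pass a larger or negative timeout. Note that a soft lock left behind by a crashed process stays in place until the file is removed.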
This also happens for us when running torch-lightning integration and parallel training jobs.
Digging deeper into it, I no longer think it's an issue with gRPC, but with the soft lock instead. For some reason it uses a soft lock, even though this function, when called on the repo location, returns False:
In [3]: FileSystemInspector.needs_soft_lock("/aim/repo/locks")
Out[3]: False
In [4]: FileSystemInspector.needs_soft_lock("/aim/")
Out[4]: False
In [5]: FileSystemInspector.needs_soft_lock("/aim")
The lock file /aim is on a filesystem of type `overlay` (device id: 207). Using soft file locks to avoid potential data corruption.
Out[5]: True
In [6]: FileSystemInspector.needs_soft_lock("/aim/repo")
Out[6]: False
In [7]:
Do you really want to exit ([y]/n)? y
root@bishop-aim-7dd75f756f-vtqnp:/aim/repo/.aim/locks# ls
0b988510edc143148283748b.softlock 6390468e64c24d68b30e9198.softlock 73f944123d904a048019f255.softlock index
root@bishop-aim-7dd75f756f-vtqnp:/aim/repo/.aim/locks#
There might be a bug somewhere, either in the softlock mechanism itself or in detecting the correct lock type to use. Notice that /aim is overlay: this server runs on top of Kubernetes, so it uses the container fs on root. There is a volume mounted to /aim that is ext4, so dirs under /aim should be able to use locks.
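As a way to cross-check which filesystem a lock directory lives on, here is a small sketch that resolves a path against a mount table by longest matching mount point. It is not how `FileSystemInspector.needs_soft_lock` works. `fs_type_for` is a hypothetical helper, and the `MOUNTS` table is an assumption that mirrors the setup described above (overlay on `/`, an ext4 volume on `/aim`).

```python
import posixpath


def fs_type_for(path, mounts):
    """Return the filesystem type of the mount that contains `path`.

    `mounts` is an iterable of (mount_point, fs_type) pairs; the longest
    mount point that is a path prefix of `path` wins.
    """
    path = posixpath.normpath(path)
    best_point, best_type = None, None
    for mount_point, fs_type in mounts:
        mount_point = posixpath.normpath(mount_point)
        # Compare on a path-component boundary so "/aim" does not match "/aimless".
        prefix = mount_point if mount_point.endswith("/") else mount_point + "/"
        if path == mount_point or path.startswith(prefix):
            if best_point is None or len(mount_point) > len(best_point):
                best_point, best_type = mount_point, fs_type
    return best_type


# Assumed mount table for illustration only.
MOUNTS = [("/", "overlay"), ("/aim", "ext4")]
```

Under this assumed table, `fs_type_for("/aim/repo/.aim/locks", MOUNTS)` resolves to the ext4 volume rather than the container's overlay root, which matches the expectation that directories under /aim can use regular locks. On a real Linux host the pairs could be read from `/proc/mounts` (second and third fields of each line).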
@inc0 Does the fix you linked work for all cases, or only when PyTorch Lightning is used?
It'll only fix Lightning, but you can add a similar parameter to your function call; that should fix your case.
🐛 Bug
I encountered a bug while using AIM. When attempting to use aim.Run to locate a run by its hash, I encountered the following error:
To reproduce
It can be reproduced by simply running the following code (I have ensured that this code has already been placed in the target repository).
Expected behavior
Locate the run with the target hash, as described in the documentation.
Environment
Additional context