Open twni2016 opened 2 years ago
@twni2016 Aim has a natural limitation that prevents writing to the same run from multiple parallel clients. It seems the ntasks argument runs parallel threads, which is causing the issue. Is there a way in SLURM to configure the workflow so that the aim.Run is initialized only once and then used as a shared resource between the threads?
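For reference, if all the tasks really did share a single run, the idea would look roughly like the sketch below. This is only an illustration: gating on SLURM_PROCID and the experiment name are assumptions about the setup, not something Aim requires beyond the usual Run calls.

```python
import os

from aim import Run

# Minimal sketch: let only SLURM task 0 own the run and do the logging,
# so a single process touches the repo lock. SLURM_PROCID is the task
# rank SLURM sets for each task of the job step.
rank = int(os.environ.get("SLURM_PROCID", "0"))
run = Run(experiment="shared-run-example") if rank == 0 else None  # experiment name is a placeholder

if run is not None:
    run["hparams"] = {"lr": 1e-3}        # example hyperparameters
    run.track(0.5, name="loss", step=0)  # example metric
```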
@alberttorosyan tagging you so you are aware of this thread.
@gorarakelyan Sorry, I don't quite get your point about initializing Aim once for all the tasks. In my case each task is an independent run, so shouldn't each one be initialized separately?
@twni2016 ah, my bad, for a moment I thought ntasks executes parallel threads which try to write to the same run.
Most probably this is purely related to the issue with the run locking mechanism; the fix is in progress now. A similar issue was reported last week as well. The plan is to release a patch fix in the coming days. I will let you know once the fix is shipped.
Hi @gorarakelyan, I found that the timeout error can also happen even with --ntasks 1 (a single task) for sbatch.
The error happens when multiple SLURM jobs are started simultaneously (the --ntasks value of each job does not matter); they then race for the lock.
I hope Aim can fix this issue, thanks!
Since the timeout error happens within https://github.com/aimhubio/aim/blob/main/aim/sdk/repo.py#L138, can we just set the timeout at https://github.com/aimhubio/aim/blob/main/aim/sdk/repo.py#L136 to a larger value to prevent the error?
This would be an easy workaround.
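A caller-side variant of the same idea, in case editing repo.py is not an option, is to retry Run creation with a random backoff so simultaneously started jobs stop colliding on the lock. This is only a sketch: create_run_with_retry is a hypothetical helper (not part of Aim), and the assumption that the softlock failure surfaces as filelock.Timeout is based on the wording of the error message.

```python
import random
import time

from aim import Run

try:
    # Assumption: the ".repo_lock.softlock could not be acquired" error is a
    # filelock.Timeout raised while Aim acquires the repo lock.
    from filelock import Timeout as LockTimeout
except ImportError:
    LockTimeout = Exception


def create_run_with_retry(max_attempts=5, **run_kwargs):
    """Hypothetical helper: retry Run() with random backoff so jobs that
    start at the same moment do not all hit the repo lock at once."""
    for attempt in range(1, max_attempts + 1):
        try:
            return Run(**run_kwargs)
        except LockTimeout:
            if attempt == max_attempts:
                raise
            # Sleep a random, growing amount so the competing jobs de-synchronize.
            time.sleep(random.uniform(1.0, 5.0) * attempt)


run = create_run_with_retry(experiment="slurm-job-example")  # experiment name is a placeholder
```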
@twni2016 thanks for sharing the details. Yes, the suggested workaround should work, but I guess it shouldn't be considered a sustainable solution. @alberttorosyan the latest optimizations wrt indexing should solve this issue as well, no?
I am trying to use Aim with Fairseq when running with DDP on multiple GPUs. I am facing the same issue. Is there a way to fix this?
🐛 Bug
When sbatch --ntasks n is used with n larger than 1, there is a race between multiple threads, each of which wants to create a run. Sometimes the race causes the error ".repo_lock.softlock could not be acquired".
To reproduce
Run the SLURM command sbatch --ntasks 2 for a Python script; it might cause an error when creating a run with run = Run().
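A hypothetical minimal script for each task, just to make the setup concrete (the file name, submission command, and tracked metric are placeholders):

```python
# repro.py -- submitted e.g. via: sbatch --ntasks 2 --wrap "python repro.py"
from aim import Run

run = Run()                          # each task creates its own run in the default repo
run.track(1.0, name="dummy_metric")  # minimal write so the run is materialized
```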
Expected behavior
No such error.
Environment