aimhubio / aim

Aim 💫 — An easy-to-use & supercharged open-source experiment tracker.
https://aimstack.io
Apache License 2.0

`.repo_lock.softlock` could not be acquired when multiple slurm jobs are started #2279

Open twni2016 opened 2 years ago

twni2016 commented 2 years ago

🐛 Bug

When running `sbatch --ntasks n` with n larger than 1, there is a race between multiple tasks, each of which wants to create a run. Sometimes the race causes a `.repo_lock.softlock` could not be acquired error.

To reproduce

Run the slurm command `sbatch --ntasks 2` for a Python script, and it might cause an error when creating a run with `run = Run()`:

  File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/run.py", line 287, in __init__
    super().__init__(run_hash, repo=repo, read_only=read_only)
  File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/base_run.py", line 31, in __init__
    super().__init__(run_hash, repo=repo, read_only=read_only)
  File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/base_run.py", line 31, in __init__
    self.repo = get_repo(repo)
  File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/repo_utils.py", line 25, in get_repo
    self.repo = get_repo(repo)
  File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/repo_utils.py", line 25, in get_repo
    repo = Repo.from_path(repo)
  File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/repo.py", line 209, in from_path
    repo = Repo.from_path(repo)
  File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/repo.py", line 209, in from_path
    repo = Repo(path, read_only=read_only, init=init)
  File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/repo.py", line 138, in __init__
    with self.lock():
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.10.2/lib/python3.10/contextlib.py", line 135, in __enter__
    repo = Repo(path, read_only=read_only, init=init)
  File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/repo.py", line 138, in __init__
    return next(self.gen)
  File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/repo.py", line 165, in lock
    with self.lock():
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.10.2/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/repo.py", line 165, in lock
    self._lock.acquire()
  File "/home/twni2016/env/lib/python3.10/site-packages/filelock/_api.py", line 183, in acquire
    self._lock.acquire()
  File "/home/twni2016/env/lib/python3.10/site-packages/filelock/_api.py", line 183, in acquire
    raise Timeout(self._lock_file)
    raise Timeout(self._lock_file)
filelock._error.Timeout: The file lock '*/.aim/.repo_lock.softlock' could not be acquired.
filelock._error.Timeout: The file lock '*/.aim/.repo_lock.softlock' could not be acquired.
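
For reference, a minimal reproduction looks roughly like the following (the script name and submission line are hypothetical; any script that creates a `Run` per task against the same `.aim` repo should behave the same), submitted e.g. with `sbatch --ntasks 2 --wrap "srun python train.py"`:

```python
# train.py -- hypothetical minimal script; each slurm task executes it and
# tries to create its own Aim run inside the same .aim repository, so all
# tasks briefly compete for .aim/.repo_lock.softlock at startup.
from aim import Run

run = Run()  # this is where the lock timeout can occur
run.track(1.0, name='dummy_metric', step=0)
run.close()
```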

Expected behavior

The runs should be created without a lock timeout error.

Environment

gorarakelyan commented 2 years ago

@twni2016 Aim has a natural limitation that prevents writing to the same run from multiple parallel clients. It seems the `ntasks` argument runs parallel threads, which is causing the issue. Is there a way in slurm to configure the workflow to initialize the `aim.Run` only once and then use it as a shared resource between the threads? A sketch of that idea is shown below.
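
For illustration, if the tasks did need to share one run, something along these lines (relying on the `SLURM_PROCID` environment variable; this is an assumption about the setup, not an official Aim recipe) would initialize the `Run` in task 0 only:

```python
# Hypothetical sketch: only slurm task 0 creates the Aim Run, so a single
# process per job touches the repo lock; the other tasks skip tracking.
import os
from aim import Run

task_id = int(os.environ.get('SLURM_PROCID', '0'))
run = Run() if task_id == 0 else None

if run is not None:
    run.track(0.5, name='loss', step=0)
```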

@alberttorosyan tagging you so you are aware of this thread.

twni2016 commented 2 years ago

@gorarakelyan Sorry, I don't quite get your point about initializing Aim once for all the tasks. In my case each task is an independent run, so shouldn't each one be initialized separately?

gorarakelyan commented 2 years ago

@twni2016 ah, my bad, for a moment I thought ntasks executes parallel threads which try to write to the same run.

Most probably this is purely related to an issue with the run locking mechanism; a fix is in progress now. A similar issue was reported last week as well. The plan is to release a patch fix in the coming days. I will let you know once the fix is shipped.

twni2016 commented 2 years ago

Hi @gorarakelyan, I found that even with `--ntasks 1` (a single task) for sbatch, the timeout error can still happen.

The error happens when multiple slurm jobs are started simultaneously (the `--ntasks` value of each slurm job does not matter); they then race for the lock.

I hope Aim can fix this issue, thanks!

twni2016 commented 1 year ago

Since the timeout error happens within https://github.com/aimhubio/aim/blob/main/aim/sdk/repo.py#L138, can we just set the timeout at https://github.com/aimhubio/aim/blob/main/aim/sdk/repo.py#L136 to a larger value to prevent the error?

This would be an easy workaround.
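
Until a proper fix lands, another user-side sketch of the same idea is to retry the `Run()` creation when the lock is busy (the retry count and sleep times below are arbitrary assumptions, not values recommended by Aim):

```python
# Hypothetical workaround: retry Run() creation instead of failing on the
# first filelock Timeout raised while .repo_lock.softlock is held elsewhere.
import random
import time

from aim import Run
from filelock import Timeout


def create_run_with_retries(max_attempts=10):
    for attempt in range(max_attempts):
        try:
            return Run()
        except Timeout:
            # Another job holds the repo lock; back off and try again.
            time.sleep(random.uniform(1.0, 5.0) * (attempt + 1))
    raise RuntimeError('could not acquire the Aim repo lock after retries')


run = create_run_with_retries()
```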

gorarakelyan commented 1 year ago

@twni2016 thanks for sharing the details. Yes, the suggested workaround should work, but I guess it shouldn't be considered a sustainable solution. @alberttorosyan the latest optimizations w.r.t. indexing should solve this issue as well, no?

harishankar-gopalan commented 1 year ago

I am trying to use Aim with Fairseq when running with DDP on multiple GPUs. I am facing the same issue. Is there a way to fix this?
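
In case it helps while the locking fix is pending, a common pattern (a sketch assuming plain PyTorch DDP; `train_one_step` is a placeholder, and this is not Fairseq-specific advice) is to create the `Run` on rank 0 only and skip tracking on the other ranks:

```python
# Hypothetical DDP sketch: only the rank-0 process creates the Aim Run,
# so one process per job contends for .aim/.repo_lock.softlock.
import torch.distributed as dist
from aim import Run

is_rank_zero = (not dist.is_initialized()) or dist.get_rank() == 0
run = Run(experiment='ddp-training') if is_rank_zero else None

for step in range(100):
    loss = train_one_step()  # placeholder for the actual training step
    if run is not None:
        run.track(loss, name='loss', step=step)
```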