RamenDR / ramen

drenv cache fetch can be racy and fail if one runner deletes the same named .tmp file #1386

Closed ShyamsundarR closed 2 months ago

ShyamsundarR commented 2 months ago

For example, the following run failed because the .tmp file was not found (this occurred twice in about 6-7 runs). Looking at the code, both runners use the same tmp file location, so in a race one runner can remove the tmp file before the other runner attempts to rename it, and that rename then fails:

2024-05-10 00:30:50,091 INFO    [cephfs] Starting environment
2024-05-10 00:30:50,958 INFO    [dr2] Starting minikube cluster
2024-05-10 00:30:50,978 INFO    [dr1] Starting minikube cluster
2024-05-10 00:31:00,582 INFO    [dr1] Cluster started in 9.60 seconds
2024-05-10 00:31:00,582 INFO    [dr1] Waiting for fresh status
2024-05-10 00:31:00,676 INFO    [dr2] Cluster started in 9.72 seconds
2024-05-10 00:31:00,676 INFO    [dr2] Waiting for fresh status
2024-05-10 00:31:30,582 INFO    [dr1] Looking up failed deployments
2024-05-10 00:31:30,677 INFO    [dr2] Looking up failed deployments
2024-05-10 00:31:36,110 INFO    [dr1/0] Running addons/rook-cephfs/start
2024-05-10 00:31:36,240 INFO    [dr2/0] Running addons/rook-cephfs/start
2024-05-10 00:31:36,966 ERROR   Command failed
Traceback (most recent call last):
  File "/home/.../go/src/github.com/ramendr/ramen/test/drenv/__main__.py", line 44, in main
    args.func(args)
  File "/home/.../go/src/github.com/ramendr/ramen/test/drenv/__main__.py", line 216, in start
    execute(
  File "/home/.../go/src/github.com/ramendr/ramen/test/drenv/__main__.py", line 306, in execute
    f.result()
  File "/usr/lib64/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/usr/lib64/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/.../go/src/github.com/ramendr/ramen/test/drenv/__main__.py", line 338, in start_cluster
    execute(
  File "/home/.../go/src/github.com/ramendr/ramen/test/drenv/__main__.py", line 306, in execute
    f.result()
  File "/usr/lib64/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/usr/lib64/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/.../go/src/github.com/ramendr/ramen/test/drenv/__main__.py", line 497, in run_worker
    run_addon(addon, worker["name"], hooks=hooks, allow_failure=allow_failure)
  File "/home/.../go/src/github.com/ramendr/ramen/test/drenv/__main__.py", line 528, in run_addon
    run_hook(hook, addon["args"], name, allow_failure=allow_failure)
  File "/home/.../go/src/github.com/ramendr/ramen/test/drenv/__main__.py", line 539, in run_hook
    run(hook, *args, name=name)
  File "/home/.../go/src/github.com/ramendr/ramen/test/drenv/__main__.py", line 554, in run
    for line in commands.watch(*cmd):
  File "/home/.../go/src/github.com/ramendr/ramen/test/drenv/commands.py", line 187, in watch
    raise Error(args, error, exitcode=p.returncode)
drenv.commands.Error: Command failed:
   command: ('addons/rook-cephfs/start', 'dr1')
   exitcode: 1
   error:
      Traceback (most recent call last):
        File "/home/.../go/src/github.com/ramendr/ramen/test/addons/rook-cephfs/start", line 46, in <module>
          deploy(cluster)
        File "/home/.../go/src/github.com/ramendr/ramen/test/addons/rook-cephfs/start", line 17, in deploy
          cache.fetch(".", path)
        File "/home/.../go/src/github.com/ramendr/ramen/test/drenv/cache.py", line 28, in fetch
          os.rename(tmp, dest)
      FileNotFoundError: [Errno 2] No such file or directory: '/home/.../.cache/drenv/addons/rook-cephfs.yaml.tmp' -> '/home/.../.cache/drenv/addons/rook-cephfs.yaml'
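
The suspected interleaving can be replayed sequentially. This is a minimal sketch, not the drenv code: it assumes only what the traceback shows (a shared `<dest>.tmp` path and an `os.rename(tmp, dest)` call), and the function name `simulate_shared_tmp_race` is hypothetical.

```python
import os
import tempfile


def simulate_shared_tmp_race(cache_dir):
    """Replay one interleaving of two fetches that share a single .tmp path.

    Returns the exception raised by the second rename, or None if it succeeded.
    """
    dest = os.path.join(cache_dir, "rook-cephfs.yaml")
    tmp = dest + ".tmp"  # assumption: both runners derive the same temporary name

    # Both runners write their copy to the same temporary file.
    for runner in ("dr1", "dr2"):
        with open(tmp, "w") as f:
            f.write(f"fetched by {runner}\n")

    # The first runner renames the temporary file, which removes the shared path.
    os.rename(tmp, dest)

    # The second runner now tries to rename a file that no longer exists.
    try:
        os.rename(tmp, dest)
    except FileNotFoundError as e:
        return e
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as cache_dir:
        print(repr(simulate_shared_tmp_race(cache_dir)))
```

In a real run the ordering depends on timing, which would be consistent with the failure showing up only in some runs.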

The env YAML used was:

---
name: "cephfs"

templates:
  - name: "dr-cluster"
    driver: "$vm"
    container_runtime: containerd
    network: "$network"
    memory: "6g"
    extra_disks: 1
    disk_size: "50g"
    workers:
      - addons:
          - name: rook-cephfs

profiles:
  - name: "dr1"
    template: "dr-cluster"
  - name: "dr2"
    template: "dr-cluster"

nirs commented 2 months ago

The expected use case is to refresh the cache from another process, so the actual job should never need to refresh the cache. But I think we can avoid the race by using a unique temporary file, so two concurrent fetches will not break each other.
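
One way the unique temporary file idea could look is sketched below. This is an illustration under assumptions, not the actual fix: `fetch_atomic` and its `data` parameter are hypothetical stand-ins for whatever `cache.fetch` produces, and only the standard library calls `tempfile.mkstemp` and `os.replace` are relied on.

```python
import os
import tempfile


def fetch_atomic(dest, data):
    """Write data to dest through a uniquely named temporary file.

    Hypothetical helper: each caller gets its own temporary path, so a
    concurrent caller can never rename or remove it.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest))
    os.makedirs(dest_dir, exist_ok=True)

    # mkstemp creates the file with a unique name in the destination
    # directory, so the final rename stays on the same filesystem.
    fd, tmp = tempfile.mkstemp(
        dir=dest_dir,
        prefix=os.path.basename(dest) + ".",
        suffix=".tmp",
    )
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
        # os.replace overwrites an existing dest, so the last writer wins
        # instead of failing.
        os.replace(tmp, dest)
    except BaseException:
        # Clean up only our own temporary file.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
```

With this shape, two concurrent fetches of the same addon both succeed and the cache ends up with whichever copy was renamed last, which is fine when both fetches produce the same content.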