allegroai / clearml-agent

ClearML Agent - ML-Ops made easy. ML-Ops scheduler & orchestration solution
https://clear.ml/docs/
Apache License 2.0

error: could not write config file /root/.gitconfig: Device or resource busy - running clearml-agent in docker mode #193

Open AH-Merii opened 8 months ago

AH-Merii commented 8 months ago

Description

When clearml-agent executes a task inside a Docker container, operations that attempt to write to the `.gitconfig` file fail. Specifically, the command `git config --global --replace-all safe.directory '*'` fails with the error `could not write config file /root/.gitconfig: Device or resource busy`. The failure persists even though manual access, read, and write tests against `/root/.gitconfig` succeed when run inside the container.

The failure seems to occur only when clearml-agent runs tasks automatically, which suggests a problem with how file access or locking is handled in the Docker containers that clearml-agent orchestrates.
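
One plausible explanation, not confirmed anywhere in this thread: if `/root/.gitconfig` is a single file bind-mounted from the host, in-place reads and writes work, but git updates its config by writing a `.gitconfig.lock` file and renaming it over the original, and `rename(2)` onto a mount point fails with EBUSY ("Device or resource busy"). A minimal sketch for checking this from inside the container by reading `/proc/self/mountinfo` (the helper name `is_mount_point` is hypothetical):

```shell
# Return 0 if the given path is listed as a mount point in a
# mountinfo-format file. Field 5 of each line is the mount point.
# The second argument is optional and defaults to this process's mountinfo.
is_mount_point() {
    target=$1
    mountinfo=${2:-/proc/self/mountinfo}
    awk -v t="$target" '$5 == t { found = 1 } END { exit !found }' "$mountinfo" 2>/dev/null
}

if is_mount_point /root/.gitconfig; then
    echo "/root/.gitconfig is a mount point; renaming a lock file over it would fail"
else
    echo "/root/.gitconfig is not a mount point"
fi
```

If the file does turn out to be a mount point, that would be consistent with manual write tests succeeding while `git config --global` fails.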

Steps to Reproduce

  1. Run a clearml-agent task that requires Git operations inside a Docker container.
  2. The task fails when the agent tries to configure Git globally to treat all directories as safe, using the command `git config --global --replace-all safe.directory '*'`.

Additional Context

Collect all environment variables starting with CLEARML and join them with a comma

CLEARML_ENV_VARS=$(env | grep ^CLEARML | cut -d '=' -f 1 | tr '\n' ',' | sed 's/,$//')

Set the CLEARML_AGENT_DOCKER_ARGS_HIDE_ENV variable with the collected names

export CLEARML_AGENT_DOCKER_ARGS_HIDE_ENV=$CLEARML_ENV_VARS
export CLEARML_WORKER_NAME=""
export CLEARML_WORKER_ID=""
export CLEARML_AGENT_EXTRA_DOCKER_ARGS=""

- We pass the PAT (personal access token) to the agent through the `CLEARML_AGENT_GIT_PASS` environment variable.
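
The log below ends with `could not read Username for 'https://github.com'`, which suggests git had no credentials at the point where the fetch ran. A small sketch for checking, in the environment where the fetch runs, whether both `CLEARML_AGENT_GIT_USER` and `CLEARML_AGENT_GIT_PASS` are set, without printing their values. The helper name `check_git_credentials` is hypothetical, and whether a missing username is actually the cause here is an assumption:

```shell
# Report whether each credential variable is set, without echoing its value.
# Returns non-zero if at least one of them is empty or unset.
check_git_credentials() {
    missing=0
    for name in CLEARML_AGENT_GIT_USER CLEARML_AGENT_GIT_PASS; do
        # Indirect lookup of the variable named by $name (POSIX sh compatible).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set"
            missing=1
        else
            echo "$name is set"
        fi
    done
    return "$missing"
}

check_git_credentials || echo "git may fall back to a prompt, which is disabled in the agent"
```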

### Environment
- clearml-agent version: 1.7.0
- Docker image: `python:3.10-slim`
- Host OS: Ubuntu 22.04

### Error Logs

```text
::: Using Cached environment /root/.clearml/venvs-cache/d99b7ac78c9f00157b7d88b26e395d7e :::
11:27:21.197479 git.c:460               trace: built-in: git config --global --replace-all safe.directory '*'
error: could not write config file /root/.gitconfig: Device or resource busy
Using cached repository in "/root/.clearml/vcs-cache/md-ap-feature-engineering.git.07c9b3f5f387de85ee33f17cae806c1f/md-ap-feature-engineering.git"
11:27:21.200445 git.c:460               trace: built-in: git fetch --all --recurse-submodules
11:27:21.200831 run-command.c:655       trace: run_command: GIT_DIR=.git git remote-https origin https://github.com/silencetherapeutics/md-ap-feature-engineering.git
11:27:21.201988 git.c:750               trace: exec: git-remote-https origin https://github.com/silencetherapeutics/md-ap-feature-engineering.git
11:27:21.202020 run-command.c:655       trace: run_command: git-remote-https origin https://github.com/silencetherapeutics/md-ap-feature-engineering.git
fatal: could not read Username for 'https://github.com': terminal prompts disabled
Repository cloning failed: Command '['git', 'fetch', '--all', '--recurse-submodules']' returned non-zero exit status 128.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/__main__.py", line 87, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/__main__.py", line 83, in main
    return run_command(parser, args, command_name)
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/__main__.py", line 46, in run_command
    return func(**args_dict)
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/commands/base.py", line 63, in newfunc
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/commands/worker.py", line 2611, in execute
    directory, vcs, repo_info = self.get_repo_info(
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/commands/worker.py", line 2883, in get_repo_info
    vcs, repo_info = self._get_repo_info(execution, task, venv_folder)
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/commands/worker.py", line 2919, in _get_repo_info
    vcs, repo_info = clone_repository_cached(
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/helper/repo.py", line 781, in clone_repository_cached
    vcs.pull()
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/helper/repo.py", line 599, in pull
    self.call("fetch", "--all", "--recurse-submodules", cwd=self.location)
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/helper/repo.py", line 659, in call
    return self._git_pass_auth_wrapper(super(Git, self).call, *argv, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/helper/repo.py", line 612, in _git_pass_auth_wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/helper/repo.py", line 435, in call
    return self._call_subprocess(subprocess.check_call, argv, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/helper/repo.py", line 495, in _call_subprocess
    return command.call_subprocess(func, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/helper/process.py", line 246, in call_subprocess
    return func(list(self), *args, **kwargs)
  File "/usr/local/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['git', 'fetch', '--all', '--recurse-submodules']' returned non-zero exit status 128
```

### Potential Areas for Investigation

jkhenning commented 8 months ago

Hi @AH-Merii,

Are you running the container as non-root?

AH-Merii commented 8 months ago

Hey @jkhenning,

No, the container is running as root.

jkhenning commented 7 months ago

Hi @AH-Merii,

Try deleting ~/.gitconfig on the host machine and see whether that works.
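
A slightly more cautious variant of this suggestion is to move the file aside rather than delete it, so it can be restored after the test. This is only a sketch: the helper name `backup_gitconfig` and the `.bak` suffix are illustrative choices, not part of clearml-agent.

```shell
# Move a git config file aside so it can be restored later.
# Defaults to the current user's ~/.gitconfig. An existing ".bak"
# file at the destination is overwritten by mv.
backup_gitconfig() {
    file=${1:-$HOME/.gitconfig}
    if [ -f "$file" ]; then
        mv "$file" "$file.bak"
        echo "moved $file to $file.bak"
    else
        echo "no file at $file; nothing to do"
    fi
}

# Usage on the host (not inside the container):
#   backup_gitconfig
# To restore afterwards:
#   mv "$HOME/.gitconfig.bak" "$HOME/.gitconfig"
```

After moving the file, re-run the task and check whether the `Device or resource busy` error is gone.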