### Description
When executing tasks with clearml-agent inside a Docker container, operations that write to the `.gitconfig` file fail. Specifically, the command `git config --global --replace-all safe.directory '*'` fails with the error `could not write config file /root/.gitconfig: Device or resource busy`. The issue persists even though manual access, read, and write tests against `/root/.gitconfig` succeed when performed inside the container.

The failure to write to `.gitconfig` seems to occur only while clearml-agent is executing automated tasks, which suggests a problem with how file access or locking is handled in Docker containers orchestrated by clearml-agent.
### Steps to Reproduce
1. Execute a clearml-agent task within a Docker container that requires Git operations.
2. The task fails when it tries to mark all directories as safe in the global Git configuration, using the command `git config --global --replace-all safe.directory '*'`.
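
To check whether the failure depends on the agent or can be triggered directly, the failing command can be run by hand inside the same container with tracing enabled. The following is a minimal sketch, not part of clearml-agent: the helper name `run_traced` is hypothetical, and the commented example assumes `git` is on the `PATH` inside the container.

```python
import os
import subprocess


def run_traced(cmd, extra_env=None):
    """Run `cmd` and return (exit_code, stderr_text).

    Hypothetical helper for manual debugging; not part of clearml-agent.
    """
    env = dict(os.environ)
    env.update(extra_env or {})
    proc = subprocess.run(cmd, env=env, capture_output=True, text=True)
    return proc.returncode, proc.stderr


# Example (this modifies the global Git config, so run it only inside
# the container under test):
# code, err = run_traced(
#     ["git", "config", "--global", "--replace-all", "safe.directory", "*"],
#     extra_env={"GIT_TRACE": "1"},
# )
# print(code, err)
```

A non-zero exit code together with the `Device or resource busy` message on stderr would show that the failure is reproducible without the agent.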
### Additional Context
- We enabled `GIT_TRACE=1` for more detailed output on Git operations.
- The issue appears to be related to how clearml-agent interacts with the `.gitconfig` file within Docker containers, particularly concerning file locking or access permissions.
- Deleting the `vcs-cache` directory (`/root/.clearml/vcs-cache`, per the error logs) allows the task to proceed successfully, suggesting the problem may be linked to the caching mechanism or to file access within this cache.
- This behavior raises concerns about file locking, `.gitconfig` access, or interactions between Docker, clearml-agent, and Git within the containerized environment.
- The agent runs on an EC2 instance and is configured through environment variables:
  - The PAT (personal access token) is passed through the `CLEARML_AGENT_GIT_PASS` environment variable.
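
One hypothesis that would be consistent with both observations (manual writes succeed, `git config` fails) is that `/root/.gitconfig` is bind-mounted into the container as a single file. This is an assumption, not something the logs confirm. Git updates a config file by writing a sibling lock file and renaming it over the original, and on Linux renaming onto a mount point fails with `EBUSY` ("Device or resource busy"), while an in-place write to the same file succeeds. The sketch below checks for that situation. The helper names are hypothetical, and the mount check relies on the Linux-specific `/proc/self/mountinfo`.

```python
import os
import tempfile


def is_mount_point(path, mountinfo="/proc/self/mountinfo"):
    """Return True if `path` is listed as a mount point (Linux only)."""
    target = os.path.realpath(path)
    try:
        with open(mountinfo) as fh:
            for line in fh:
                fields = line.split()
                # The fifth field is the mount point as seen by this process.
                if len(fields) > 4 and fields[4] == target:
                    return True
    except OSError:
        pass  # no procfs available; treat as "not detected"
    return False


def try_rename_over(path):
    """Mimic a write-then-rename update; return None on success, else the OSError.

    Note: on success the file is replaced by a byte-identical copy
    (created with mkstemp's default 0600 permissions).
    """
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, "rb") as fh:
        data = fh.read()
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".rename-probe-")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
        os.replace(tmp, path)
        return None
    except OSError as exc:
        # Clean up the probe file if the rename did not happen.
        if os.path.exists(tmp):
            os.unlink(tmp)
        return exc
```

Inside the task container, `is_mount_point("/root/.gitconfig")` returning `True` together with `try_rename_over("/root/.gitconfig")` returning an error whose `errno` is `errno.EBUSY` would support this hypothesis. If both checks pass cleanly, the cause lies elsewhere.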
### Environment
- clearml-agent version: 1.7.0
- Docker image: `python:3.10-slim`
- Host OS: Ubuntu 22.04
### Error Logs
```text
::: Using Cached environment /root/.clearml/venvs-cache/d99b7ac78c9f00157b7d88b26e395d7e :::
11:27:21.197479 git.c:460 trace: built-in: git config --global --replace-all safe.directory '*'
error: could not write config file /root/.gitconfig: Device or resource busy
Using cached repository in "/root/.clearml/vcs-cache/md-ap-feature-engineering.git.07c9b3f5f387de85ee33f17cae806c1f/md-ap-feature-engineering.git"
11:27:21.200445 git.c:460 trace: built-in: git fetch --all --recurse-submodules
11:27:21.200831 run-command.c:655 trace: run_command: GIT_DIR=.git git remote-https origin https://github.com/silencetherapeutics/md-ap-feature-engineering.git
11:27:21.201988 git.c:750 trace: exec: git-remote-https origin https://github.com/silencetherapeutics/md-ap-feature-engineering.git
11:27:21.202020 run-command.c:655 trace: run_command: git-remote-https origin https://github.com/silencetherapeutics/md-ap-feature-engineering.git
fatal: could not read Username for 'https://github.com': terminal prompts disabled
Repository cloning failed: Command '['git', 'fetch', '--all', '--recurse-submodules']' returned non-zero exit status 128.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/__main__.py", line 87, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/__main__.py", line 83, in main
    return run_command(parser, args, command_name)
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/__main__.py", line 46, in run_command
    return func(**args_dict)
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/commands/base.py", line 63, in newfunc
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/commands/worker.py", line 2611, in execute
    directory, vcs, repo_info = self.get_repo_info(
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/commands/worker.py", line 2883, in get_repo_info
    vcs, repo_info = self._get_repo_info(execution, task, venv_folder)
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/commands/worker.py", line 2919, in _get_repo_info
    vcs, repo_info = clone_repository_cached(
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/helper/repo.py", line 781, in clone_repository_cached
    vcs.pull()
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/helper/repo.py", line 599, in pull
    self.call("fetch", "--all", "--recurse-submodules", cwd=self.location)
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/helper/repo.py", line 659, in call
    return self._git_pass_auth_wrapper(super(Git, self).call, *argv, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/helper/repo.py", line 612, in _git_pass_auth_wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/helper/repo.py", line 435, in call
    return self._call_subprocess(subprocess.check_call, argv, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/helper/repo.py", line 495, in _call_subprocess
    return command.call_subprocess(func, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/helper/process.py", line 246, in call_subprocess
    return func(list(self), *args, **kwargs)
  File "/usr/local/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['git', 'fetch', '--all', '--recurse-submodules']' returned non-zero exit status 128
```
### Potential Areas for Investigation
- Interactions between Docker volume mounts (especially for `.gitconfig` and `vcs-cache`) and clearml-agent's file handling.
- How clearml-agent manages Git configuration and operations within Docker containers, particularly regarding global settings and cached environments.
The environment variables for the agent are prepared with the following script:

```bash
# Collect all environment variables starting with CLEARML and join them with a comma
CLEARML_ENV_VARS=$(env | grep ^CLEARML | cut -d '=' -f 1 | tr '\n' ',' | sed 's/,$//')
# Set the CLEARML_AGENT_DOCKER_ARGS_HIDE_ENV variable with the collected names
export CLEARML_AGENT_DOCKER_ARGS_HIDE_ENV=$CLEARML_ENV_VARS
export CLEARML_WORKER_NAME=""
export CLEARML_WORKER_ID=""
export CLEARML_AGENT_EXTRA_DOCKER_ARGS=""
```