allegroai / clearml-agent

ClearML Agent - ML-Ops made easy. ML-Ops scheduler & orchestration solution
https://clear.ml/docs/
Apache License 2.0
232 stars 90 forks source link

clearml-agent with poetry crash #134

Open idantene opened 1 year ago

idantene commented 1 year ago

See detailed Slack thread.

The tldr is that, at least when running as an autoscaler, the clearml-agent fails with poetry as the package manager with the following error:

Traceback (most recent call last):
  File "/clearml_agent_venv/bin/poetry", line 5, in <module>
    from poetry.console.application import main
ModuleNotFoundError: No module named 'poetry'

Now, poetry is installed on the machine (I've tried various approaches, see the thread) - all requirements are installed via poetry. The environment also correctly identifies the poetry CLI as existing in the remote machine, as it tries to run poetry run python -u train.py:

Running task id [20af069dc8bf49e49db28e662b7804f0]:
[.]$ poetry run python -u train.py
Summary - installed python packages:
pip:
- 'aiobotocore==2.3.4 # Async client for aws services usin...'
- 'aiohttp==3.8.3 # Async http client/server framework...'
<... many more packages>

Environment setup completed successfully

Starting Task Execution:

Traceback (most recent call last):
  File "/clearml_agent_venv/bin/poetry", line 5, in <module>
    from poetry.console.application import main
ModuleNotFoundError: No module named 'poetry'
Process failed, exit code 1

From what I gather, all the various agent settings apply to the top-level agent environment (/clearml_agent_venv/), but when this agent daemon creates the environment for the individual tasks, the poetry module is missing (even if, e.g., it is mentioned in clearml.conf under the agent.package_manager.priority_packages.

idantene commented 1 year ago

One thought that came to mind, is that when setting e.g. agent.package_manager.system_site_packages = true, then this does not apply to virtual environment created with poetry to begin-with. It should update those too, probably.

jkhenning commented 1 year ago

Hi @idantene,

Following up here...

then this does not apply to virtual environment created with poetry to begin-with

poetry needs to be installed on the system packages, and by default poetry will Not inherit from the system packages, it creates a new venv. Also could you provide the full log? I'm not sure where exactly are you getting the missing module

idantene commented 1 year ago

I understand @jkhenning, and as mentioned, it is installed on the system packages.

The above is the full log, unfortunately (I only trimmed the gazillion packages installed with poetry). It pops up when clearml-agent is trying to run poetry run python -u train.py (right before the list of installed packages).

jkhenning commented 1 year ago

@idantene can you get the machine log itself? You should be able to from the AWS cloud console

idantene commented 1 year ago

Sure, there's nothing of interest there though. It installs the packages, poetry, etc, and then runs the agent to listen to the queue.

Is there anything in particular you're after?

jkhenning commented 1 year ago

I would really like to see what's installed and where

idantene commented 1 year ago

@jkhenning The AWS autoscaler uses an Ubuntu 22.04 AMI, and then it runs the following on new instances (the content of extra_vm_bash_script; comments added for ease of reading):

# Some dependencies we need for our libraries
apt-get install -y gfortran libopenblas-dev liblapack-dev libpq-dev python-is-python3 python3-pip python3-dev proj-bin libgraphviz-dev graphviz graphviz-dev libgdal-dev
# Ensuring Python 3.7, 3.8, 3.9, and 3.10 (comes with Ubuntu 22.04) are available
apt-get install software-properties-common -y
add-apt-repository ppa:deadsnakes/ppa -y
apt update
apt install python3.7 python3.8 python3.9 python3.7-distutils python3.8-distutils python3.9-distutils python3.10-distutils python3.7-dev python3.8-dev python3.9-dev python3.10-dev -y
# We use the https+git authorization rather than the ssh+git
git config --system credential.helper \"store --file /root/.git-credentials\"
# Ensure a virtualenv exists for each python; needed because `clearml-agent` calls `pythonX.Y -m virtualenv` (see clearml_agent/helper/package/pip_api/venv.py#L68)
python3.7 -m pip install virtualenv
python3.8 -m pip install virtualenv
python3.9 -m pip install virtualenv
python3.10 -m pip install virtualenv
# Install poetry (as per official docs)
curl -sSL https://install.python-poetry.org | python3 -
export PATH=\"/root/.local/bin:${PATH}\"
# Export environment variables for easy AWS access
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
jkhenning commented 1 year ago

@idantene is this the contents of the extra_vm_bash_script or the actual installation log of this script being executed?

idantene commented 1 year ago

This is the contents of extra_vm_bash_script as mentioned above. This is the only preparations we do on our end, everything else is taken care of by clearml-agent.

jkhenning commented 1 year ago

OK, but can you get the actual system log from the AWS machine? it should include the execution output of this script (as well as other stuff)

idantene commented 1 year ago

I will have to find a moment to redo the autoscaler so that it uses poetry (we've since reverted back to pip for the meantime), but when I did look at it - everything went fine (no errors or warnings). It installed clearml-agent v1.5.1, and ran these additional steps.

I'll get back to you with a raw system log at a later time.