Closed amaltaro closed 1 year ago
Correct. There are a number of containers here: https://github.com/dmwm/Docker
They might at least provide a starting point. However, these don't really follow the true containerization model. They tend to try to set up a "WMAgent node", meaning that MySQL is embedded in the same container as the Agent, etc. I could never use these for unit tests, as the IO of using MySQL's database files inside the container was not good enough.
But, once you start to separate things out, you'll be better off.
The containers in that repo are all built automatically on CERN's GitLab infrastructure and made available. We can add new ones if needed.
Correct - it'd be a great start, but we would want to separate out the dependent services (database, HTCondor) into separate containers -- or bind-mount them from the host.
@goughes Erik, as we discussed today, I have some initial work on my own repo: https://github.com/amaltaro/WMCore-Docker
I provided an initial Dockerfile to build the docker image, and it was built successfully; see details in https://github.com/dmwm/WMCore/issues/11310. The individual Dockerfiles can be found over here: https://github.com/dmwm/CMSKubernetes/tree/master/docker/pypi
CMSKubernetes repo commits for WMCore Dockerfiles:
Hi @amaltaro @vkuznet, as this issue (actually part of it) is marked quite high in the T0's section of the quarterly planning document, and we also agreed this should be turned into a meta issue, let me suggest splitting it into the following set of issues and adding bits of it to the 2023/Q2 project board, so we can progress on it:
Alan, Please correct me if I have missed something here or you do not agree with the suggested list. But one thing is for sure though, if we start with the first one we will be able to achieve both:
Hi @amaltaro To continue on this: We actually may already have a set of issues that are covering pieces of this plan as expressed in my previous comment. Here are those that I have found could be related to the respective bullets from above:
wmagent package uploaded to PyPI: [1]
Some of those issues' descriptions may need refactoring to include missing pieces, but I am confident we can now easily create a single issue for the first step and turn this one into a full meta issue pointing to this set of subtasks. Please let me know what you think.
[1]
(WMCore.venv3) [user@unit02 WMCore.venv3]$ pip install wmagent
Collecting wmagent
Downloading wmagent-2.2.0.2.tar.gz (1.2 MB)
|████████████████████████████████| 1.2 MB 3.6 MB/s
Preparing metadata (setup.py) ... done
Collecting Cheetah3~=3.2.6.post1
...
Successfully built wmagent
Installing collected packages: cffi, pynacl, cryptography, bcrypt, pyrsistent, pyparsing, pycurl, MarkupSafe, docutils, contextlib2, SQLAlchemy, Sphinx, pyzmq, psutil, mysqlclient, httplib2, htcondor, cx-Oracle, coverage, CherryPy, Cheetah3, wmagent
...
(WMCore.venv3) [user@unit02 WMCore.venv3]$ ls -1 lib/python3.6/site-packages/WMComponent/
AgentStatusWatcher
AnalyticsDataCollector
ArchiveDataReporter
DBS3Buffer
ErrorHandler
__init__.py
JobAccountant
JobArchiver
JobCreator
JobStatusLite
JobSubmitter
JobTracker
JobUpdater
__pycache__
RetryManager
RucioInjector
TaskArchiver
WorkQueueManager
Thank you for looking into this, Todor.
Valentin and I had some discussion today on this project and here is a meta-issue and all its 10 sub-tasks: https://github.com/dmwm/WMCore/issues/11314
If you feel like GH issues need to be updated, please comment on each of them. If you feel like something is missing, please create a new one.
Thanks, @amaltaro. The two issues from the list in the meta-issue https://github.com/dmwm/WMCore/issues/11314 that I am interested in and plan to tackle in the short term are the following:
They go hand in hand and could be solved simultaneously, I believe. I am quite close to that point already. We exchanged some info with @germanfgv, who helped me by sharing his experience with the T0 containers (thanks, German, for that). We also have two previous efforts to produce a fully functional Docker container for the agent:
Provided by @vkuznet and @goughes. Both of those were delayed and did not get into production because of dependencies on solving other pieces from the deployment process migration (like moving away from RPM based deployment scripts or applying configuration files at runtime).
The way I foresee handling the problem here (and I am already working on it) is to combine the two approaches and split the deploy-wmagent.sh script into two pieces:
install.sh: the part which should cover the pypi deployment process and the base directory configuration in the container. This should be referred to at build time.
run.sh: the part which should cover regenerating the config files at runtime, based on a few externally provided parameters at startup, and also take care of credential propagation through preset mount points from the host. Some really good examples of how those mount points should be used for credential propagation are listed in Erik's documentation for the previous docker image: https://github.com/dmwm/CMSKubernetes/tree/master/docker/wmagent#readme
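To make the runtime half of that split more concrete, here is a minimal sketch of regenerating a config file from a few startup parameters. It is only an illustration and not the content of run.sh: the template keys, the default values and the function name render_config are all hypothetical, and the real WMAgent configuration files look different.

```python
from string import Template

# Hypothetical template; the real WMAgent configuration is not structured like this.
CONFIG_TEMPLATE = Template(
    "agent_host = $host\n"
    "team_name = $team\n"
    "central_services = $central_services\n"
)

# Hypothetical defaults, used when nothing is supplied at container startup.
DEFAULTS = {
    "host": "localhost",
    "team": "testbed",
    "central_services": "cmsweb-testbed.cern.ch",
}


def render_config(overrides):
    """Merge runtime overrides into the defaults and fill in the template."""
    params = {**DEFAULTS, **overrides}
    return CONFIG_TEMPLATE.substitute(params)


if __name__ == "__main__":
    # Values a wrapper could pass in at startup; anything omitted falls back to DEFAULTS.
    print(render_config({"host": "vocms0260.cern.ch", "team": "testbed-vocms0260"}))
```

The point of the design is that everything depending on the host stays out of the image and is filled in only when the container starts.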
Just a heads up.
In order to continue on that, some additional requests need to be made to the VOC and to the SI Team, because we will need docker-ce installed on at least one of our test agents (if not all of them). I am about to ask for permission in the relevant channels.
FYI: @vkuznet @amaltaro @khurtado
@todor-ivanov
You said this is one of the issues you would like to consider, but it is not related to what we are discussing here. Please let us know what is the correct issue number that you wanted to mention.
... because we will need docker-ce installed in a...
I've seen your email. For proper recording, can you please clarify which node you actually requested this package to be installed to?
Hi @amaltaro,
Please let us know what is the correct issue number that you wanted to mention.
Sorry for the typo; the correct issue is:
For proper recording, can you please clarify which node you actually requested this package to be installed to?
Yes, the docker-ce package was installed on vocms0260 using the puppet module provided by CERN IT, and the non-root user allowed access to the docker engine is cmst1.
Just to mention it here so we can correlate both issues: I believe that in the process of solving the current issue we will obtain answers to most of the questions asked here: https://github.com/dmwm/WMCore/issues/11570. Once I have more info, I will update the other issue as well.
FYI: @khurtado @amaltaro @vkuznet
Ok, now this is just a status report.
We have reached an interim state here.
With my latest commits at https://github.com/dmwm/CMSKubernetes/pull/1364 I can already successfully start (modulo a few components) a WMAgent from a docker image deployed from the wmagent package uploaded to pypi. The minimum package version that would work is wmagent==2.2.1rc2 and no earlier.
The configuration is as follows:
vocms0260.cern.ch, with a few mount points from the host for preserving state-related data and databases throughout agent restarts: /data/dockerMount, mounted on /data/ in the container (for safety reasons origin and destination differ)
/data/srv/wmagent/2.2.1rc2/install/mysql/logs/mysql.sock
cmsweb-testbed.cern.ch
Here is a complete and clean startup log: [1]. For starting the agent I used the wrapper created for this purpose, ./wmagent-docker-run.sh &, without specifying any runtime parameters, using only defaults.
Here is the agent status from inside the container: [2]. Unfortunately some of the components were left down during agent startup because of a broken dependency inside the python package, which I am investigating right now. Here is the log from manage start-agent: [3].
NOTE: Even though we have reached a milestone (being able to run and manage a basic container deployed from pypi), there is some more work to be done before this PR, https://github.com/dmwm/CMSKubernetes/pull/1364, can be called ready for review, due to the several nasty hacks which I had to implement so far. Here is the list of things to be done:
yui package one https://github.com/dmwm/WMCore/issues/10597 - we will be flying with this one for a while)
manage script and upload it inside the CMSKubernetes repository itself instead of downloading it from dmwm/deployment repository
manage script
run.sh script at runtime
wmagent-docker-run.sh wrapper such that the root mount point from the host is parameterized at runtime.
localhost instead of socket, which will then wrongly point to the host .... instead of the internal database. This will also reduce future host to docker agent port collisions (which we were warned have been problematic for the T0 team when they were trying to run rpm based docker containers in a VM)
FYI: @amaltaro @vkuznet @khurtado
[3]
(WMAgent.dock) [cmst1@vocms0260:/data/srv/wmagent/current]$ manage start-agent
Starting WMAgent...
Checking default database connection... ok.
Starting components: ['WorkQueueManager', 'DBS3Upload', 'JobAccountant', 'JobCreator', 'JobSubmitter', 'JobTracker', 'JobStatusLite', 'JobUpdater', 'ErrorHandler', 'RetryManager', 'JobArchiver', 'TaskArchiver', 'AnalyticsDataCollector', 'ArchiveDataReporter', 'AgentStatusWatcher', 'RucioInjector']
Starting : WorkQueueManager
Starting WorkQueueManager as a daemon
Log will be in /data/srv/wmagent/current/install/wmagentpy3/WorkQueueManager
Waiting 1 seconds, to ensure daemon file is created
started with pid 2465
Starting : DBS3Upload
DBS3Upload.__init__
Starting DBS3Upload as a daemon
Log will be in /data/srv/wmagent/current/install/wmagentpy3/DBS3Upload
Waiting 1 seconds, to ensure daemon file is created
started with pid 2483
Starting : JobAccountant
Starting JobAccountant as a daemon
Log will be in /data/srv/wmagent/current/install/wmagentpy3/JobAccountant
Waiting 1 seconds, to ensure daemon file is created
started with pid 2501
Starting : JobCreator
JobCreator.__init__
Starting JobCreator as a daemon
Log will be in /data/srv/wmagent/current/install/wmagentpy3/JobCreator
Waiting 1 seconds, to ensure daemon file is created
started with pid 2514
Starting : JobSubmitter
Starting JobSubmitter as a daemon
Log will be in /data/srv/wmagent/current/install/wmagentpy3/JobSubmitter
Waiting 1 seconds, to ensure daemon file is created
started with pid 2528
Starting : JobTracker
JobTracker.__init__
Starting JobTracker as a daemon
Log will be in /data/srv/wmagent/current/install/wmagentpy3/JobTracker
Waiting 1 seconds, to ensure daemon file is created
started with pid 2551
Starting : JobStatusLite
JobStatusLite.__init__
Starting JobStatusLite as a daemon
Log will be in /data/srv/wmagent/current/install/wmagentpy3/JobStatusLite
Waiting 1 seconds, to ensure daemon file is created
started with pid 2570
Starting : JobUpdater
Starting JobUpdater as a daemon
Log will be in /data/srv/wmagent/current/install/wmagentpy3/JobUpdater
Waiting 1 seconds, to ensure daemon file is created
started with pid 2584
Starting : ErrorHandler
Starting ErrorHandler as a daemon
Log will be in /data/srv/wmagent/current/install/wmagentpy3/ErrorHandler
Waiting 1 seconds, to ensure daemon file is created
started with pid 2600
Starting : RetryManager
Starting RetryManager as a daemon
Log will be in /data/srv/wmagent/current/install/wmagentpy3/RetryManager
Waiting 1 seconds, to ensure daemon file is created
started with pid 2614
Starting : JobArchiver
JobArchiver.__init__
Starting JobArchiver as a daemon
Log will be in /data/srv/wmagent/current/install/wmagentpy3/JobArchiver
Waiting 1 seconds, to ensure daemon file is created
started with pid 2628
Starting : TaskArchiver
Traceback (most recent call last):
File "/data/srv/wmagent/current/install/wmagent/bin/wmcoreD", line 348, in <module>
startup(config)
File "/data/srv/wmagent/current/install/wmagent/bin/wmcoreD", line 221, in startup
componentObject = factory.loadObject(classname = namespace, args = config)
File "/usr/local/lib/python3.8/site-packages/WMCore/WMFactory.py", line 58, in loadObject
module = __import__(module, globals(), locals(), [classname])
File "/usr/local/lib/python3.8/site-packages/WMComponent/TaskArchiver/TaskArchiver.py", line 20, in <module>
from WMComponent.TaskArchiver.CleanCouchPoller import CleanCouchPoller
File "/usr/local/lib/python3.8/site-packages/WMComponent/TaskArchiver/CleanCouchPoller.py", line 28, in <module>
from WMCore.DataStructs.MathStructs.DiscreteSummaryHistogram import DiscreteSummaryHistogram
ModuleNotFoundError: No module named 'WMCore.DataStructs.MathStructs'
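The traceback above comes from a subpackage that is missing in the installed python package. A quick way to spot this kind of gap before starting the components is to ask the import system whether each module can be located. This is only a sketch using the standard library; the helper name missing_modules is made up for illustration, and the module names in the example call are simply the ones that appear in the traceback.

```python
import importlib.util


def missing_modules(names):
    """Return the dotted module names that the import system cannot locate."""
    missing = []
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except ModuleNotFoundError:
            # Raised when a parent package of `name` is itself missing.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Module names taken from the traceback; run inside the container to check them.
    print(missing_modules([
        "WMComponent.TaskArchiver.CleanCouchPoller",
        "WMCore.DataStructs.MathStructs.DiscreteSummaryHistogram",
    ]))
```

Note that find_spec imports the parent packages of a dotted name, so this is best run in the same environment as the agent.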
[2]
cmst1@vocms0260:/data/srv/wmagent/current $ docker exec -it wmagent /bin/bash
(WMAgent.dock) [cmst1@vocms0260:/data/srv/wmagent/current]$
(WMAgent.dock) [cmst1@vocms0260:/data/srv/wmagent/current]$ ps auxf
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
cmst1 2628 0.0 0.7 362592 52284 ? Sl 15:15 0:00 python /data/srv/wmagent/current/install/wmagent/bin/wmcoreD --start --config=/data/srv/wmagent/2.2.1rc2/config/wmagentpy3/
cmst1 2614 0.0 0.6 361824 50868 ? Sl 15:15 0:00 python /data/srv/wmagent/current/install/wmagent/bin/wmcoreD --start --config=/data/srv/wmagent/2.2.1rc2/config/wmagentpy3/
cmst1 2600 0.0 0.7 362528 51556 ? Sl 15:15 0:00 python /data/srv/wmagent/current/install/wmagent/bin/wmcoreD --start --config=/data/srv/wmagent/2.2.1rc2/config/wmagentpy3/
cmst1 2584 0.0 0.6 362080 51100 ? Sl 15:15 0:00 python /data/srv/wmagent/current/install/wmagent/bin/wmcoreD --start --config=/data/srv/wmagent/2.2.1rc2/config/wmagentpy3/
cmst1 2570 0.0 0.6 362080 51008 ? Sl 15:15 0:01 python /data/srv/wmagent/current/install/wmagent/bin/wmcoreD --start --config=/data/srv/wmagent/2.2.1rc2/config/wmagentpy3/
cmst1 2551 0.0 0.6 362080 51104 ? Sl 15:15 0:01 python /data/srv/wmagent/current/install/wmagent/bin/wmcoreD --start --config=/data/srv/wmagent/2.2.1rc2/config/wmagentpy3/
cmst1 2528 0.0 0.7 363360 52848 ? Sl 15:15 0:01 python /data/srv/wmagent/current/install/wmagent/bin/wmcoreD --start --config=/data/srv/wmagent/2.2.1rc2/config/wmagentpy3/
cmst1 2514 0.0 0.6 361568 50580 ? Sl 15:15 0:00 python /data/srv/wmagent/current/install/wmagent/bin/wmcoreD --start --config=/data/srv/wmagent/2.2.1rc2/config/wmagentpy3/
cmst1 2501 0.0 0.6 361568 50812 ? Sl 15:15 0:00 python /data/srv/wmagent/current/install/wmagent/bin/wmcoreD --start --config=/data/srv/wmagent/2.2.1rc2/config/wmagentpy3/
cmst1 2483 0.0 0.7 365380 55516 ? Sl 15:15 0:02 python /data/srv/wmagent/current/install/wmagent/bin/wmcoreD --start --config=/data/srv/wmagent/2.2.1rc2/config/wmagentpy3/
cmst1 2465 0.8 0.8 586604 62072 ? Sl 15:15 0:18 python /data/srv/wmagent/current/install/wmagent/bin/wmcoreD --start --config=/data/srv/wmagent/2.2.1rc2/config/wmagentpy3/
cmst1 2061 0.0 0.0 6040 2400 pts/0 Ss 15:05 0:00 /bin/bash
cmst1 4289 0.0 0.0 8636 1656 pts/0 R+ 15:53 0:00 \_ ps auxf
cmst1 1 0.0 0.0 6436 2612 ? Ss 15:04 0:00 /bin/bash ./run.sh
cmst1 629 0.0 0.0 2468 760 ? S 15:04 0:00 /bin/sh /usr/bin/mysqld_safe --defaults-extra-file=/data/srv/wmagent/2.2.1rc2/config/mysql/my.cnf --datadir=/data/srv/wmage
cmst1 821 0.8 12.3 8606620 902992 ? Sl 15:04 0:24 \_ /usr/sbin/mariadbd --defaults-extra-file=/data/srv/wmagent/2.2.1rc2/config/mysql/my.cnf --basedir=/usr --datadir=/data/
cmst1 2058 0.0 0.0 6436 1464 ? S 15:05 0:00 /bin/bash ./run.sh
cmst1 4270 0.0 0.0 4272 628 ? S 15:53 0:00 \_ sleep 10
(WMAgent.dock) [cmst1@vocms0260:/data/srv/wmagent/current]$ manage status
+ Couch Status:
++ {"couchdb":"Welcome","uuid":"08e24bb5fe541033900035c5f2cf85fc","version":"1.6.1","vendor":{"version":"1.6.1","name":"The Apache Software Foundation"}}
+ Status of MySQL
++ MYSQL running with process: 821
++ Uptime: 699 Threads: 17 Questions: 5284 Slow queries: 0 Opens: 139 Open tables: 46 Queries per second avg: 7.559
Status of WMAgent:
Checking default database connection... ok.
Status components: ['WorkQueueManager', 'DBS3Upload', 'JobAccountant', 'JobCreator', 'JobSubmitter', 'JobTracker', 'JobStatusLite', 'JobUpdater', 'ErrorHandler', 'RetryManager', 'JobArchiver', 'TaskArchiver', 'AnalyticsDataCollector', 'ArchiveDataReporter', 'AgentStatusWatcher', 'RucioInjector']
Component:WorkQueueManager Running:2465
Component:DBS3Upload Running:2483
Component:JobAccountant Running:2501
Component:JobCreator Running:2514
Component:JobSubmitter Running:2528
Component:JobTracker Running:2551
Component:JobStatusLite Running:2570
Component:JobUpdater Running:2584
Component:ErrorHandler Running:2600
Component:RetryManager Running:2614
Component:JobArchiver Running:2628
Component:TaskArchiver Not Running
Component:AnalyticsDataCollector Not Running
Component:ArchiveDataReporter Not Running
Component:AgentStatusWatcher Not Running
Component:RucioInjector Not Running
[1]
cmst1@vocms0260:~/CMSKubernetes/docker/pypi/wmagent $ ./wmagent-docker-run.sh &
[1] 2210583
=======================================================
Starting WMAgent with the following initialisation data:
-------------------------------------------------------
- WMAgent Version : 2.2.1rc2
- WMAgent User : cmst1
- WMAgent Root path : /data
- WMAgent Host : vocms0260.cern.ch
- WMAgent TeamName : testbed-vocms0260
- WMAgent Number : 0
- WMAgent CentralServices : cmsweb-testbed.cern.ch
- WMAgent Relational DB type : mysql
- Python Verson : Python 3.8.16
- Python Module path : /usr/local/lib/python3.8/site-packages
=======================================================
-------------------------------------------------------
Start: Performing basic setup checks...
Done: Performing basic setup checks...
-------------------------------------------------------
check_wmasecrets: Checking for changes in the WMAgent.secrets file
check_wmasecrets: No change fund.
-------------------------------------------------------
Start: Performing checks for successful Docker initialisation steps...
WMA_BUILD_ID: 68701503249744219753aea0c5924c8b274aa00de917f8a04144aae8c8972b47
dockerInitId: /data/admin/wmagent/hostadmin/.dockerInit
/data/srv/wmagent/current/config/.dockerInit
/data/srv/wmagent/current/config/couchdb/.dockerInit
/data/srv/wmagent/current/config/mysql/.dockerInit
/data/srv/wmagent/current/config/rucio/.dockerInit
/data/srv/wmagent/current/config/wmagent/.dockerInit
/data/srv/wmagent/current/install/.dockerInit
ERROR
-------------------------------------------------------
Start: Performing Docker image to Host initialisation steps
deploy_to_host: Initialise install
deploy_to_host: Initialise config
deploy_to_host: config service=wmagent
deploy_to_host: config service=mysql
deploy_to_host: config service=couchdb
deploy_to_host: config service=rucio
deploy_to_host: Initialise WMAgent.secrets
deploy_to_host: checking /data/admin/wmagent/hostadmin/WMAgent.secrets
Done: Performing Docker image to Host initialisation steps
-------------------------------------------------------
-------------------------------------------------------
Start: Performing local Docker image initialisation steps
deploy_to_container: Try Copying the host WMAgent.secrets file into the container admin area
deploy_to_container: Done
deploy_to_container: Updating WMAgent.secrets file with the current host's details
deploy_to_container: Double checking the final WMAgent.secrets file
deploy_to_container: Checking Certificates and Proxy
deploy_to_container: Checking Certificate lifetime:
deploy_to_container: Certifficate end date: Sep 7 12:04:12 2023 GMT
deploy_to_container: Checking myproxy lifetime:
deploy_to_container: myproxy end date: May 17 13:03:40 2023 GMT
deploy_to_container: OK
Done: Performing local Docker image initialisation steps
-------------------------------------------------------
-------------------------------------------------------
Start: Performing activate_agent
activate_agent: triggered.
Done: Performing activate_agent
-------------------------------------------------------
-------------------------------------------------------
Start: Performing start_services
Starting Services...
starting couch...
CouchDB has not been initialised... running pre initialisation
Initialising CouchDB on 127.0.0.1:5984...
With installation directory: /data/srv/wmagent/2.2.1rc2/install/couchdb
With configuration directory: /data/srv/wmagent/2.2.1rc2/config/couchdb
Which couchdb: With installation directory: /data/srv/wmagent/2.2.1rc2/install/couchdb
With configuration directory: /data/srv/wmagent/2.2.1rc2/config/couchdb
CouchDB has not been initialised... running post initialisation
Starting mysql...
MySQL has not been initialised... running pre initialisation
Installing the mysql database area...
Installing MariaDB/MySQL system tables in '/data/srv/wmagent/2.2.1rc2/install/mysql/database' ...
OK
To start mariadbd at boot time you have to copy
support-files/mariadb.service to the right place for your system
Two all-privilege accounts were created.
One is root@localhost, it has no password, but you need to
be system 'root' user to connect. Use, for example, sudo mysql
The second is cmst1@localhost, it has no password either, but
you need to be the system 'cmst1' user to connect.
After connecting you can set the password, if you would need to be
able to connect as any of these users with a password and without sudo
See the MariaDB Knowledgebase at https://mariadb.com/kb
You can start the MariaDB daemon with:
cd '/usr' ; /usr/bin/mariadb-safe --datadir='/data/srv/wmagent/2.2.1rc2/install/mysql/database'
You can test the MariaDB daemon with mysql-test-run.pl
cd '/usr/share/mysql/mysql-test' ; perl mariadb-test-run.pl
Please report any problems at https://mariadb.org/jira
The latest information about MariaDB is available at https://mariadb.org/.
Consider joining MariaDB's strong and vibrant community:
https://mariadb.org/get-involved/
starting mysqld_safe...
Checking MySQL Socket file exists...
Socket file exists: /data/srv/wmagent/2.2.1rc2/install/mysql/logs/mysql.sock
MySQL has not been initialised... running post initialisation
Installing the mysql schema...
Socket file exists, proceeding with schema install...
Installing WMAgent Database: wmagent
Checking Server connection...
Connection OK
Done: Performing start_services
-------------------------------------------------------
-------------------------------------------------------
Start: Performing init_agent
init_agent: triggered.
Initialising Agent...
DEBUG:root:Log file ready
DEBUG:root:Using SQLAlchemy v.1.4.48
INFO:root:Instantiating base WM DBInterface
DEBUG:root:Tables for WMCore.WMBS created
DEBUG:root:Tables for WMCore.Agent.Database created
DEBUG:root:Tables for WMComponent.DBS3Buffer created
DEBUG:root:Tables for WMCore.BossAir created
DEBUG:root:Tables for WMCore.ResourceControl created
checking default database connection
default database connection tested
Installing FWJRDump into wmagent_jobdump/fwjrs
Installing FWJRDump app into database: http://localhost:5984/wmagent_jobdump%2Ffwjrs
Installing JobDump into wmagent_jobdump/jobs
Installing JobDump app into database: http://localhost:5984/wmagent_jobdump%2Fjobs
Installing WMStatsAgent into wmagent_summary
Installing WMStatsAgent app into database: http://localhost:5984/wmagent_summary
Installing SummaryStats into stat_summary
Installing SummaryStats app into database: http://localhost:5984/stat_summary
Setting up cron jobs for the job dump.
Installing WorkQueue into workqueue
Installing WorkQueue app into database: http://localhost:5984/workqueue
Installing WorkQueue into workqueue_inbox
Installing WorkQueue app into database: http://localhost:5984/workqueue_inbox
Done: Performing init_agent
-------------------------------------------------------
-------------------------------------------------------
Start: Performing agent_tweakconfig
agent_tweakconfig: triggered.
agent_tweakconfig: Making agent configuration changes needed for Docker
agent_tweakconfig: Making other agent configuration changes
Done: Performing agent_tweakconfig
-------------------------------------------------------
-------------------------------------------------------
Start: Performing agent_resource_control
agent_resource_control: triggered.
agent_resource_control: Populating resource-control
agent_resource_control: Adding only T1 and T2 sites to resource-control...
Executing wmagent-resource-control --add-T1s --plugin=SimpleCondorPlugin --pending-slots=50 --running-slots=50 --down ...
Retrieved 7 maps from https://cms-cric.cern.ch/
Adding T1_US_FNAL to the resource control db...
Adding T1_DE_KIT to the resource control db...
Adding T1_ES_PIC to the resource control db...
Adding T1_FR_CCIN2P3 to the resource control db...
Adding T1_IT_CNAF to the resource control db...
Adding T1_RU_JINR to the resource control db...
Adding T1_UK_RAL to the resource control db...
Retrieved 16 PNNs from https://cms-cric.cern.ch/
Executing wmagent-resource-control --add-T2s --plugin=SimpleCondorPlugin --pending-slots=50 --running-slots=50 --down ...
Retrieved 52 maps from https://cms-cric.cern.ch/
Adding T2_CH_CERN_P5 to the resource control db...
Adding T2_CH_CERN_HLT to the resource control db...
Adding T2_CH_CSCS_HPC to the resource control db...
Adding T2_FR_GRIF_LLR to the resource control db...
Adding T2_FR_GRIF_IRFU to the resource control db...
Adding T2_US_Vanderbilt to the resource control db...
Adding T2_PL_Cyfronet to the resource control db...
Adding T2_AT_Vienna to the resource control db...
Adding T2_BE_IIHE to the resource control db...
Adding T2_BE_UCL to the resource control db...
Adding T2_BR_SPRACE to the resource control db...
Adding T2_BR_UERJ to the resource control db...
Adding T2_CH_CERN to the resource control db...
Adding T2_CH_CSCS to the resource control db...
Adding T2_CN_Beijing to the resource control db...
Adding T2_DE_DESY to the resource control db...
Adding T2_DE_RWTH to the resource control db...
Adding T2_EE_Estonia to the resource control db...
Adding T2_ES_CIEMAT to the resource control db...
Adding T2_ES_IFCA to the resource control db...
Adding T2_FI_HIP to the resource control db...
Adding T2_FR_GRIF to the resource control db...
Adding T2_FR_IPHC to the resource control db...
Adding T2_GR_Ioannina to the resource control db...
Adding T2_HU_Budapest to the resource control db...
Adding T2_IN_TIFR to the resource control db...
Adding T2_IT_Bari to the resource control db...
Adding T2_IT_Legnaro to the resource control db...
Adding T2_IT_Pisa to the resource control db...
Adding T2_IT_Rome to the resource control db...
Adding T2_KR_KISTI to the resource control db...
Adding T2_PK_NCP to the resource control db...
Adding T2_PL_Swierk to the resource control db...
Adding T2_PT_NCG_Lisbon to the resource control db...
Adding T2_RU_IHEP to the resource control db...
Adding T2_RU_INR to the resource control db...
Adding T2_RU_ITEP to the resource control db...
Adding T2_RU_JINR to the resource control db...
Adding T2_TR_METU to the resource control db...
Adding T2_TW_NCHC to the resource control db...
Adding T2_UA_KIPT to the resource control db...
Adding T2_UK_London_Brunel to the resource control db...
Adding T2_UK_London_IC to the resource control db...
Adding T2_UK_SGrid_Bristol to the resource control db...
Adding T2_UK_SGrid_RALPP to the resource control db...
Adding T2_US_Caltech to the resource control db...
Adding T2_US_Florida to the resource control db...
Adding T2_US_MIT to the resource control db...
Adding T2_US_Nebraska to the resource control db...
Adding T2_US_Purdue to the resource control db...
Adding T2_US_UCSD to the resource control db...
Adding T2_US_Wisconsin to the resource control db...
Retrieved 51 PNNs from https://cms-cric.cern.ch/
Done: Performing agent_resource_control
-------------------------------------------------------
-------------------------------------------------------
Start: Performing agent_upload_config
agent_upload_config: triggered.
agent_upload_config: Tweaking central agent configuration befre uploading
agent_upload_config: Testbed agent, setting MaxRetries to 0...
*** Upload WMAgentConfig to AuxDB ***
Executing wmagent-upload-config {"MaxRetries":0} ...
Pushing the following agent configuration:
{'AgentDrainMode': False,
'CondorJobsFraction': 0.75,
'CondorOverflowFraction': 0.2,
'DiskUseThreshold': 85,
'IgnoreDisks': ['/mnt/ramdisk'],
'MaxRetries': 0,
'NoRetryExitCodes': [70,
73,
8001,
8006,
8009,
8023,
8026,
8501,
50660,
50661,
50664,
71102,
71104,
71105],
'SpeedDrainConfig': {'CondorPriority': {'Enabled': False, 'Threshold': 500},
'EnableAllSites': {'Enabled': False, 'Threshold': 200},
'NoJobRetries': {'Enabled': False, 'Threshold': 200}},
'SpeedDrainMode': False,
'UserDrainMode': False}
Done: Performing agent_upload_config
-------------------------------------------------------
-------------------------------------------------------
Start: Performing checks for successful Docker initialisation steps...
WMA_BUILD_ID: 68701503249744219753aea0c5924c8b274aa00de917f8a04144aae8c8972b47
dockerInitId: 68701503249744219753aea0c5924c8b274aa00de917f8a04144aae8c8972b47
OK
-------------------------------------------------------
Start: Performing local Docker image initialisation steps
deploy_to_container: Try Copying the host WMAgent.secrets file into the container admin area
deploy_to_container: Done
deploy_to_container: Updating WMAgent.secrets file with the current host's details
deploy_to_container: Double checking the final WMAgent.secrets file
deploy_to_container: Checking Certificates and Proxy
deploy_to_container: Checking Certificate lifetime:
deploy_to_container: Certifficate end date: Sep 7 12:04:12 2023 GMT
deploy_to_container: Checking myproxy lifetime:
deploy_to_container: myproxy end date: May 17 13:03:40 2023 GMT
deploy_to_container: OK
Done: Performing local Docker image initialisation steps
-------------------------------------------------------
-------------------------------------------------------
Start: Performing start_services
Starting Services...
starting couch...
Which couchdb: With installation directory: /data/srv/wmagent/2.2.1rc2/install/couchdb
With configuration directory: /data/srv/wmagent/2.2.1rc2/config/couchdb
Starting mysql...
starting mysqld_safe...
Checking MySQL Socket file exists...
Socket file exists: /data/srv/wmagent/2.2.1rc2/install/mysql/logs/mysql.sock
Checking Server connection...
Connection OK
Done: Performing start_services
-------------------------------------------------------
-------------------------------------------------------
Start: Performing start_agent
-------------------------------------------------------
Start sleeping now ...zzz...
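For readers of the startup log above: the first "checks for successful Docker initialisation steps" pass ends in ERROR and the second one in OK, once dockerInitId matches WMA_BUILD_ID. A minimal sketch of how such a check could work is below. It assumes, as a guess from the log and not from the actual run.sh, that every .dockerInit file stores the build id of the image that initialised that directory. The function names are hypothetical.

```python
import tempfile
from pathlib import Path


def is_initialised(build_id, marker_files):
    """True only if every marker file exists and contains exactly build_id."""
    for marker in marker_files:
        path = Path(marker)
        if not path.is_file() or path.read_text().strip() != build_id:
            return False
    return True


def mark_initialised(build_id, marker_files):
    """Record build_id in every marker file, creating directories as needed."""
    for marker in marker_files:
        path = Path(marker)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(build_id + "\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-ins for the per-service .dockerInit files listed in the log.
        markers = [f"{tmp}/config/{svc}/.dockerInit" for svc in ("mysql", "couchdb")]
        build_id = "example-build-id"  # placeholder, not a real WMA_BUILD_ID
        if not is_initialised(build_id, markers):
            # Here the one-off host initialisation steps would run.
            mark_initialised(build_id, markers)
        print(is_initialised(build_id, markers))
```

With a scheme like this, restarting the same image skips the one-off steps, while a new image build (a different id) triggers them again.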
@todor-ivanov, thanks for the update and all the details. They are extremely useful. Meanwhile, I want to make a few observations:
Hi @vkuznet, thanks for the feedback.
I think packaging MariaDB within WMAgent container is a mistake, and only leads to complexity at different levels
For the time being I did it that way just to assemble everything in one place, make it run, and deliver an agent ready for testing. At first glance this design may seem a complication, but it actually simplifies things a lot and gives some pretty good benefits and flexibility, one of which is being able to run multiple docker agents on the same machine. It would also help later if we decide to separate the wmagent from the schedd.
The initial plan was to use the database from the host and not deal with MariaDB in this PR, but there were basically two ways to connect to it: through a socket or through the network interface. The socket is our default for the agent, but artificially sharing a socket between the host and a container is a security breach. In the end the database installation was really easy, and I also deliberately kept the commit related to the MariaDB installation separate, so that it is all in one place and we can later move it into a different Dockerfile, which could either be inherited within the agent or run as a separate container on the same host. There are a few possible architectures, the 3 main ones of which I can list (all with their benefits and drawbacks):
Given that the relational database is a monolithic piece of the agent which is under heavy stress and needs the least overhead, I do not see a good argument for separating it from the agent. When it comes to Oracle, there are other benefits and optimizations that Oracle provides which should be worth paying the network price, but I am not sure how big that price is or whether it would be worth paying with MariaDB as well. I was not around when such scale tests were performed and the eventual overhead estimated, so this must be done before concluding anything with certainty here.
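The two connection options discussed above (a local socket versus the network interface) differ mainly in which parameters the client passes. Here is a minimal sketch, assuming a hypothetical helper `db_connect_kwargs`. The keyword names follow those accepted by common Python MySQL drivers such as PyMySQL, the user, database and host names are placeholders, and none of this is taken from the WMAgent code itself. The socket path is the one printed in the startup log.

```python
def db_connect_kwargs(user, database, socket_path=None, host=None, port=3306):
    """Hypothetical helper: build connection keyword arguments for one transport."""
    if (socket_path is None) == (host is None):
        raise ValueError("pass exactly one of socket_path or host")
    kwargs = {"user": user, "database": database}
    if socket_path is not None:
        # Local transport: the socket file must be visible to the client,
        # e.g. when database and agent share a container or a bind mount.
        kwargs["unix_socket"] = socket_path
    else:
        # Network transport: works across containers and hosts.
        kwargs["host"] = host
        kwargs["port"] = port
    return kwargs


# Socket path as printed in the startup log; user/database names are placeholders.
local = db_connect_kwargs(
    "wmagent", "wmagent",
    socket_path="/data/srv/wmagent/2.2.1rc2/install/mysql/logs/mysql.sock",
)
# Placeholder host name, purely illustrative.
remote = db_connect_kwargs("wmagent", "wmagent", host="mariadb.example.org")
print(local)
print(remote)
```

Keeping the choice in one place like this would make it cheap to measure the network overhead later by switching only the transport.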
For the YUI part - I think we should completely get rid of it.
Oh, and I forgot to stress it enough. If we want to go towards real horizontal scaling, the place where we should slice is between the agent and the schedd, rather than thinking of breaking the whole agent into components. We can remake our internal agent design later in many, many ways. But the hard link between the agent and the schedd is the actual blocker for SI to be able to separate from the wmagent and implement their own schema for distributing the load among schedds, independently of the load in the agent. Those two loads have a different nature. Even though they come from the same place, the system hits different limitations with different origins in these two. So it is also natural to think of implementing different scalability schemas at those levels. This is something that is marvelously done in the CRAB system, and we should try it here as well.
FYI: @vkuznet @amaltaro @khurtado @klannon
Todor, it's great that you managed to start up most of the services. Thank you for this summary.
I have a few remarks to make:
++ {"couchdb":"Welcome","uuid":"08e24bb5fe541033900035c5f2cf85fc","version":"1.6.1","vendor":{"version":"1.6.1","name":"The Apache Software Foundation"}}
Performing start_services
...
This has been addressed with: https://github.com/dmwm/CMSKubernetes/pull/1393 Thanks Todor! Closing this one out.
Even though we considered the GH actions workflow, there was still a detail that was missed, and the CMSKubernetes PR actually broke our workflow: https://github.com/dmwm/WMCore/actions/runs/5427386861/jobs/9870613631#step:5:858
The important lines when building the docker image are:
---> 4389876fc8ff
Step 34/43 : ADD bin $WMA_DEPLOY_DIR/bin
ADD failed: file not found in build context or excluded by .dockerignore: stat bin: file does not exist
Error: Process completed with exit code 1.
I think the best option here would be to actually break the wmagent dockerfile into two (wmagent-base and wmagent), this way we can actually benefit from a more reproducible build as well. I might give it a try during the weekend.
@amaltaro I am trying to understand the actual origin of this error.
I do not happen to find any .dockerignore file in my fork of the CMSKubernetes repository. And I am not quite sure whether splitting the image in two pieces would help in resolving this one. Such an approach may of course be beneficial in terms of image size etc., but I cannot see how the situation would change when it comes to this error.
Can you please add a pointer or two on where and how exactly we set up all the CI/CD processes? Do we have documentation on those?
@todor-ivanov as mentioned in our weekly meeting, the issue comes from the fact that in the GH actions we simply download the Dockerfile and try to build the image. As expected, it fails because we do not have the auxiliary scripts in the same directory (different than when we run it manually from a clone of CMSKubernetes).
The docker build workflow can be found in: https://github.com/dmwm/WMCore/blob/master/.github/workflows/docker_images_template.yaml#L28-L33
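The failure mode described above (only the Dockerfile is downloaded, so the sources named in ADD are absent from the build context) can be checked before calling docker build. The following is a minimal sketch under stated assumptions: `missing_build_sources` is a hypothetical helper, and it deliberately ignores line continuations, the JSON array form, and `COPY --from=...` stages.

```python
import os
import tempfile


def missing_build_sources(dockerfile_text, context_dir):
    """Return ADD/COPY source paths that do not exist in the build context."""
    missing = []
    for raw in dockerfile_text.splitlines():
        parts = raw.strip().split()
        if not parts or parts[0].upper() not in ("ADD", "COPY"):
            continue
        # Sources copied from another build stage are not read from the context.
        if any(p.startswith("--from=") for p in parts[1:]):
            continue
        # Drop flags such as --chown=...; the last argument is the destination.
        args = [p for p in parts[1:] if not p.startswith("--")]
        for src in args[:-1]:
            # Remote URLs are fetched by Docker and need no local file.
            if src.startswith(("http://", "https://")):
                continue
            if not os.path.exists(os.path.join(context_dir, src)):
                missing.append(src)
    return missing


# The ADD line is the one from the failing build log; the directory is a scratch context.
dockerfile = "ADD bin $WMA_DEPLOY_DIR/bin\n"
with tempfile.TemporaryDirectory() as ctx:
    print(missing_build_sources(dockerfile, ctx))  # context holds nothing but the Dockerfile
    os.mkdir(os.path.join(ctx, "bin"))
    print(missing_build_sources(dockerfile, ctx))  # context now contains bin/
```

A CI step could fail early with a readable message when the returned list is non-empty, instead of surfacing the error at step 34 of the build.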
Hi @amaltaro, Thanks for providing this really crucial line in your previous comment. Indeed, downloading just a Dockerfile from the upstream CMSKubernetes repository could be the reason for the missing substructure of directories (meaning bin and etc), which in this case are needed. Do you happen to know the quick answer to:
dmwm area?
Dockerfile. This seems quite restrictive to me. It is basically preventing us from doing any ADD command from inside any of the Dockerfiles managing the image creation for WMCore/WMAgent services/components.
Saying that, I am now even more confident that splitting the image creation in two steps with this PR: https://github.com/dmwm/CMSKubernetes/pull/1394 is not the solution of this exact problem. Splitting the image in two is beneficial in many other ways and I do support this. I really like this approach. But for solving the current error in question I suggest we fix the CI/CD actions.
If we insist on using external tools inside our CI/CD pipeline and GitHub actions instead of git or any other mechanism provided internally by GitHub itself (if any), I may suggest a quick improvement by swapping web-based tools like curl with subversion, and downloading a single directory. This [1] will give us the expected result. NOTICE the swap of tree/master with trunk.
There is more on the topic in this stackoverflow discussion: https://stackoverflow.com/a/18194523
[1]
$ svn checkout https://github.com/dmwm/CMSKubernetes/trunk/docker/pypi/wmagent
A wmagent/Dockerfile
A wmagent/Dockerfile.dist
A wmagent/README.md
A wmagent/bin
A wmagent/bin/manage
A wmagent/etc
A wmagent/etc/local.ini
A wmagent/etc/my.cnf
A wmagent/etc/rucio.cfg
A wmagent/install.sh
A wmagent/run.sh
A wmagent/wmagent-docker-build.sh
A wmagent/wmagent-docker-run.sh
Checked out revision 5881.
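The swap of tree/master with trunk shown in [1] can be expressed as a small URL rewrite. This is a minimal sketch, assuming a hypothetical helper `tree_url_to_svn_trunk` and the default branch name master. It only builds the command and does not run it, and it depends on GitHub serving the repository over Subversion, which should be verified before relying on it.

```python
import shlex
from urllib.parse import urlsplit, urlunsplit


def tree_url_to_svn_trunk(url, branch="master"):
    """Rewrite .../<owner>/<repo>/tree/<branch>/<subdir> to .../<owner>/<repo>/trunk/<subdir>."""
    parts = urlsplit(url)
    segs = parts.path.strip("/").split("/")
    if len(segs) < 4 or segs[2] != "tree" or segs[3] != branch:
        raise ValueError(f"not a tree/{branch} URL: {url}")
    new_path = "/" + "/".join(segs[:2] + ["trunk"] + segs[4:])
    return urlunsplit((parts.scheme, parts.netloc, new_path, "", ""))


def svn_checkout_command(tree_url):
    """Return the argument list for checking out a single directory."""
    return ["svn", "checkout", tree_url_to_svn_trunk(tree_url)]


cmd = svn_checkout_command(
    "https://github.com/dmwm/CMSKubernetes/tree/master/docker/pypi/wmagent"
)
print(shlex.join(cmd))
```

Returning an argument list rather than a shell string keeps the command safe to hand to `subprocess.run` without quoting concerns.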
@todor-ivanov Todor, the curl tool doesn't bother me too much, we are using Ubuntu image for the GH actions and it should be fine.
However, the comment you made here https://github.com/dmwm/CMSKubernetes/pull/1393/files#r1251146936 about docker build argument is very helpful and I totally support it. That's how we should have got started with this!
Having said that, here is a WMCore PR updating the GH action workflow: https://github.com/dmwm/WMCore/pull/11638 and https://github.com/dmwm/CMSKubernetes/pull/1394
Lastly, a new image for wmagent-base:pypi-20230703 has been uploaded to CERN registry.
Hi @amaltaro, Thanks for adopting this approach for passing the $TAG as an external parameter to the Docker build command. This is indeed the proper approach, in my opinion as well.
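As an illustration of passing the tag from outside, a CI step could assemble the build command along these lines. This is only a sketch: `docker_build_command`, the image name, and the context directory are hypothetical, and it assumes the Dockerfile declares a matching `ARG TAG`.

```python
import shlex


def docker_build_command(tag, image="wmagent", context="."):
    """Return the docker build argument list with TAG supplied as a build argument."""
    if not tag:
        raise ValueError("tag must be non-empty")
    return [
        "docker", "build",
        "--build-arg", f"TAG={tag}",   # consumed by `ARG TAG` in the Dockerfile
        "--tag", f"{image}:{tag}",     # name of the resulting image
        context,                       # build context; must contain bin/, etc/ and so on
    ]


# The tag value is taken from the paths in the startup log, purely as an example.
print(shlex.join(docker_build_command("2.2.1rc2")))
```

With the tag outside the Dockerfile, the same file can build any release without being edited or regenerated in the pipeline.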
But I do not understand this argument:
the curl tool doesn't bother me too much, we are using Ubuntu image for the GH actions and it should be fine.
I did not object to the usage of the curl command because I did not like it, or because it was somehow related to the type of image we are using inside the CI/CD pipeline. I objected to it because it lacks a way of recursively fetching contents from a GitHub project, given a subdirectory of the project. In this case these subdirectories are docker/pypi/*. This will prevent all those Dockerfiles from using any ADD command. And indeed, if one goes there and checks them, one would immediately notice that none of them contains ADD.
My point is - we should not violently mutilate the Docker containerization system functionality, just because we do not use the proper tool capable of doing recursive operations in our CI/CD pipeline.
(issue description has been completely refactored) Impact of the new feature WMAgent
Is your feature request related to a problem? Please describe. This is one of the first steps towards running WMAgent in a containerized environment.
Describe the solution you'd like With this issue, we are supposed to deliver:
Describe alternatives you've considered None
Additional context Depends on: https://github.com/dmwm/WMCore/issues/11565 The knowledge obtained must be propagated to: https://github.com/dmwm/WMCore/issues/11570 Part of the following meta issue: https://github.com/dmwm/WMCore/issues/11314