dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

Build a WMAgent docker image #8797

Closed amaltaro closed 1 year ago

amaltaro commented 6 years ago

(issue description has been completely refactored)

Impact of the new feature
WMAgent

Is your feature request related to a problem? Please describe.
This is one of the first steps towards running WMAgent in a containerized environment.

Describe the solution you'd like
With this issue, we are supposed to deliver:

Describe alternatives you've considered
None

Additional context
Depends on: https://github.com/dmwm/WMCore/issues/11565
The knowledge obtained must be propagated to: https://github.com/dmwm/WMCore/issues/11570
Part of the following meta issue: https://github.com/dmwm/WMCore/issues/11314

ericvaandering commented 6 years ago

Correct. There are a number of containers here: https://github.com/dmwm/Docker

They might at least provide a starting point. However, these don't really follow the true containerization model. They tend to try to set up a "WMAgent node", meaning that MySQL is embedded in the same container as the agent, etc. I could never use these for unit tests, as the I/O of keeping MySQL's database files inside the container was not good enough.

But, once you start to separate things out, you'll be better off.

The containers in that repo are all built automatically on CERN's GitLab infrastructure and made available. We can add new ones if needed.

bbockelm commented 6 years ago

Correct - it'd be a great start, but we would want to separate out the dependent services (database, HTCondor) into separate containers -- or bind-mount them from the host.
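
For illustration, a minimal sketch of such a separation on a single host, assuming stock Docker and the upstream mariadb image; container names, volume paths and the wmagent image name are placeholders rather than an agreed design:

# hypothetical: run the database as its own container with data on a host volume,
# and bind-mount the host HTCondor configuration into the agent container
docker run -d --name wma-mariadb \
    -e MARIADB_ROOT_PASSWORD=changeme \
    -v /data/srv/mysql:/var/lib/mysql \
    -p 3306:3306 \
    mariadb:10.6
docker run -d --name wmagent \
    -v /etc/condor:/etc/condor:ro \
    <wmagent-image>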

amaltaro commented 5 years ago

@goughes Erik, as we discussed today, I have some initial work on my own repo: https://github.com/amaltaro/WMCore-Docker

vkuznet commented 2 years ago

I provided an initial Dockerfile to build the docker image and it was successfully built; see details in https://github.com/dmwm/WMCore/issues/11310. The individual Dockerfiles can be found here: https://github.com/dmwm/CMSKubernetes/tree/master/docker/pypi
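
For reference, a hedged sketch of how one of those images might be built locally from a clone of the repository (the image name and tag below are placeholders; the actual build recipe lives in the Dockerfiles linked above):

# clone the repository and build the wmagent image from its pypi directory
git clone https://github.com/dmwm/CMSKubernetes.git
cd CMSKubernetes/docker/pypi/wmagent
docker build -t local/wmagent:test .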

vkuznet commented 2 years ago

CMSKubernetes repo commits for WMCore Dockerfiles:

todor-ivanov commented 1 year ago

Hi @amaltaro @vkuznet, as this issue (actually part of it) is marked quite high in the T0 section of the quarterly planning document, and we also agreed this should be turned into a meta issue, let me suggest splitting it into the following set of issues and start adding bits of it to the 2023/Q2 project board, so we can make progress on it:

Alan, please correct me if I have missed something here or if you do not agree with the suggested list. One thing is for sure, though: if we start with the first one, we will be able to achieve both:

todor-ivanov commented 1 year ago

Hi @amaltaro, to continue on this: we may actually already have a set of issues covering pieces of this plan as expressed in my previous comment. Here are the ones I found that could be related to the respective bullets above:

Some of those issues' descriptions may need refactoring to include missing pieces, but I am confident we can now easily create a single issue for the first step and turn this one into a full meta issue pointing to this set of sub-tasks. Please let me know what you think.

[1]

(WMCore.venv3) [user@unit02 WMCore.venv3]$ pip install wmagent
Collecting wmagent
  Downloading wmagent-2.2.0.2.tar.gz (1.2 MB)
     |████████████████████████████████| 1.2 MB 3.6 MB/s            
  Preparing metadata (setup.py) ... done
Collecting Cheetah3~=3.2.6.post1
...
Successfully built wmagent
Installing collected packages: cffi, pynacl, cryptography, bcrypt, pyrsistent, pyparsing, pycurl, MarkupSafe, docutils, contextlib2, SQLAlchemy, Sphinx, pyzmq, psutil, mysqlclient, httplib2, htcondor, cx-Oracle, coverage, CherryPy, Cheetah3, wmagent
...

(WMCore.venv3) [user@unit02 WMCore.venv3]$ ls -1 lib/python3.6/site-packages/WMComponent/
AgentStatusWatcher
AnalyticsDataCollector
ArchiveDataReporter
DBS3Buffer
ErrorHandler
__init__.py
JobAccountant
JobArchiver
JobCreator
JobStatusLite
JobSubmitter
JobTracker
JobUpdater
__pycache__
RetryManager
RucioInjector
TaskArchiver
WorkQueueManager
amaltaro commented 1 year ago

Thank you for looking into this, Todor.

Valentin and I had some discussion today on this project and here is a meta-issue and all its 10 sub-tasks: https://github.com/dmwm/WMCore/issues/11314

If you feel like GH issues need to be updated, please comment on each of them. If you feel like something is missing, please create a new one.

todor-ivanov commented 1 year ago

Thanks @amaltaro. The two issues from the list in the meta-issue https://github.com/dmwm/WMCore/issues/11314 that I am interested in and planning to tackle in the short term are the following:

Those go hand in hand and could, I believe, be solved simultaneously, and I am quite close to that point already. I exchanged some information with @germanfgv and he helped me by sharing his experience with the T0 containers (thanks, German, for that). We also have two previous efforts to produce a fully functional Docker container for the agent:

Provided by @vkuznet and @goughes. Both of those were delayed and did not make it into production because of dependencies on solving other pieces of the deployment-process migration (like moving away from RPM-based deployment scripts or applying configuration files at runtime).

The way I foresee handling the problem here (and I am already working on it) is as a combination of the two approaches, splitting the deploy-wmagent.sh script into two pieces:

Some really good examples of how those mount points should be used for credential propagation are listed in Erik's documentation for the previous docker image: https://github.com/dmwm/CMSKubernetes/tree/master/docker/wmagent#readme
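
As an illustration only (the authoritative list of mount points is in the README linked above), credential propagation via bind mounts could look roughly like this; the host paths and the image name are assumptions:

# hypothetical bind mounts: host certificates, the admin area holding WMAgent.secrets,
# and the agent work area are exposed to the container instead of being baked into the image
docker run -d --name wmagent \
    -v /data/certs:/data/certs:ro \
    -v /data/admin/wmagent:/data/admin/wmagent \
    -v /data/srv/wmagent:/data/srv/wmagent \
    <wmagent-image>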

todor-ivanov commented 1 year ago

Just a heads up.

In order to continue on this, some additional requests need to be made to the VOC and to the SI team, because we will need docker-ce installed on at least one of our test agents (if not all of them). I am about to ask for permission in the relevant channels.

FYI: @vkuznet @amaltaro @khurtado

amaltaro commented 1 year ago

@todor-ivanov

https://github.com/dmwm/WMCore/issues/1156

You said this is one of the issues you would like to consider, but it is not related to what we are discussing here. Please let us know the correct issue number that you wanted to mention.

... because we will need docker-ce installed on a...

I've seen your email. For proper recording, can you please clarify which node you actually requested this package to be installed on?

todor-ivanov commented 1 year ago

Hi @amaltaro,

Please let us know the correct issue number that you wanted to mention.

Sorry for the typo; the correct issue is:

For proper recording, can you please clarify which node you actually requested this package to be installed on?

Yes, the docker-ce package was installed on vocms0260 using the puppet module provided by CERN IT, and the non-root user allowed access to the docker engine is cmst1.
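
For the record, non-root access to the Docker engine is normally granted by adding the account to the docker group (handled here by the CERN IT puppet module); a quick sanity check from the cmst1 account could be:

# verify group membership and that the engine is reachable without sudo
id cmst1
docker info > /dev/null && echo "docker engine reachable"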

todor-ivanov commented 1 year ago

Just to mention this here so we can correlate both issues: I believe that in the process of solving the current issue we will get answers to most of the questions asked in https://github.com/dmwm/WMCore/issues/11570. Once I have more info, I will update the other issue as well.

FYI: @khurtado @amaltaro @vkuznet

todor-ivanov commented 1 year ago

Ok, now this is just a status report.

We have reached an interim state here.

With my latest commits in https://github.com/dmwm/CMSKubernetes/pull/1364 I can already successfully start (modulo a few components) a WMAgent from a docker image deployed from the wmagent package uploaded to PyPI. The minimum package version that works is wmagent==2.2.1rc2, and no earlier.
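
Note that pip does not pick up release candidates by default, so reproducing this requires requesting the pre-release explicitly, for example:

# a plain "pip install wmagent" skips pre-releases; pin the release candidate explicitly
pip install wmagent==2.2.1rc2
# or allow pre-releases within a range: pip install --pre "wmagent>=2.2.1rc2"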

The configuration is as follows:

Here is a complete and clean startup log: [1]. To start the agent I used the newly created wrapper ./wmagent-docker-run.sh & without specifying any runtime parameters, i.e. using only defaults.

Here is the agent status from inside the container: [2]. Unfortunately, some of the components were left down during agent startup because of a broken dependency inside the python package, which I am investigating right now. Here is the log from manage start-agent: [3].

NOTE: Even though we have reached a milestone - being able to run and manage a basic container deployed from PyPI - there is some more work to be done before this PR, https://github.com/dmwm/CMSKubernetes/pull/1364, can be called ready for review, due to the several nasty hacks I had to implement so far. Here is the list of things to be done:

FYI: @amaltaro @vkuznet @khurtado

[3]

(WMAgent.dock) [cmst1@vocms0260:/data/srv/wmagent/current]$ manage start-agent
Starting WMAgent...
Checking default database connection... ok.
Starting components: ['WorkQueueManager', 'DBS3Upload', 'JobAccountant', 'JobCreator', 'JobSubmitter', 'JobTracker', 'JobStatusLite', 'JobUpdater', 'ErrorHandler', 'RetryManager', 'JobArchiver', 'TaskArchiver', 'AnalyticsDataCollector', 'ArchiveDataReporter', 'AgentStatusWatcher', 'RucioInjector']
Starting : WorkQueueManager
Starting WorkQueueManager as a daemon 
Log will be in /data/srv/wmagent/current/install/wmagentpy3/WorkQueueManager 
Waiting 1 seconds, to ensure daemon file is created

started with pid 2465
Starting : DBS3Upload
DBS3Upload.__init__
Starting DBS3Upload as a daemon 
Log will be in /data/srv/wmagent/current/install/wmagentpy3/DBS3Upload 
Waiting 1 seconds, to ensure daemon file is created

started with pid 2483
Starting : JobAccountant
Starting JobAccountant as a daemon 
Log will be in /data/srv/wmagent/current/install/wmagentpy3/JobAccountant 
Waiting 1 seconds, to ensure daemon file is created

started with pid 2501
Starting : JobCreator
JobCreator.__init__
Starting JobCreator as a daemon 
Log will be in /data/srv/wmagent/current/install/wmagentpy3/JobCreator 
Waiting 1 seconds, to ensure daemon file is created

started with pid 2514
Starting : JobSubmitter
Starting JobSubmitter as a daemon 
Log will be in /data/srv/wmagent/current/install/wmagentpy3/JobSubmitter 
Waiting 1 seconds, to ensure daemon file is created

started with pid 2528
Starting : JobTracker
JobTracker.__init__
Starting JobTracker as a daemon 
Log will be in /data/srv/wmagent/current/install/wmagentpy3/JobTracker 
Waiting 1 seconds, to ensure daemon file is created

started with pid 2551
Starting : JobStatusLite
JobStatusLite.__init__
Starting JobStatusLite as a daemon 
Log will be in /data/srv/wmagent/current/install/wmagentpy3/JobStatusLite 
Waiting 1 seconds, to ensure daemon file is created

started with pid 2570
Starting : JobUpdater
Starting JobUpdater as a daemon 
Log will be in /data/srv/wmagent/current/install/wmagentpy3/JobUpdater 
Waiting 1 seconds, to ensure daemon file is created

started with pid 2584
Starting : ErrorHandler
Starting ErrorHandler as a daemon 
Log will be in /data/srv/wmagent/current/install/wmagentpy3/ErrorHandler 
Waiting 1 seconds, to ensure daemon file is created

started with pid 2600
Starting : RetryManager
Starting RetryManager as a daemon 
Log will be in /data/srv/wmagent/current/install/wmagentpy3/RetryManager 
Waiting 1 seconds, to ensure daemon file is created

started with pid 2614
Starting : JobArchiver
JobArchiver.__init__
Starting JobArchiver as a daemon 
Log will be in /data/srv/wmagent/current/install/wmagentpy3/JobArchiver 
Waiting 1 seconds, to ensure daemon file is created

started with pid 2628
Starting : TaskArchiver
Traceback (most recent call last):
  File "/data/srv/wmagent/current/install/wmagent/bin/wmcoreD", line 348, in <module>
    startup(config)
  File "/data/srv/wmagent/current/install/wmagent/bin/wmcoreD", line 221, in startup
    componentObject = factory.loadObject(classname = namespace, args = config)
  File "/usr/local/lib/python3.8/site-packages/WMCore/WMFactory.py", line 58, in loadObject
    module = __import__(module, globals(), locals(), [classname])
  File "/usr/local/lib/python3.8/site-packages/WMComponent/TaskArchiver/TaskArchiver.py", line 20, in <module>
    from WMComponent.TaskArchiver.CleanCouchPoller import CleanCouchPoller
  File "/usr/local/lib/python3.8/site-packages/WMComponent/TaskArchiver/CleanCouchPoller.py", line 28, in <module>
    from WMCore.DataStructs.MathStructs.DiscreteSummaryHistogram import DiscreteSummaryHistogram
ModuleNotFoundError: No module named 'WMCore.DataStructs.MathStructs'
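
The traceback above hints that the WMCore.DataStructs.MathStructs sub-package did not make it into the published wmagent package; a quick, hedged way to confirm this from inside the container would be:

# check whether the sub-package was shipped with the installed wmagent package
pip show -f wmagent | grep -i MathStructs
ls /usr/local/lib/python3.8/site-packages/WMCore/DataStructs/
python3 -c "import WMCore.DataStructs.MathStructs"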

[2]

cmst1@vocms0260:/data/srv/wmagent/current $ docker exec -it  wmagent /bin/bash
(WMAgent.dock) [cmst1@vocms0260:/data/srv/wmagent/current]$ 

(WMAgent.dock) [cmst1@vocms0260:/data/srv/wmagent/current]$ ps auxf 
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
cmst1       2628  0.0  0.7 362592 52284 ?        Sl   15:15   0:00 python /data/srv/wmagent/current/install/wmagent/bin/wmcoreD --start --config=/data/srv/wmagent/2.2.1rc2/config/wmagentpy3/
cmst1       2614  0.0  0.6 361824 50868 ?        Sl   15:15   0:00 python /data/srv/wmagent/current/install/wmagent/bin/wmcoreD --start --config=/data/srv/wmagent/2.2.1rc2/config/wmagentpy3/
cmst1       2600  0.0  0.7 362528 51556 ?        Sl   15:15   0:00 python /data/srv/wmagent/current/install/wmagent/bin/wmcoreD --start --config=/data/srv/wmagent/2.2.1rc2/config/wmagentpy3/
cmst1       2584  0.0  0.6 362080 51100 ?        Sl   15:15   0:00 python /data/srv/wmagent/current/install/wmagent/bin/wmcoreD --start --config=/data/srv/wmagent/2.2.1rc2/config/wmagentpy3/
cmst1       2570  0.0  0.6 362080 51008 ?        Sl   15:15   0:01 python /data/srv/wmagent/current/install/wmagent/bin/wmcoreD --start --config=/data/srv/wmagent/2.2.1rc2/config/wmagentpy3/
cmst1       2551  0.0  0.6 362080 51104 ?        Sl   15:15   0:01 python /data/srv/wmagent/current/install/wmagent/bin/wmcoreD --start --config=/data/srv/wmagent/2.2.1rc2/config/wmagentpy3/
cmst1       2528  0.0  0.7 363360 52848 ?        Sl   15:15   0:01 python /data/srv/wmagent/current/install/wmagent/bin/wmcoreD --start --config=/data/srv/wmagent/2.2.1rc2/config/wmagentpy3/
cmst1       2514  0.0  0.6 361568 50580 ?        Sl   15:15   0:00 python /data/srv/wmagent/current/install/wmagent/bin/wmcoreD --start --config=/data/srv/wmagent/2.2.1rc2/config/wmagentpy3/
cmst1       2501  0.0  0.6 361568 50812 ?        Sl   15:15   0:00 python /data/srv/wmagent/current/install/wmagent/bin/wmcoreD --start --config=/data/srv/wmagent/2.2.1rc2/config/wmagentpy3/
cmst1       2483  0.0  0.7 365380 55516 ?        Sl   15:15   0:02 python /data/srv/wmagent/current/install/wmagent/bin/wmcoreD --start --config=/data/srv/wmagent/2.2.1rc2/config/wmagentpy3/
cmst1       2465  0.8  0.8 586604 62072 ?        Sl   15:15   0:18 python /data/srv/wmagent/current/install/wmagent/bin/wmcoreD --start --config=/data/srv/wmagent/2.2.1rc2/config/wmagentpy3/
cmst1       2061  0.0  0.0   6040  2400 pts/0    Ss   15:05   0:00 /bin/bash
cmst1       4289  0.0  0.0   8636  1656 pts/0    R+   15:53   0:00  \_ ps auxf
cmst1          1  0.0  0.0   6436  2612 ?        Ss   15:04   0:00 /bin/bash ./run.sh
cmst1        629  0.0  0.0   2468   760 ?        S    15:04   0:00 /bin/sh /usr/bin/mysqld_safe --defaults-extra-file=/data/srv/wmagent/2.2.1rc2/config/mysql/my.cnf --datadir=/data/srv/wmage
cmst1        821  0.8 12.3 8606620 902992 ?      Sl   15:04   0:24  \_ /usr/sbin/mariadbd --defaults-extra-file=/data/srv/wmagent/2.2.1rc2/config/mysql/my.cnf --basedir=/usr --datadir=/data/
cmst1       2058  0.0  0.0   6436  1464 ?        S    15:05   0:00 /bin/bash ./run.sh
cmst1       4270  0.0  0.0   4272   628 ?        S    15:53   0:00  \_ sleep 10

(WMAgent.dock) [cmst1@vocms0260:/data/srv/wmagent/current]$ manage status
+ Couch Status:
++ {"couchdb":"Welcome","uuid":"08e24bb5fe541033900035c5f2cf85fc","version":"1.6.1","vendor":{"version":"1.6.1","name":"The Apache Software Foundation"}}
+ Status of MySQL
++ MYSQL running with process: 821
++ Uptime: 699 Threads: 17 Questions: 5284 Slow queries: 0 Opens: 139 Open tables: 46 Queries per second avg: 7.559
Status of WMAgent:
Checking default database connection... ok.
Status components: ['WorkQueueManager', 'DBS3Upload', 'JobAccountant', 'JobCreator', 'JobSubmitter', 'JobTracker', 'JobStatusLite', 'JobUpdater', 'ErrorHandler', 'RetryManager', 'JobArchiver', 'TaskArchiver', 'AnalyticsDataCollector', 'ArchiveDataReporter', 'AgentStatusWatcher', 'RucioInjector']
Component:WorkQueueManager Running:2465
Component:DBS3Upload Running:2483
Component:JobAccountant Running:2501
Component:JobCreator Running:2514
Component:JobSubmitter Running:2528
Component:JobTracker Running:2551
Component:JobStatusLite Running:2570
Component:JobUpdater Running:2584
Component:ErrorHandler Running:2600
Component:RetryManager Running:2614
Component:JobArchiver Running:2628
Component:TaskArchiver Not Running
Component:AnalyticsDataCollector Not Running
Component:ArchiveDataReporter Not Running
Component:AgentStatusWatcher Not Running
Component:RucioInjector Not Running

[1]

cmst1@vocms0260:~/CMSKubernetes/docker/pypi/wmagent $ ./wmagent-docker-run.sh & 
[1] 2210583
=======================================================
Starting WMAgent with the following initialisation data:
-------------------------------------------------------
 - WMAgent Version            : 2.2.1rc2
 - WMAgent User               : cmst1
 - WMAgent Root path          : /data
 - WMAgent Host               : vocms0260.cern.ch
 - WMAgent TeamName           : testbed-vocms0260
 - WMAgent Number             : 0
 - WMAgent CentralServices    : cmsweb-testbed.cern.ch
 - WMAgent Relational DB type : mysql
 - Python  Verson             : Python 3.8.16
 - Python  Module path        : /usr/local/lib/python3.8/site-packages
=======================================================

-------------------------------------------------------
Start: Performing basic setup checks...

Done: Performing basic setup checks...
-------------------------------------------------------

check_wmasecrets: Checking for changes in the WMAgent.secrets file
check_wmasecrets: No change fund.
-------------------------------------------------------
Start: Performing checks for successful Docker initialisation steps...
WMA_BUILD_ID: 68701503249744219753aea0c5924c8b274aa00de917f8a04144aae8c8972b47
dockerInitId: /data/admin/wmagent/hostadmin/.dockerInit
/data/srv/wmagent/current/config/.dockerInit
/data/srv/wmagent/current/config/couchdb/.dockerInit
/data/srv/wmagent/current/config/mysql/.dockerInit
/data/srv/wmagent/current/config/rucio/.dockerInit
/data/srv/wmagent/current/config/wmagent/.dockerInit
/data/srv/wmagent/current/install/.dockerInit
ERROR
-------------------------------------------------------
Start: Performing Docker image to Host initialisation steps
deploy_to_host: Initialise install
deploy_to_host: Initialise config
deploy_to_host: config service=wmagent
deploy_to_host: config service=mysql
deploy_to_host: config service=couchdb
deploy_to_host: config service=rucio
deploy_to_host: Initialise WMAgent.secrets
deploy_to_host: checking /data/admin/wmagent/hostadmin/WMAgent.secrets
Done: Performing Docker image to Host initialisation steps
-------------------------------------------------------
-------------------------------------------------------
Start: Performing local Docker image initialisation steps
deploy_to_container: Try Copying the host WMAgent.secrets file into the container admin area
deploy_to_container: Done
deploy_to_container: Updating WMAgent.secrets file with the current host's details
deploy_to_container: Double checking the final WMAgent.secrets file
deploy_to_container: Checking Certificates and Proxy
deploy_to_container: Checking Certificate lifetime:
deploy_to_container: Certifficate end date: Sep  7 12:04:12 2023 GMT
deploy_to_container: Checking myproxy lifetime:
deploy_to_container: myproxy end date: May 17 13:03:40 2023 GMT
deploy_to_container: OK
Done: Performing local Docker image initialisation steps
-------------------------------------------------------
-------------------------------------------------------
Start: Performing activate_agent
activate_agent: triggered.
Done: Performing activate_agent
-------------------------------------------------------
-------------------------------------------------------
Start: Performing start_services
Starting Services...
starting couch...
CouchDB has not been initialised... running pre initialisation
Initialising CouchDB on 127.0.0.1:5984...
  With installation directory: /data/srv/wmagent/2.2.1rc2/install/couchdb
  With configuration directory: /data/srv/wmagent/2.2.1rc2/config/couchdb
Which couchdb:   With installation directory: /data/srv/wmagent/2.2.1rc2/install/couchdb
  With configuration directory: /data/srv/wmagent/2.2.1rc2/config/couchdb
CouchDB has not been initialised... running post initialisation
Starting mysql...
MySQL has not been initialised... running pre initialisation
Installing the mysql database area...
Installing MariaDB/MySQL system tables in '/data/srv/wmagent/2.2.1rc2/install/mysql/database' ...
OK

To start mariadbd at boot time you have to copy
support-files/mariadb.service to the right place for your system

Two all-privilege accounts were created.
One is root@localhost, it has no password, but you need to
be system 'root' user to connect. Use, for example, sudo mysql
The second is cmst1@localhost, it has no password either, but
you need to be the system 'cmst1' user to connect.
After connecting you can set the password, if you would need to be
able to connect as any of these users with a password and without sudo

See the MariaDB Knowledgebase at https://mariadb.com/kb

You can start the MariaDB daemon with:
cd '/usr' ; /usr/bin/mariadb-safe --datadir='/data/srv/wmagent/2.2.1rc2/install/mysql/database'

You can test the MariaDB daemon with mysql-test-run.pl
cd '/usr/share/mysql/mysql-test' ; perl mariadb-test-run.pl

Please report any problems at https://mariadb.org/jira

The latest information about MariaDB is available at https://mariadb.org/.

Consider joining MariaDB's strong and vibrant community:
https://mariadb.org/get-involved/

starting mysqld_safe...
Checking MySQL Socket file exists...
Socket file exists: /data/srv/wmagent/2.2.1rc2/install/mysql/logs/mysql.sock
MySQL has not been initialised... running post initialisation
Installing the mysql schema...
Socket file exists, proceeding with schema install...
Installing WMAgent Database: wmagent
Checking Server connection...
Connection OK
Done: Performing start_services
-------------------------------------------------------
-------------------------------------------------------
Start: Performing init_agent
init_agent: triggered.
Initialising Agent...
DEBUG:root:Log file ready
DEBUG:root:Using SQLAlchemy v.1.4.48
INFO:root:Instantiating base WM DBInterface
DEBUG:root:Tables for WMCore.WMBS created
DEBUG:root:Tables for WMCore.Agent.Database created
DEBUG:root:Tables for WMComponent.DBS3Buffer created
DEBUG:root:Tables for WMCore.BossAir created
DEBUG:root:Tables for WMCore.ResourceControl created
checking default database connection
default database connection tested
Installing FWJRDump into wmagent_jobdump/fwjrs
Installing FWJRDump app into database: http://localhost:5984/wmagent_jobdump%2Ffwjrs
Installing JobDump into wmagent_jobdump/jobs
Installing JobDump app into database: http://localhost:5984/wmagent_jobdump%2Fjobs
Installing WMStatsAgent into wmagent_summary
Installing WMStatsAgent app into database: http://localhost:5984/wmagent_summary
Installing SummaryStats into stat_summary
Installing SummaryStats app into database: http://localhost:5984/stat_summary
Setting up cron jobs for the job dump.
Installing WorkQueue into workqueue
Installing WorkQueue app into database: http://localhost:5984/workqueue
Installing WorkQueue into workqueue_inbox
Installing WorkQueue app into database: http://localhost:5984/workqueue_inbox
Done: Performing init_agent
-------------------------------------------------------
-------------------------------------------------------
Start: Performing agent_tweakconfig
agent_tweakconfig: triggered.
agent_tweakconfig: Making agent configuration changes needed for Docker
agent_tweakconfig: Making other agent configuration changes
Done: Performing agent_tweakconfig
-------------------------------------------------------
-------------------------------------------------------
Start: Performing agent_resource_control
agent_resource_control: triggered.
agent_resource_control: Populating resource-control
agent_resource_control: Adding only T1 and T2 sites to resource-control...
Executing wmagent-resource-control --add-T1s --plugin=SimpleCondorPlugin --pending-slots=50 --running-slots=50 --down ...
Retrieved 7 maps from https://cms-cric.cern.ch/
Adding T1_US_FNAL to the resource control db...
Adding T1_DE_KIT to the resource control db...
Adding T1_ES_PIC to the resource control db...
Adding T1_FR_CCIN2P3 to the resource control db...
Adding T1_IT_CNAF to the resource control db...
Adding T1_RU_JINR to the resource control db...
Adding T1_UK_RAL to the resource control db...
Retrieved 16 PNNs from https://cms-cric.cern.ch/
Executing wmagent-resource-control --add-T2s --plugin=SimpleCondorPlugin --pending-slots=50 --running-slots=50 --down ...
Retrieved 52 maps from https://cms-cric.cern.ch/
Adding T2_CH_CERN_P5 to the resource control db...
Adding T2_CH_CERN_HLT to the resource control db...
Adding T2_CH_CSCS_HPC to the resource control db...
Adding T2_FR_GRIF_LLR to the resource control db...
Adding T2_FR_GRIF_IRFU to the resource control db...
Adding T2_US_Vanderbilt to the resource control db...
Adding T2_PL_Cyfronet to the resource control db...
Adding T2_AT_Vienna to the resource control db...
Adding T2_BE_IIHE to the resource control db...
Adding T2_BE_UCL to the resource control db...
Adding T2_BR_SPRACE to the resource control db...
Adding T2_BR_UERJ to the resource control db...
Adding T2_CH_CERN to the resource control db...
Adding T2_CH_CSCS to the resource control db...
Adding T2_CN_Beijing to the resource control db...
Adding T2_DE_DESY to the resource control db...
Adding T2_DE_RWTH to the resource control db...
Adding T2_EE_Estonia to the resource control db...
Adding T2_ES_CIEMAT to the resource control db...
Adding T2_ES_IFCA to the resource control db...
Adding T2_FI_HIP to the resource control db...
Adding T2_FR_GRIF to the resource control db...
Adding T2_FR_IPHC to the resource control db...
Adding T2_GR_Ioannina to the resource control db...
Adding T2_HU_Budapest to the resource control db...
Adding T2_IN_TIFR to the resource control db...
Adding T2_IT_Bari to the resource control db...
Adding T2_IT_Legnaro to the resource control db...
Adding T2_IT_Pisa to the resource control db...
Adding T2_IT_Rome to the resource control db...
Adding T2_KR_KISTI to the resource control db...
Adding T2_PK_NCP to the resource control db...
Adding T2_PL_Swierk to the resource control db...
Adding T2_PT_NCG_Lisbon to the resource control db...
Adding T2_RU_IHEP to the resource control db...
Adding T2_RU_INR to the resource control db...
Adding T2_RU_ITEP to the resource control db...
Adding T2_RU_JINR to the resource control db...
Adding T2_TR_METU to the resource control db...
Adding T2_TW_NCHC to the resource control db...
Adding T2_UA_KIPT to the resource control db...
Adding T2_UK_London_Brunel to the resource control db...
Adding T2_UK_London_IC to the resource control db...
Adding T2_UK_SGrid_Bristol to the resource control db...
Adding T2_UK_SGrid_RALPP to the resource control db...
Adding T2_US_Caltech to the resource control db...
Adding T2_US_Florida to the resource control db...
Adding T2_US_MIT to the resource control db...
Adding T2_US_Nebraska to the resource control db...
Adding T2_US_Purdue to the resource control db...
Adding T2_US_UCSD to the resource control db...
Adding T2_US_Wisconsin to the resource control db...
Retrieved 51 PNNs from https://cms-cric.cern.ch/
Done: Performing agent_resource_control
-------------------------------------------------------
-------------------------------------------------------
Start: Performing agent_upload_config
agent_upload_config: triggered.
agent_upload_config: Tweaking central agent configuration befre uploading
agent_upload_config: Testbed agent, setting MaxRetries to 0...
*** Upload WMAgentConfig to AuxDB ***
Executing wmagent-upload-config {"MaxRetries":0} ...
Pushing the following agent configuration:
{'AgentDrainMode': False,
 'CondorJobsFraction': 0.75,
 'CondorOverflowFraction': 0.2,
 'DiskUseThreshold': 85,
 'IgnoreDisks': ['/mnt/ramdisk'],
 'MaxRetries': 0,
 'NoRetryExitCodes': [70,
                      73,
                      8001,
                      8006,
                      8009,
                      8023,
                      8026,
                      8501,
                      50660,
                      50661,
                      50664,
                      71102,
                      71104,
                      71105],
 'SpeedDrainConfig': {'CondorPriority': {'Enabled': False, 'Threshold': 500},
                      'EnableAllSites': {'Enabled': False, 'Threshold': 200},
                      'NoJobRetries': {'Enabled': False, 'Threshold': 200}},
 'SpeedDrainMode': False,
 'UserDrainMode': False}
Done: Performing agent_upload_config
-------------------------------------------------------
-------------------------------------------------------
Start: Performing checks for successful Docker initialisation steps...
WMA_BUILD_ID: 68701503249744219753aea0c5924c8b274aa00de917f8a04144aae8c8972b47
dockerInitId: 68701503249744219753aea0c5924c8b274aa00de917f8a04144aae8c8972b47
OK
-------------------------------------------------------
Start: Performing local Docker image initialisation steps
deploy_to_container: Try Copying the host WMAgent.secrets file into the container admin area
deploy_to_container: Done
deploy_to_container: Updating WMAgent.secrets file with the current host's details
deploy_to_container: Double checking the final WMAgent.secrets file
deploy_to_container: Checking Certificates and Proxy
deploy_to_container: Checking Certificate lifetime:
deploy_to_container: Certifficate end date: Sep  7 12:04:12 2023 GMT
deploy_to_container: Checking myproxy lifetime:
deploy_to_container: myproxy end date: May 17 13:03:40 2023 GMT
deploy_to_container: OK
Done: Performing local Docker image initialisation steps
-------------------------------------------------------
-------------------------------------------------------
Start: Performing start_services
Starting Services...
starting couch...
Which couchdb:   With installation directory: /data/srv/wmagent/2.2.1rc2/install/couchdb
  With configuration directory: /data/srv/wmagent/2.2.1rc2/config/couchdb
Starting mysql...
starting mysqld_safe...
Checking MySQL Socket file exists...
Socket file exists: /data/srv/wmagent/2.2.1rc2/install/mysql/logs/mysql.sock
Checking Server connection...
Connection OK
Done: Performing start_services
-------------------------------------------------------
-------------------------------------------------------
Start: Performing start_agent
-------------------------------------------------------
Start sleeping now ...zzz...
vkuznet commented 1 year ago

@todor-ivanov, thanks for the update and all the details. They are extremely useful. Meanwhile, I want to make a few observations:

todor-ivanov commented 1 year ago

Hi @vkuznet, thanks for the feedback.

I think packaging MariaDB within WMAgent container is a mistake, and only leads to complexity at different levels

For the time being I did it that way just to assemble everything in one place, make it run, and deliver an agent ready for testing. At first glance this design may seem like a complication, but it actually simplifies things a lot and gives some pretty good benefits and flexibility, one of which is being able to run multiple docker agents on the same machine. It would also help later if we decide to separate the WMAgent from the schedd.

The initial plan was to use the database from the host and not deal with MariaDB in this PR, but there were basically two possibilities for connecting to it - through a socket or through the network interface. The socket is our default for the agent, but artificially sharing a socket between the host and a container is a security risk. In the end the database installation was really easy, and I deliberately kept the commit related to the MariaDB installation separate, so that it is all in one place and we can later move it into a different Dockerfile, which could either be inherited by the agent or run as a separate container on the same host. There are a few possible architectures, the three main ones of which I can list (all with their benefits and drawbacks):

Given that the relational database is a monolithic piece of the agent, which is under heavy stress and needs the least overhead, I do not see a good argument for why we should separate it from the agent. When it comes to Oracle, there are other benefits and optimizations that Oracle provides which should be worth paying the network price, but I am not sure how big that price is and whether it would be worth it with MariaDB as well. I have not been around when such scale tests were performed and the eventual overhead estimated, so this must be done before concluding anything with certainty here.

For the YUI part - I think we should completely get rid of it.

todor-ivanov commented 1 year ago

Oh, and I forgot to stress this enough. If we want to go towards real horizontal scaling, the place where we should slice is between the agent and the schedd, rather than thinking of breaking the whole agent into components. We can remake our internal agent design later in many ways, but the hard link between the agent and the schedd is the actual blocker preventing SI from separating from the WMAgent and implementing their own scheme for distributing the load among schedds, independently of the load in the agent. Those two loads have a different nature: even though they come from the same place, the system hits different limitations with different origins in the two. So it is also natural to think of implementing different scalability schemes at those levels - something that is marvelously done in the CRAB system, and we should try it here as well.

FYI: @vkuznet @amaltaro @khurtado @klannon

amaltaro commented 1 year ago

Todor, it's great that you managed to start up most of the services. Thank you for this summary.

I have a few remarks to make:

  1. From your logs, you seem to have installed an ancient WMAgent package to get CouchDB available in the host (you likely deployed the wmagent package instead of wmagentpy3). I would highly recommend that you remake this setup with the latest stable releases of CouchDB/wmagentpy3:
    ++ {"couchdb":"Welcome","uuid":"08e24bb5fe541033900035c5f2cf85fc","version":"1.6.1","vendor":{"version":"1.6.1","name":"The Apache Software Foundation"}}
  2. I have the impression that your current setup initializes services twice. During the init process, it already starts up mariadb and couchdb, but this line seems to repeat it:
    Performing start_services
    ...
  3. About MariaDB, I agree with Valentin that we should try to separate it from the WMAgent image (and this is what we have been discussing so far, having 3 containers in total: MariaDB, CouchDB, WMAgent; a rough sketch of that layout follows after this list). In addition, when we run WMAgent at CERN we do not actually need MariaDB. If you are concerned that T0 will not manage to run multiple replays on the same node, I think we will have to come back to that in the future. Having the ability to stop components without affecting services is extremely important IMO. In any case, it is fine to keep MariaDB in for now, so that you can run tests that are as realistic as possible.
  4. It might be beneficial to isolate components in their own docker containers, but that is definitely not how we planned to go at this stage. I also think it's important that we start this way, to reduce the bootstrap overhead in this new model and to get experience with this system.
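
For illustration only, here is a rough sketch of that three-container layout on a single host; the image versions, docker network, volume paths and wmagent image name are placeholders, not the agreed design:

# hypothetical layout: MariaDB, CouchDB and WMAgent as sibling containers on one host
docker network create wma-net
docker run -d --name mariadb --network wma-net \
    -v /data/srv/mysql:/var/lib/mysql mariadb:10.6
docker run -d --name couchdb --network wma-net \
    -v /data/srv/couchdb:/opt/couchdb/data couchdb:3.2
docker run -d --name wmagent --network wma-net \
    -v /data/srv/wmagent:/data/srv/wmagent <wmagent-image>
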
amaltaro commented 1 year ago

This has been addressed with https://github.com/dmwm/CMSKubernetes/pull/1393. Thanks Todor! Closing this one out.

amaltaro commented 1 year ago

Even though we considered the GH Actions workflow, there was still a detail that was missed, and the CMSKubernetes PR actually broke our workflow: https://github.com/dmwm/WMCore/actions/runs/5427386861/jobs/9870613631#step:5:858

The important lines when building the docker image are:

  ---> 4389876fc8ff
Step 34/43 : ADD bin $WMA_DEPLOY_DIR/bin
ADD failed: file not found in build context or excluded by .dockerignore: stat bin: file does not exist
Error: Process completed with exit code 1.

I think the best option here would be to break the wmagent Dockerfile into two (wmagent-base and wmagent); this way we can also benefit from a more reproducible build. I might give it a try during the weekend.
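
A hedged sketch of what that two-step build might look like; the Dockerfile names, registry path and tags below are placeholders (the wmagent-base naming follows the image mentioned later in this thread):

# step 1: build the slow-changing base image (OS packages, services, python runtime)
docker build -f Dockerfile-base -t registry.cern.ch/cmsweb/wmagent-base:pypi-20230703 .
# step 2: build the thin wmagent layer on top of the pinned base, giving a more reproducible build
docker build -f Dockerfile -t registry.cern.ch/cmsweb/wmagent:2.2.2 .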

todor-ivanov commented 1 year ago

@amaltaro I am trying to understand the actual origin of this error.

I cannot find any .dockerignore file in my fork of the CMSKubernetes repository, and I am not quite sure whether splitting the image into two pieces would help resolve this one here. Such an approach may of course be beneficial in terms of image size, etc., but I cannot see how the situation would change when it comes to this error.

Can you please add a pointer or two on where and how exactly we set up all the CI/CD processes? Do we have documentation on those?

amaltaro commented 1 year ago

@todor-ivanov as mentioned in our weekly meeting, the issue comes from the fact that in the GH Actions we simply download the Dockerfile and try to build the image. As expected, it fails because we do not have the auxiliary scripts in the same directory (unlike when we run it manually from a clone of CMSKubernetes).

The docker build workflow can be found in: https://github.com/dmwm/WMCore/blob/master/.github/workflows/docker_images_template.yaml#L28-L33

todor-ivanov commented 1 year ago

Hi @amaltaro, thanks for providing this really crucial line in your previous comment. Indeed, downloading just the Dockerfile from the upstream CMSKubernetes repository could be the reason for the missing directory substructure (meaning bin and etc), which in this case is needed. Do you happen to know the quick answer to:

That said, I am now even more confident that splitting the image creation into two steps with this PR, https://github.com/dmwm/CMSKubernetes/pull/1394, is not the solution to this exact problem. Splitting the image in two is beneficial in many other ways and I do support it - I really like this approach. But for solving the current error in question, I suggest we fix the CI/CD actions.

If we insist on using external tools inside our CI/CD pipeline and GitHub Actions, instead of git or any other mechanism provided by GitHub itself (if any), I can suggest a quick improvement: swap web-based tools like curl for subversion and download a single directory. This [1] gives the expected result. NOTICE the swap of tree/master with trunk.

There is more on the topic in this stackoverflow discussion: https://stackoverflow.com/a/18194523

[1]

$ svn checkout https://github.com/dmwm/CMSKubernetes/trunk/docker/pypi/wmagent
A    wmagent/Dockerfile
A    wmagent/Dockerfile.dist
A    wmagent/README.md
A    wmagent/bin
A    wmagent/bin/manage
A    wmagent/etc
A    wmagent/etc/local.ini
A    wmagent/etc/my.cnf
A    wmagent/etc/rucio.cfg
A    wmagent/install.sh
A    wmagent/run.sh
A    wmagent/wmagent-docker-build.sh
A    wmagent/wmagent-docker-run.sh
Checked out revision 5881.
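
For completeness, the same single-directory fetch can also be done with git itself via a sparse checkout (a sketch, not what the current workflow does):

# fetch only docker/pypi/wmagent from CMSKubernetes with a shallow, blobless, sparse clone
git clone --depth 1 --filter=blob:none --sparse https://github.com/dmwm/CMSKubernetes.git
cd CMSKubernetes
git sparse-checkout set docker/pypi/wmagent
ls docker/pypi/wmagent
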
amaltaro commented 1 year ago

@todor-ivanov Todor, the curl tool doesn't bother me too much; we are using an Ubuntu image for the GH Actions and it should be fine.

However, the comment you made here https://github.com/dmwm/CMSKubernetes/pull/1393/files#r1251146936 about the docker build argument is very helpful and I totally support it. That's how we should have started with this!

Having said that, here is a WMCore PR updating the GH action workflow: https://github.com/dmwm/WMCore/pull/11638 and https://github.com/dmwm/CMSKubernetes/pull/1394

Lastly, a new image for wmagent-base:pypi-20230703 has been uploaded to CERN registry.
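
For completeness, a hedged example of the build-argument pattern being referred to; WMA_TAG is an illustrative ARG name and not necessarily the one declared in the CMSKubernetes Dockerfile:

# pass the release tag into the build instead of hard-coding it in the Dockerfile
TAG=2.2.2
docker build --build-arg WMA_TAG=$TAG -t local/wmagent:$TAG .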

todor-ivanov commented 1 year ago

Hi @amaltaro, thanks for adopting this approach of passing the $TAG as an external parameter to the Docker build command. This is indeed the proper approach, in my opinion as well.

But I do not understand this argument:

the curl tool doesn't bother me too much; we are using an Ubuntu image for the GH Actions and it should be fine.

I did not object to the usage of the curl command because I did not like it, or because it was somehow related to the type of image we are using inside the CI/CD pipeline. I objected to it because it has no way of recursively fetching contents from a GitHub project given a sub-directory of the project; in this case those sub-directories are docker/pypi/*. This prevents all those Dockerfiles from using any ADD command. And indeed, if one goes there and checks them, one would immediately notice that none of them contains ADD.

My point is: we should not mutilate the functionality of the Docker containerization system just because we do not use a tool capable of doing recursive operations in our CI/CD pipeline.