dmwm / CRABServer


Migrate CRAB to PYPI ecosystem #7560

Open novicecpp opened 1 year ago

novicecpp commented 1 year ago

In short, WMCore has already moved to PyPI in production, and we are happy to move too. But we need to discuss with the WMCore developers how much effort is needed and whether they can provide documentation and some support.


Below is the info moved from the meeting minutes:

novicecpp commented 10 months ago

I am ready to open the POC PR.

The design choices of the POC, in summary:

novicecpp commented 10 months ago

Next tasks:

novicecpp commented 9 months ago

Let me explain the POC of PyPI https://github.com/dmwm/CRABServer/pull/8088 here. First of all, forget the design choices https://github.com/dmwm/CRABServer/issues/7560#issue-1607196899 for now; I will come back to them later.

The goal is "Remove RPM build".

Because we remove the RPM "building system", we need to come up with our own replacement. What I actually do is convert crabtaskworker.spec into our own build system, based on the PyPI containers from WMCore. Note that I did not touch the Python packaging, which is worth tracking in a separate issue, but the image building will involve the CI to some extent.

Things I keep in mind while developing this:

  1. The fewer code changes in src/python, the better.
  2. Do not clone CRABServer; use the local files. This simplifies running both on a local machine and in CI: if there is a change, the builder always picks it up from the local files, with no need to commit and push. And in the gitlab-ci model, CI jobs always run on a commit, so there is no need to pass repo/branch/commit to the CI. Just push, and CI will run.
  3. If we need some files from the internet, commit the endpoint in this repo. For example, the WMCore version I want to install is in cicd/crabserver_pypi/wmcore_requirements.txt. Any format is fine as long as it is committed and CI can pick it up.
  4. It should be possible to run everything on your local machine.

So here is what I came up with:

  1. **Building.** Simply run a single docker build command from the repo's root directory:

    # go to git root directory
    cd $(git rev-parse --show-toplevel) 
    # build crabserver
    docker build -t registry.cern.ch/cmscrab/crabserver:manual . -f cicd/crabserver_pypi/Dockerfile
    # build taskworker
    docker build -t registry.cern.ch/cmscrab/crabtaskworker:manual . -f cicd/crabtaskworker_pypi/Dockerfile
  2. **Running.** Because the Dockerfiles of both images are a bit long, I will show how I run the containers first; I hope it will help you understand the Dockerfiles better.

    • TaskWorker: the ENTRYPOINT is the tini wrapper (please see the explanation in the design choices), and the CMD executes crabtaskworker_pypi/run.sh. run.sh calls crabtaskworker_pypi/start.sh, which spawns the TaskWorker process in the background, then sleeps forever (see the sketch after this list).
    • CRABServer: CRABServer is a bit more complicated because it needs to be integrated with the K8S manifest. There is a setup step created by the CMSWEB operators; I adapted it, put it together in crabserver_pypi/entrypoint.sh, and made this script the ENTRYPOINT. Basically, the container runs the entrypoint.sh script up to line 53, then execs crabserver_pypi/run.sh, which spawns the crabserver process and runs cat on the fifo log forever.
  3. **Dockerfile.** The Dockerfile should be a self-explanatory record of how I build the container. Please see "Design Choice" for questions of the sort "why is it doing this and that?", and please let me know if anything is not clear enough to understand.
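To make the run flow concrete, here is a minimal sketch of what crabtaskworker_pypi/run.sh does, assuming start.sh sits next to it (illustrative only, not the real script):

    #!/bin/bash
    # minimal sketch of crabtaskworker_pypi/run.sh; tini (the ENTRYPOINT)
    # supervises this script, forwards signals, and reaps zombies

    # start.sh spawns the TaskWorker process in the background
    "$(dirname "$0")/start.sh"

    # then keep the container alive forever
    exec sleep infinity

The CRABServer side is analogous, except that entrypoint.sh ends with an exec of run.sh, and run.sh blocks forever on a cat of the fifo log instead of a plain sleep.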

novicecpp commented 9 months ago

There are some changes to how I install the WMCore code, in order to support forked versions. I will update the PR later.

novicecpp commented 8 months ago

Here is the guide on how I build and run the CRAB REST image. (Bear with me, it is a bit long; the CRAB TaskWorker image will be covered in another reply.)

To build it, simply run:

docker build -t registry.cern.ch/cmscrab/crabserver:manual . -f cicd/crabserver_pypi/Dockerfile

The working directory is path/to/crabserver/repository.

The Dockerfile is straightforward:

Then we execute wmc-httpd to start the REST service. Note that I simply copy the code from the source tree to /data/srv/current/lib/python3.8/site-packages and export PYTHONPATH to this directory in manage. This will change to proper PyPI packaging installed via pip in the future.
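As a sketch, the install step amounts to the following (paths are taken from the text above; the exact Dockerfile commands may differ):

    # copy the CRAB source into the conventional install location
    mkdir -p /data/srv/current/lib/python3.8/site-packages
    cp -r src/python/. /data/srv/current/lib/python3.8/site-packages/

    # manage then makes the code importable for wmc-httpd
    export PYTHONPATH=/data/srv/current/lib/python3.8/site-packages:${PYTHONPATH:-}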


There is a "deployment convention" for WMCore/REST services that specifies how services read/write files (logs/credentials/state/config/etc.). This explains why we need Dockerfile lines 42-46, and it also affects how we deploy in the Kubernetes environment, which I will explain in a later section.

There are 4 directories where the service reads or writes files:

  1. LOGDIR=/data/srv/logs/crabserver This is where services write their logs. In the 2022 version and earlier, this directory was mounted on CephFS, but since we switched to writing logs to stdout we no longer use it.

  2. AUTHDIR=/data/srv/current/auth/crabserver This is where we store CRABServerAuth.py and the robot cert/hmac/proxy(?) set up by CMSWEB. Services read these files via an exported PYTHONPATH, and the run.sh script avoids copying config.py into this directory.

  3. STATEDIR=/data/srv/state/crabserver This is where services write temporary files (the -d argument of wmc-httpd in manage#L51). We also keep the fifo file crabserver-fifo here, which the service writes its logs to.

  4. CFGDIR=/data/srv/current/config/crabserver This is where config.py is stored. The service reads only config.py, passed as the positional argument of wmc-httpd.
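Put together, the start command in manage looks roughly like this (a sketch: only the -d flag and the positional config.py are stated above, so any other required wmc-httpd flags are omitted):

    LOGDIR=/data/srv/logs/crabserver
    AUTHDIR=/data/srv/current/auth/crabserver
    STATEDIR=/data/srv/state/crabserver
    CFGDIR=/data/srv/current/config/crabserver

    # make CRABServerAuth.py importable next to the service code
    export PYTHONPATH=${AUTHDIR}:${PYTHONPATH}

    # fifo that the service writes its logs to
    [ -p "$STATEDIR/crabserver-fifo" ] || mkfifo "$STATEDIR/crabserver-fifo"

    # -d points wmc-httpd at the state dir; config.py is the positional argument
    wmc-httpd -d "$STATEDIR" "$CFGDIR/config.py"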

In the new PyPI image, the convention and the paths are still the same, but the old manage script is replaced with a new one based on the manage in the dmwm-base image.


The next part is how we deploy this CRAB REST on the CMSWEB Kubernetes cluster.

Currently, the container runs the old run.sh, which performs both the CMSWEB setup steps and the service startup.

The new PyPI image still works the same way, but the upper (setup) part of the old run.sh is replaced by the new entrypoint.sh. The container specs section of the Kubernetes manifest will then look like this:

containers:
  - image: registry.cern.ch/crab/crabserver:pypi-2ea9d1429ebe56c527b53f17b4adb4503a4f095f
    name: crabserver
    command:
      - /data/entrypoint.sh
    args:
      - /data/run.sh
    ...

The new process execution flow will be: kubelet starts the container with /data/entrypoint.sh as the command and /data/run.sh as the argument; entrypoint.sh performs the setup steps and then execs run.sh, which spawns the crabserver process and runs cat on the fifo log forever.


Finally, the dependencies.

I use wmagent-base, which is based on dmwm-base, as the base image of CRAB REST. This image provides:

For the Python dependencies installed via pip from requirements.txt, I stripped it down from WMCore's requirements.txt by trial and error, adding packages to our requirements.txt until there were no more dependency errors.

The WMCore version is specified in wmcore_requirements.txt. It is read by requirementsParse.py, which prints the space-separated repository and version (tag). The current POC only takes the version number and passes it to pip; it cannot yet replace the source with a custom repository.
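A sketch of how a build script can consume this (requirementsParse.py and its space-separated "repository version" output are described above; the variable names and the pip requirement spec are assumptions):

    # read "repository version" as printed by requirementsParse.py
    read -r WMCORE_REPO WMCORE_VERSION < <(python3 cicd/crabserver_pypi/requirementsParse.py)

    # the current POC only uses the version and installs from PyPI;
    # installing from $WMCORE_REPO (e.g. a fork) is not supported yet
    pip install "wmcore==${WMCORE_VERSION}"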


I deliberately do not pass any arguments to docker build but commit them as files (e.g., wmcore_requirements.txt). This way CI can detect changes and build the new image automatically. A ./start.sh -g equivalent feature is not available at the moment.

novicecpp commented 8 months ago

Now it is time for the guide to the CRAB TaskWorker image.

To build it, just as for CRAB REST, simply run:

docker build -t registry.cern.ch/cmscrab/crabtaskworker:manual . -f cicd/crabtaskworker_pypi/Dockerfile

The working directory is path/to/crabserver/repository.

But this time the Dockerfile is a bit long and has many dependencies. Let's jump to line 42, where the main image starts. It is actually quite straightforward:


**gfal-***

In the new PyPI images, WMCore builds gfal in a separate Dockerfile using Miniconda. When installing it in reqmgr2ms-unmerged, they simply copy the whole Miniconda directory from the gfal image into that image and point PATH/PYTHONPATH at the Miniconda directory.

In our image we can do the same; however, the WMCore gfal does not come with gfal2-util, which provides the CLI tools gfal-copy, gfal-ls, etc. In Dockerfile lines 29-36, I simply clone gfal2-util and run python setup.py install; then in the main image (lines 49-53) I copy the Miniconda directory and symlink the tools into the /usr/bin/ path without exporting PYTHONPATH, and it works out of the box (I do not yet understand why).
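Expressed as shell, the main-image steps amount to something like this (the Miniconda paths are assumptions; in the Dockerfile the copy is a COPY --from the gfal image):

    # copy the whole Miniconda tree from the gfal image
    cp -r /gfal-image/miniconda /data/miniconda

    # symlink the gfal CLI tools into the default PATH; no PYTHONPATH export needed
    for tool in /data/miniconda/bin/gfal-*; do
        ln -sf "$tool" "/usr/bin/$(basename "$tool")"
    done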

There is a gfal2-util package in conda-forge; I will have a look later.


python-ldap

CRAB has a feature where jobs from users in the cms-crab-HighPrioUsers e-group are submitted with accounting_group set to highprio, which has higher priority than the usual analysis group. CRAB queries the e-group members through CERN LDAP in CMSGroupMapper, using the python-ldap library.

To install python-ldap via pip, libsasl2-dev, python3-dev, libldap-dev, and libssl-dev from the Debian 11 repositories (the OS of wmagent-base) are required. However, the libldap 2.4 that comes with Debian 11 is built with GnuTLS, which does not support the cacertdir option, so code changes in CMSGroupMapper would be needed. Fortunately, python>3.9 Docker images come with Debian 12, where libldap is compiled with OpenSSL, which supports cacertdir (this needs to be confirmed, but that is what I remember from the last time I debugged the LDAP error).

For the LDAP client config, we can fetch it directly from /etc/openldap of the cc7-base image provided by CERN IT.
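As a build-time sketch (the package names are from the paragraph above; the cc7-base copy is written as a plain cp, though in a Dockerfile it would be a COPY --from):

    # build dependencies for python-ldap on Debian 11 (the OS of wmagent-base)
    apt-get update && apt-get install -y libsasl2-dev python3-dev libldap-dev libssl-dev
    pip install python-ldap

    # LDAP client config taken from the cc7-base image provided by CERN IT
    cp -r /cc7-base/etc/openldap /etc/openldap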

Note that I am not sure if this feature is still in use (our code still supports it, though).


Dependencies

I use wmagent-base, which is based on dmwm-base, as the base image of the CRAB TaskWorker. This image provides:

(If I understand correctly, we do not need voms-proxy-*, but it is included in wmagent-base anyway.)

For the Python dependencies installed via pip from requirements.txt, I was a bit lazy: I copied WMCore's requirements.txt, removed gfal2-python, added python-ldap, installed with pip, and... it works.


TaskWorker data files.

The most complicated part of our TaskWorker deployment is how we build the data files. These contain TaskManagerRun.tar.gz, CMSRunAnalysis.tar.gz, and other scripts used to submit jobs to HTCondor.

According to crabtaskworker.spec (and assuming %i is the ./install_dir directory):

For the new PyPI image, those steps are still the same but simplified by removing the RPM-related lines and ignoring the updateTMRuntime.sh feature (for now), as described in Dockerfile lines 1-26:

This new_htcondor_make_runtime.sh still needs rework. For example, we could define every step in setup.py and run a single command, python setup.py install_system, to create a /data directory ready to use.

Note the inconsistency with crabserver_pypi/Dockerfile, which copies the data files to /data/srv/current/data. According to the setuptools docs, data files should go in the package directory, i.e., something like /data/srv/current/lib/python3.8/site-packages/CRABServer/data. I will fix it later.


To start the TW process, I have a new start.sh in place of the old start.sh. It sets up the necessary environment variables and then executes MasterWorker.py.

The process execution flow is simple: run.sh calls start.sh, which sets up the environment, spawns MasterWorker.py in the background, and then sleeps forever.


./start.sh -g and env.sh are not yet available. Running the Publisher process is also not supported yet.

belforte commented 8 months ago

About LDAP: we still need to support it, even if it was/is only used in case of "emergency". I never liked this dependency anyhow. Maybe there is some other way to get the list of members of that e-group? Maybe we can add that e-group to CRIC, let it do the LDAP magic, and fetch the list from CRIC with a simple curl?

novicecpp commented 8 months ago

> Maybe we can add that e-group to CRIC, let it do the LDAP magic, and fetch the list from CRIC with a simple curl?

Thanks Stefano. That is interesting; I never thought about it. I am adding it to the next to-do list.

novicecpp commented 8 months ago

Thanks to Stefano and @Panos512, we can now list the users in cms-crab-HighPrioUsers through CRIC. We can easily extend the CRIC class from WMCore by adding a new method similar to _CRICUserQuery().
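For illustration only, the equivalent "simple curl" might look like this (the endpoint, query parameters, and proxy-based auth are assumptions; the real code goes through WMCore's CRIC class instead):

    # hedged sketch: query CRIC for user/group info with a grid proxy
    curl -s --cert "$X509_USER_PROXY" --key "$X509_USER_PROXY" \
        "https://cms-cric.cern.ch/api/accounts/user/query/?json" \
        | grep -i "cms-crab-HighPrioUsers"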

novicecpp commented 7 months ago

One thing I forgot to mention about the CRAB REST image: where does WMCore/REST get the data files for the CRAB UI?

The answer is this file, FrontPage.py: the absolute path of ./data/{html,script,css} is defined there, and the magic comes from line 13.

So, as long as you deploy the data directory two levels above the PYTHONPATH entry (in this case, at the same level as the lib directory of lib/python3.8/site-packages), it always works, no matter what the PYTHONPATH is.
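In shell terms, the effect is the following (a sketch of the resulting layout, not the actual Python in FrontPage.py):

    # the service code is on PYTHONPATH here...
    SITE_PACKAGES=/data/srv/current/lib/python3.8/site-packages
    # ...and the data directory is resolved relative to it, at the same level as lib/
    DATA_DIR=$(realpath "$SITE_PACKAGES/../../../data")   # -> /data/srv/current/data
    ls "$DATA_DIR/html" "$DATA_DIR/script" "$DATA_DIR/css"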

That is why Stefano added this hack to addGH.sh to make the web UI work in GH mode. I think the new PyPI image will also use this hack, and we will fix it later when we support installing CRAB with pip.

belforte commented 7 months ago

about the previous comment https://github.com/dmwm/CRABServer/issues/7560#issuecomment-1941698865: it goes without saying that any idea about how to simplify is welcome. I can sort of understand the original implementor's line of thought, that there is a generic application in /data/srv/.../myApp and below that there is lib with the code and data with files including html, but we would surely prefer that the location on the server and the location in the GH tree be more closely related.

novicecpp commented 7 months ago

I want to move it into the data files and refer to the path from PYTHONPATH. Also, pip will handle updates for us when we install an updated version.

novicecpp commented 5 months ago

For the last part, the Publisher.

It is simpler than I thought. No extra environment variables are needed. The startup scripts can be copy-pasted from the TaskWorker (start.sh/stop.sh/manage.sh in crabtaskworker_pypi) and then tailored to the Publisher's needs.

However, there are 2 major changes I want to introduce and 1 bug I discovered (compared to the code in the current master).

  1. Share run.sh, like the RPM image (see the sketch after this list).

I was reinventing the wheel here :sadge: The new run.sh is placed in /data/run.sh, and which start.sh gets executed depends on $SERVICE (e.g., /data/srv/${SERVICE}/start.sh), just like the RPM run.sh. I tried keeping them separate once, but the code is 80% the same.

  2. The symlinks are created in the run.sh step, not in the container build step.

In the current master, I did this in the build step (Dockerfile#L104-107), but then realized that there is an extra directory that needs to be symlinked for the Publisher (PublisherFiles).

  3. Bug(?) in the original Publisher/stop.sh: the service cannot be started if the time of the next cycle has already passed.

This is what I discovered when I tried to run the Publisher PyPI image for the first time. The problem is the result of a combination of 2 things:

We could fix the condition and everything would be fine. But I think the function is a bit fragile and there should be a better way to do this, so I am not fixing it. Instead, I separated the start and restart commands so that the start action only runs the start_srv function.
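Two sketches for points 1 and 3 above (the $SERVICE dispatch and start_srv are mentioned above; stop_srv is an assumed counterpart):

    #!/bin/bash
    # sketch of the shared /data/run.sh: per-service symlinks are created here,
    # then the matching start.sh is executed based on $SERVICE
    exec "/data/srv/${SERVICE}/start.sh"

and the start/restart split:

    # sketch of the separated commands in the manage-style script: start only
    # starts, so it can no longer be broken by the old stop-then-start logic
    case "$1" in
      start)   start_srv ;;
      restart) stop_srv; start_srv ;;
    esac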


PR is coming soon.

belforte commented 5 months ago

I guess we never tried until now to stop a Publisher which had never run! Fine with the change, but you need to put some protection in the new start so that it fails if the Publisher is already running, or falls back to restart.

With a bit of day-after wisdom, I wonder if we have too many symlinks and should rather rearrange the directory tree. All those directory names are configuration variables, so it should be easy to rename them.

novicecpp commented 5 months ago

> With a bit of day-after wisdom, I wonder if we have too many symlinks and should rather rearrange the directory tree. All those directory names are configuration variables, so it should be easy to rename them.

IMO, the only symlink we still need is for the config file, so we can switch to other files on the fly without stopping puppet. Everything else can be replaced by configuration plus docker run -v /host:/container (runContainer.sh would need modifying).

novicecpp commented 5 months ago

With https://github.com/dmwm/CRABServer/pull/8368/commits/c547da07104247eda74761db8e1261f285ed7da5, the first line of DOCKER_VOL will expand to something like this:

+ DOCKER_VOL='-v /data/container/:/data/hostdisk/ -v /data/container/Publisher_schedd/cfg:/data/srv/Publisher/cfg -v /data/container/Publisher_schedd/logs:/data/srv/Publisher/logs -v /data/container/Publisher_schedd/PublisherFiles:/data/srv/Publisher/PublisherFiles -v /data/srv/tmp/:/data/srv/tmp/'

Then, config.General.taskFilesDir is switched to /data/srv/Publisher/PublisherFiles in Publisher_scheddConfig.py#L35.

belforte commented 5 months ago

thanks @novicecpp, given what you wrote in the previous comment, which I fully agree with, I think that we do not need all of those mounts either. Shall we open an ad-hoc thread and review this?


novicecpp commented 4 months ago

I encountered this problem today after stopping the Publisher for too long while making sure the TW was up and running. Let's create a new issue to tackle this.

novicecpp commented 1 month ago

I think the last part here is documentation.

I remember it was hard for Vijay to figure things out even though we already had a new, simpler Dockerfile. It is not like we just pip install crabtaskworker and we are done: there are specific paths in the container that we use, and a specific way to run the containers.

At the very least, we should give a new person a map of where to look.