dmwm / CRABServer

review new WMCore Runtime Doc and update code as needed #7084

Closed belforte closed 1 year ago

belforte commented 2 years ago

see https://github.com/dmwm/WMCore/issues/10970#issuecomment-1039742503 and especially https://github.com/dmwm/WMCore/wiki/Notes-about-environment-variables-passed-to-the-Scram-environment-or-modified-when-running-the-CMSSW-executable

belforte commented 2 years ago

In the end, the explanation for all the issues I had with the environment is a (big) initial misunderstanding. I thought that the job payload was executed inside Scram() in a clean environment, simply because it was done that way in the past (although in the COMP environment). Instead, as @khurtado explained:

cmsRun is executed through a subprocess call (+ a wrapper for the CMSSW setup) and not with Scram() , those variables are not lost (which is why the WM PYTHONPATH is removed by hand before executing it, but then put back for other stuff that needs it afterwards in the executor)

Also, quoting from https://github.com/dmwm/WMCore/wiki/Notes-about-environment-variables-passed-to-the-Scram-environment-or-modified-when-running-the-CMSSW-executable :

The CMSSW executable (cmsRun) is not executed directly by the executor with Scram(). Instead, it runs a wrapper script that sets up its own environment (named stepName-main.sh, e.g. cmsRun1-main.sh). This bash script is written on the fly and defined HERE.

Since this wrapper script is setting up its own scram environment, the script itself is called by simply using a subprocess call. However, some environment variables need to be overridden in order to avoid problems between the CMSSW environment and the WM environment.

The environment override is defined HERE and basically makes sure that:

- The WM PYTHONPATH is not passed in the subprocess call.
- It sets XRD_LOADBALANCERTTL to work around a problem at CERN related to the GSI authentication plugin and EOS with XRootD.
- It sets the HOME environment variable.

After all these changes in the environment, the CMSSW executable is invoked through this wrapper script HERE. However, since we clean up the WM PYTHONPATH in the os system, other steps (e.g. in the stepChain workflow) would fail after this if they can't find the WM libraries, so the original WM PYTHONPATH is put back in the environment after calling the CMSSW executable/cmsRun wrapper.
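The override described above can be illustrated with a minimal Python sketch. This is not the WMCore code: the function name `run_wrapper`, the probe command, and the values given to XRD_LOADBALANCERTTL and HOME are all hypothetical placeholders. It also uses a slightly different technique from the one quoted: instead of removing PYTHONPATH from the process environment and putting it back afterwards, it passes a modified copy of the environment to the subprocess, so the parent's PYTHONPATH never changes.

```python
import os
import subprocess
import sys


def run_wrapper(cmd, overrides, drop=("PYTHONPATH",)):
    """Run cmd with a modified copy of the current environment (illustrative only)."""
    env = os.environ.copy()      # the parent's os.environ is left untouched
    for name in drop:
        env.pop(name, None)      # e.g. hide the WM PYTHONPATH from the payload
    env.update(overrides)        # e.g. XRD_LOADBALANCERTTL and HOME
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=False)


if __name__ == "__main__":
    # Stand-in for the wrapper script: a child Python that prints what it sees.
    probe = [
        sys.executable,
        "-c",
        "import os; print(os.environ.get('PYTHONPATH', '<unset>')); "
        "print(os.environ['HOME'])",
    ]
    # Placeholder values, not the ones WMCore actually sets.
    result = run_wrapper(probe, {"XRD_LOADBALANCERTTL": "3600", "HOME": os.getcwd()})
    print(result.stdout)
```

Because only the child sees the modified environment, later steps in the parent process still find their libraries without any explicit restore step.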

So this is significantly different from the current CRAB approach, and we need to decide whether to change the CRAB Job Wrapper to follow more closely what WMCore does, and how closely, since we do not want to reproduce all the WMA Step machinery. Maybe it is as simple as using Scram(envCmd=...) (from HERE) to clean up $PYTHONPATH and then running Scram() in the job start environment? @amaltaro @khurtado @mapellidario @dciangot your input is more than welcome!

By the way, currently CRAB does nothing about

set XRD_LOADBALANCERTTL to workaround a problem at CERN related to the GSI authentication plugin and EOS with XRootD

How worried should we be? Does anybody know what this is? Is there any document/ticket/issue/elog about it?

khurtado commented 2 years ago

@belforte It looks like XRD_LOADBALANCERTTL and HOME were added to fix these 2 issues from 2015/2016, both for Tier0 jobs running at CERN:

XRD_LOADBALANCERTTL: https://hypernews.cern.ch/HyperNews/CMS/get/edmFramework/3572.html https://github.com/dmwm/WMCore/pull/6325

HOME: https://hypernews.cern.ch/HyperNews/CMS/get/edmFramework/3654.html https://github.com/dmwm/WMCore/issues/6894 https://github.com/dmwm/WMCore/pull/6325

Whether those are still a problem at CERN or not, I honestly don't know though.

belforte commented 2 years ago

Thanks. Wow, that's ancient stuff (2015!). Given that the original thread was hinting at xroot client v4.2 being a possible solution with respect to 4.0.4, and that we now run v4.5.0, I am not going to worry.

$HOME is a different story, it is still needed.

belforte commented 2 years ago

This is also the time to find a definitive solution to https://github.com/dmwm/WMCore/issues/10257 , basically to put the CRAB vs. WMCore JobWrapper on firm ground. Is there some common environment and code that we can share?

belforte commented 2 years ago

Alan asked for a Google Doc as a start, but I found it easier to start with a GH wiki, which can hopefully be turned into a bit of permanent documentation for CRAB developers: https://github.com/dmwm/CRABServer/wiki/RunTime-CRAB-vs-WMCore

Anyhow I also copied the markdown text here https://docs.google.com/document/d/13IIxPGbQS3a3k0Vl0j3o9ivJSO0ZoMtg5D8XN8xXSFE/edit?usp=sharing

@mapellidario please review and let's make sure that it makes sense from our side first, then we will ask Alan to have a chat about it

mapellidario commented 2 years ago

As I progress with the review of our jobwrapper, comparing it with the current WMCore one, I will add here a list of action items:

belforte commented 2 years ago

I am not sure that multiple architectures make sense for CRAB unless we drastically change other things.

belforte commented 1 year ago

almost time to raise priority to critical: https://cms-talk.web.cern.ch/t/crab-test-cmssw-12-6-x-invalid-site-local-config/15423/1

mapellidario commented 1 year ago

I addressed this issue in the PRs:

All these are included in the latest CRABServer tag https://github.com/dmwm/CRABServer/releases/tag/v3.230220 and have been running in production since Wednesday morning.

I consider this issue as completed and move further discussion about the jobwrapper to new issues. If anybody does not agree, feel free to re-open this issue!