make preparelocal use S3 for tarball and unify with --dryrun

belforte commented 3 years ago

currently preparelocal creates a tarball with all needed stuff for executing the job wrapper (inside DagmanCreator) called InputFiles.tar.gz and sends it to the schedd, from where crab preparelocal fetches it to create local directory where to run the job.

Such tarball should be transferred via S3 cache instead, and possibly with same code as for --dryrun.

Even better, --dryrun should be executed inside the directory created by preparelocal, and should not be part of the submit command.

something like:

submit --dry : does all things like submit, uploads all tarballs, but does not submit to schedd
preparelocal : like now, creates local dir for a task submitted or submitted with --dry
crab testrun (new command) runs in the local director for 10 events (like current submit --dry)

Also, currently in the schedd there are both InputFiles.tar.gz and input_files.tar.gz ! pretty damn confusing

Difficulty is to have a way to implement a piece at a time, w/o breaking things.

belforte commented 1 year ago

should be enough to go in this order:

make crab preparelocal use things from S3 and never talk to sched
implement crab testrun as something to do after crab preparelocal
change submit --dry to be stop after uploading and point user to crab testrun

belforte commented 1 year ago

this relates with https://github.com/dmwm/CRABServer/issues/7461

novicecpp commented 2 months ago

~~The goals of this issue~~ The things we agreed in the last meeting are:

We can deprecate crab submit --dryrun behavior.
- Current behavior is we upload "custom" InputFiles.tar.gz to S3 and set tasks status to UPLOADED (see DryRunUploader). Then, client pull this custom inputFiles.tar.gz, extract, and run locally. Each time it run will increase number of event it consumed and measure a time to give user idea how to adjust splitting parameter.
- Users more likely want to check if crabConfig.py/PSet.py able to run in grid than see how long does it take (citation needed).
Nothing change to preparelocal. Download the submitted file and prepare env for run locally. Note that to do preparelocal, user need to submit the real task with limited total job to run.

novicecpp commented 2 months ago

I would like to explain how we move input files (wrapper, scripts, and user sandbox) in crab systems first.

In the normal job submission:

First, user submit the task, client upload the sandbox to S3 (s3://crabcache/<cernusername>/sandboxes/<sha1>.tar.gz).
Second, TW create InputFiles.tar.gz as zipped executable to transfer to schedd later https://github.com/dmwm/CRABServer/blob/9b4679d14bb19ccc7373d56c20631eea34d80a69/src/python/TaskWorker/Actions/DagmanCreator.py#L726-L729.
- This include user sandbox.tar.gz where we download back from S3 then add to the list of InputFiles https://github.com/dmwm/CRABServer/blob/9b4679d14bb19ccc7373d56c20631eea34d80a69/src/python/TaskWorker/Actions/DagmanCreator.py#L1239-L1240 List of files in InputFiles.tar.gz is described in here
Third, the InputFiles.tar.gz will be "submitted" to schedd along with subdag.jdl (I do not know what it is) through transfer_input_files https://github.com/dmwm/CRABServer/blob/9b4679d14bb19ccc7373d56c20631eea34d80a69/src/python/TaskWorker/Actions/DagmanSubmitter.py#L540 of condor submit command.
Forth, in schedd, dag_bootstrap_startup.sh extract InputFiles.tar.gz to SPOOL_DIR https://github.com/dmwm/CRABServer/blob/9b4679d14bb19ccc7373d56c20631eea34d80a69/scripts/dag_bootstrap_startup.sh#L54
Finally, when Dagman submit the job, it use transfer_input_files again but only upload a necessary files. You can see the list here and here https://github.com/dmwm/CRABServer/blob/9b4679d14bb19ccc7373d56c20631eea34d80a69/src/python/TaskWorker/Actions/DagmanCreator.py#L537-L552

novicecpp commented 2 months ago

So, to unify preparelocal and submit --dryrun, I will make submit --dryrun to have the similar UX as preparelocal but without submitting to the schedd. This functionality is for users to test their new tasks before real crab submission (Thanks to Dario for the idea and pointout the pain of preaparelocal).

This is why I said earlier that I want to deprecate the behavior, not remove the command.

This is possible and easy to do in client side. Thanks to Dario again for crab recover command that use other command like crab status/kill behind the scence.

Basically, crab submit --dryrun will submit the task to let TW create the InputFiles.tar.gz, wait until task status change to UPLOADED and then continue to trigger preparelocal command.

For the server side, well..a lot of code changes needed, obviously the InputFiles.tar.gz need to upload to S3, and spliting sandbox from InputFiles.tar.gz to not reupload the sandbox again and again, especially the compaign-like task where it share sandbox.

I will write the detail down tomorrow.

novicecpp commented 2 months ago

And of course thanks to Stefano for the original ideas on how to unify both commands and the attempted of improving the --dryrun code (both client and server side) that I can reuse it to fix this issue.

belforte commented 2 months ago

sounds like you know more than me about this matter now :-)

subdag.jdl is used for automatic splitting, we do not want users to run automatic splitting machinery interactively via --dryrun

novicecpp commented 2 months ago

Here is the pointer to the code: Server: PR https://github.com/dmwm/CRABServer/pull/8645 Client: PR https://github.com/dmwm/CRABClient/pull/5329

These PR changes 3 things:

Separate sandbox.tar.gz from InputFiles.tar.gz
- InputFiles.tar.gz is the file contains everything we need to run in schedd and work node.
- This is needed to unify the code between submit, submit --dryrun, and preparelocal.
crab submit --dryrun do preparelocal, but the task is not submit to schedd.
preparelocal --jobid 1 simply call preparelocal and execute in one go
- When I read the preparelocal code, I feel like it does not make any sense to have a different way to run the job. I might be wrong though.

Because I separated sandbox from InputFiles.tar.gz, I am not sure if some other commands or code have the assumption that it expects sandbox in there, beside the code I changed. So, it needs more test. But the simple test is passed though https://gitlab.cern.ch/crab3/CRABServer/-/pipelines/7990590 ( the branch base on master 7f4e9eed).

novicecpp commented 2 months ago

@belforte I have moved this task to Todo in "CRAB Work Planner". Feel free to bump priority as you see fit.

belforte commented 1 month ago

I looked at the code and have only some minor questions. Time to lay out a plan. As a general strategy I'd rather break submit --dryrun for a while then mix old and new code. Cleaning up has always been difficult.

[x] 1. add support for checkS3Object in RESTCache . This can go in immediately and deploy new REST
[x] 2. add checkS3Object to ServerUtilities
[x] 3. Modify AdjustSites and dag_bootstrap_startup to download sandbox from S3. Can go in at any time
[x] 4. Modify DagmanCreator to stop downloading sandbox.tar.gz and upload slimmed InputFiles.tar.gz. Can go in at any time after 3.
[x] 5. could changes to Client go in here ? Need to test
[x] 6. Modify Handler to use new DryRun.py instead of DryRunUploader.py
[ ] 7. Deploy new Client if not done at 5.
[x] 8. take this chance to also fix https://github.com/dmwm/CRABClient/issues/5288 possibly adding the new tarball

belforte commented 1 month ago

OK. Let's deploy the change to DagmanCreator ASAP so we can push new client in production.

belforte commented 4 weeks ago

back to this. Code from Wa is in https://github.com/novicecpp/CRABServer/tree/get_sandbox_in_schedd I have now imported that branch in my repo: https://github.com/belforte/CRABServer/tree/Sandbox-in-Sched-from-Wa

belforte commented 3 weeks ago

1 and 2 done in https://github.com/dmwm/CRABServer/pull/8740 3 done in https://github.com/dmwm/CRABServer/pull/8743

will do 4 once those are tested and merged 4 done in #8745

belforte commented 2 weeks ago

doing 4. I found a bug in #8740 which was only affecting sandbox existance check in new code. Will fix in same PR as to push 4.

But I have also found that we also still need debug_files.tar.gz to be moved around via TW, which creates confusion. Should change AdjustSites.py to download that and create that debug_files directory at bootstrap time. Better yet find a way to upload them as files to S3, like down for twlog, but that may require extensve changes to Client, ServerUtilities and RestCache. Current code supports uploading of a single tarball using objectype=debugfiles. But we want 3 separate files for config, pset and scriptExe

belforte commented 2 weeks ago

let's go with a:

[x] 4.5 expand debug_files.tar.gz in AdjustSites and stop handling it in DagmanCreator - #8747

belforte commented 2 weeks ago

at the moment preparelocal in current client does not work with new TW and new client does not work with current TW. I need to make at least one of them backward compatible.

belforte commented 2 weeks ago

I decided to go in steps.

deploy a Client compatible with new handling of tarballs via S3 https://github.com/dmwm/CRABClient/issues/5340
deploy TW changes done until now
do further changes for dryrun and do not worry if it stays broken until client is updated. Or maybe simply deprecate the "run" part of submit --dryrun and tell uses to use preparelocal and run_job

@aspiringmind-code @novicecpp I will gladly get your advice if you feel like suggesting something different

belforte commented 6 days ago

time to look at --dryrun . Problem is how to make a smooth transition. Let's start with a compatibility matrix

client	TW 241017 (prod )	TW 241018 (preprod)
240930 (prod)	FAILS [1]	FAILS [2]
241029 (IB-dev)	FAILS [1]	FAILS [2]
240621 (pre)	WORKS	FAILS [2]

so long for a "smooth transition" :-)

At least desperate users have a fall-back until we update TW.

[1]

WARNING: In non-interactive mode release checks e.g. deprecated releases, production architectures are disabled.
Traceback (most recent call last):
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_77/bin/scram.py", line 114, in <module>
    main()
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_77/bin/scram.py", line 109, in main
    if not execcommand(args, opts):
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_77/bin/scram.py", line 103, in execcommand
    return eval('scram_commands.cmd_%s' % cmds[0])(args, opts)
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_77/SCRAM/Core/CMD.py", line 58, in cmd_project
    return process(args)
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_77/SCRAM/Core/Commands/project.py", line 71, in process
    return project_bootfromrelease(project.upper(), version, releasePath, opts)
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_77/SCRAM/Core/Commands/project.py", line 116, in project_bootfromrelease
    area = relarea.satellite(installdir, installname, symlink, Core().localarea())
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_77/SCRAM/Configuration/ConfigArea.py", line 182, in satellite
    utime(join(devconf, "Self.xml"))
FileNotFoundError: [Errno 2] No such file or directory

possibly files are created in the tmp directory in a way not compatible with latest CMSRunAnalysis.py

[2]

Created temporary directory for dry run sandbox in /tmp/belforte/tmpyycxyjaq
ERROR: FileNotFoundError: [Errno 2] No such file or directory: '/tmp/belforte/tmpyycxyjaq/sandbox.tar.gz'

traceback

  File "/cvmfs/cms.cern.ch/share/cms/crab-prod/v3.240930.00/lib/CRABClient/Commands/submit.py", line 401, in executeTestRun
    tf = tarfile.open(os.path.join(tmpDir, name))
  File "/cvmfs/cms.cern.ch/el8_amd64_gcc12/external/python3/3.9.14-4612d00f9f0430a19291545f1e47b4a4/lib/python3.9/tarfile.py", line 1620, in open
    return func(name, "r", fileobj, **kwargs)
  File "/cvmfs/cms.cern.ch/el8_amd64_gcc12/external/python3/3.9.14-4612d00f9f0430a19291545f1e47b4a4/lib/python3.9/tarfile.py", line 1684, in gzopen
    fileobj = GzipFile(name, mode + "b", compresslevel, fileobj)
  File "/cvmfs/cms.cern.ch/el8_amd64_gcc12/external/python3/3.9.14-4612d00f9f0430a19291545f1e47b4a4/lib/python3.9/gzip.py", line 173, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/belforte/tmpyycxyjaq/sandbox.tar.gz'

belforte commented 6 days ago

so currently --dryrun is broken since 10 days (interesting that no user reported it). And we way want to deploy new TW anyhow. So I only need to fix thing for "a new client"

OTOH if we deploy a client which works with current TW, nobody can complain !

So let's see if I can achieve that.

belforte commented 6 days ago

New client requires to update REST.

belforte commented 6 days ago

REST updated :-)

belforte commented 3 days ago

I tried various things, but too much gets broken. Better break dryrun for crab-pre as well, i.e. deploy new TW and then a modified client which "resurrects it". In the meanwhile user will have to do with crab preparelocal.

Main problem is that current DryRunUploader action creates as runtimefiles object a tarfaile which contains InputFiles.tar.gz and this confused the code in prepareLocal which expects to get directly InputFiles.tar.gz from S3.

Best solutions seems to me to modify DagmanCreator so that splitting-summary.json is added in InputFiles.tar.gz and then "nullify" DryRunUploader as per this and replace with DryRun

In other words:

deploy new client
fix TW with all new code and let dryrun stay broken until new TW is in place

belforte commented 2 days ago

I believe I converged on implementation. We need to deploy new client v3.241030 in order to update TW.

Then I am going for:

add an Uploader action after DagmanCreator, always. So tasks always go through the UPLOADED status
if --dryrun, stop there, otherwise add also DagmanSubmitter action

DagmanCreator creates InputFiles.tar.gz and Uploader uploads it to S3

DagmanCreator also creates splitter-summary.json used for --dryrun and adds it to InputFilers.tar.gz

submit --dryrun is modified to use the new JSON way to pass arguments to CMSRunAnalysis.py and gets simplere and more readable, and keeps working.

my current branch https://github.com/belforte/CRABServer/tree/make-dryrun-work-again should be all that's needed TW-side

I will further simplify by removing the --jobid option for crab preparelocal. Users will have these possibilities:

crab submit --dryrun : same look as now, simply runs for 25 seconds, limited usefulness
crab preparelocal AND sh run_job.sh to do a proper test/debug of their jobs Everything will run in projdir/local and files will stay after command completes, no tmp directories !

belforte commented 1 day ago

time for a new, extended, compatibility matrix

client	TW 241017 (prod )	TW 241018 (preprod)	PR 8767
241030 (prod) submit	OK	OK	OK
241030 (prod) preparelocal	OK	OK	OK
241030 (prod) dryrun	FAIL [k1]	FAIL [k2]	FAIL [k2]
my new branch submit	OK	OK	OK
my new branch preparelocal	OK( FAIL [nt] )	OK	OK
my new branch dryrun	FAIL [eof]	FAIL [eof]	OK

Error legend: [k1] known error [1] in https://github.com/dmwm/CRABServer/issues/6544#issuecomment-2452221620 [k2] known error [2] in https://github.com/dmwm/CRABServer/issues/6544#issuecomment-2452221620 [eof] ERROR: EOFError: Compressed file ended before the end-of-stream marker was reached during untar [nt] no InputFiles.tar.gz in S3 (fail to retrieve runtimefiles ) - only happens if task has not bootstrapped yet

belforte commented 1 day ago

so maybe I do not need "my new" TW.

But it is not acceptable that new client will fail to do preparelocal on tasks submitted with current TW.

Let's see if I can enhance https://github.com/belforte/CRABClient/tree/new-preparelocal-dryrun to solve [nt]

belforte commented 1 day ago

[nt] failure is due to running preparelocal before task has bootstrapped on the scheduler. With new TW all needed files will be available in S3 after DagmanCreator has run, but with previous TW one has to wait for files to be in the WEB_DIR. As simple as "I was testing w/o waiting long enough" since now after a task is in SUBMITTED it takes > 1min for bootstrap to run due to https://github.com/dmwm/CRABServer/commit/83b8daca8fe4d35020453b704666a75605915881

This means that when we will deploy this (after TW v3.241018 or later) there will no problem in running preparelocal on tasks submitted "up to 1 month ago" with TW v3.241017 (current)

I verified this on a recently submitted user task https://cmsweb.cern.ch/crabserver/ui/task/241106_185715%3Askeshri_crab_370355_Muon0 crab remake + status + preparelocal + sh run_job.sh 1 worked like a charm

belforte commented 1 day ago

conclusion: https://github.com/dmwm/CRABServer/pull/8767 will close this from TW side. Leaving to future work cleaning up of compatibility and further reductions/simplification of all that passing around of files.

belforte commented 1 day ago

And of course there's no point in making crab --dryrun submit work with TW's deployed before that client version ! So once we deploy TW as of #8767 we can then safely update CRAB Client and everything will work

belforte commented 1 day ago

only remaining issue that new client with new TW prints an annoying message for crab preparelocal about the failure to retrieve webdirproxy information from REST until the task bootstrap has completed. Even if that info is not needed. Same for submit --dryrun

The noise comes from https://github.com/dmwm/CRABServer/blob/483adb22e60031b98fb5c157b6a0b11e09fc78d7/src/python/ServerUtilities.py#L269

which in this cases uses CRABClient.RestInterfaces.CRABRest implementation of crabserver

But why do we get http 500 instead of a simple 'None' as webdir ?

Maybe something to be improved in REST ?

REST log has

ERROR, json.dumps failed to serialize None, type <class 'NoneType'>
Exception: ExecutionError 518eb4faf0ae63c2b55d8d6167a244f6 [HTTP 500, APP 403, MSG 'Execution error', INFO 'Webdir not set in the database. Cannot build proxied webdir', ERR None]

Which comes from https://github.com/dmwm/CRABServer/blob/483adb22e60031b98fb5c157b6a0b11e09fc78d7/src/python/CRABInterface/RESTTask.py#L218-L224

Let's address this in a ad-hoc issue for the REST

belforte commented 1 day ago

All OK now !

Last bit of cosmetics could be to try "new" way first in crab preparelocal to avoid annoying message Failed to download 'InputFiles.tar.gz' from WEB_DIR. Will try to get tarballs from S3 which now will be printed on all new tasks until a new client is deployed.

Well can simply remove the msg, but all in all try new way first is better and will make it more clear which code to remove later on.

belforte commented 1 day ago

test new client as per Client Pull Request PR 5347after changing to use "new way first"

client	TW 241017 (prod )	TW 241018 (preprod)	PR 8767
PR 5347 submit	OK	OK	OK
PR 5347 preparelocal	OK (FAIL [nowb] if not bootstrapped)	OK	OK
PR 5347 dryrun	FAIL [nowb]	FAIL [nowb]	OK

[nowb] : Reason is: Webdir not set in the database. Cannot build proxied webdir

belforte commented 1 day ago

and of course the status for TW v3.241018 is only for information during this development, we are not going to deploy that, we'll go straight for a new tag with the fix from #8767

belforte commented 16 hours ago

closed via #8767 and https://github.com/dmwm/CRABClient/pull/5347

dmwm / CRABServer

make preparelocal use S3 for tarball and unify with --dryrun #6544