dmwm / CRABServer


Some changes on DryRunUploader #4915

Open mmascher opened 9 years ago

mmascher commented 9 years ago

I mentioned I wanted to make DryRunUploader mandatory so we could check users' tasks in an easier way.

Thinking this through, it might not be as easy as I thought:

1) The dryrun archive should have a name that depends on the task name: https://github.com/dmwm/CRABServer/blob/master/src/python/TaskWorker/Actions/DryRunUploader.py#L49 so we can choose which task's dryrun tarball we want to retrieve. The name should probably even depend on the sandbox hash.

2) That means crab purge must be aware of this and must clean the dryrun sandbox as well, otherwise people will not be able to free their space on the crabcache.

3) Maybe this failure: https://github.com/dmwm/CRABServer/blob/master/src/python/TaskWorker/Actions/DryRunUploader.py#L51-L52 is not so critical, and we should not raise TaskWorkerException if the user did not select dryrun (minor and questionable).
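To illustrate point 1), a minimal sketch of an archive name that depends on both the task name and the sandbox hash, so each task's dryrun tarball can be located (and purged) independently. The function name and the naming scheme here are made up for illustration; they are not what DryRunUploader.py does.

```python
import hashlib

def dryrun_tarball_name(taskname, sandbox_hash):
    # Make the archive name depend on both the task name and the
    # sandbox hash; truncate the hash to keep the file name short.
    # Hypothetical naming scheme, not the one used by the server.
    return "dry-run-sandbox-%s-%s.tar.gz" % (taskname, sandbox_hash[:12])

# A made-up task name and a hash over the sandbox content.
digest = hashlib.sha1(b"sandbox bytes").hexdigest()
print(dryrun_tarball_name("150804_120000_user_mytask", digest))
```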

All in all I think I will revert this change https://github.com/dmwm/CRABServer/commit/fd1cf3e5c36ef0af5bc78b7acf788da6c36d71de#diff-ff551d45cfbad36f9969f573a4176a2dL148 until this issue is addressed.

belforte commented 9 years ago

Sort of back to... all those sandboxes are in the schedds' webdirs, from where they can be retrieved with a simple wget/curl, or downloaded from the web via glidemon or dashboard; cleanup is in place, etc. Why do we go through UFC? I surely keep missing things, but if what's in the webdir is enough for HTCondor to ship around and run jobs on WNs, what else do we need but a (tricky, but hopefully not so much) script to pretend an lxplus node is such a WN? (When I say lxplus, of course I mean anything which can have a similar environment, from the FNAL LPC to my desktop.) Did we go through UFC only because the proxy through cmsweb was not there yet?
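A rough sketch of what fetching a sandbox straight from the schedd's webdir could look like. The host name, path layout, and file name below are invented for illustration, and the proxy-authenticated download itself is shown only as a comment.

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote        # Python 2

def webdir_sandbox_url(webdir, taskname, filename="sandbox.tar.gz"):
    # Compose the URL of a task file under the schedd's webdir.
    # Percent-encode the task name since it may contain ':' etc.
    return "%s/%s/%s" % (webdir.rstrip("/"), quote(taskname, safe=""), filename)

url = webdir_sandbox_url("https://schedd.example.cern.ch/~condor/taskdir",
                         "150804_120000:user_mytask")
# The actual download would then be a proxy-authenticated GET, e.g.:
#   curl --cert $X509_USER_PROXY --key $X509_USER_PROXY -O "$url"
print(url)
```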

mmascher commented 9 years ago

I am sure I was the one who told Anna to use the crabcache, but I do not remember the reason. It was probably a combination of 1) having a dry run sandbox ready without having to build it in the client (in the spirit of "let's do as much work as we can in the server"), and 2) the webdir not accessible from the client.

What you say makes sense indeed. What do you think @annawoodard ?

belforte commented 9 years ago

1) is surely still true. And even using a standard portal instead of going to the schedds' disks may be good, but it is important to know why we do things. Hmm.. storing sandboxes in the UFC may give us a way to move tasks from one schedd to another in case of hardware failures. But the cache size may be an issue. Besides, how safe is that cache? OTOH this becomes more appealing once we recycle sandboxes in "multicrab". If the benefits are large enough, we can suffer the price of failing submission when the upload to UFC fails. Anyhow, let's stick with what is simpler now and make sure we think through the implications of future changes before coding them.


mmascher commented 9 years ago

Ah, there was a third (and probably the main) reason why the dryrun sandbox is uploaded to the crabcache instead of being put on the schedd: the dryrun does not submit a task to the schedd!

matz-e commented 8 years ago

Together with @belforte's note on stepping back a bit… I was looking at #4912 and #5125, and I am wondering if we couldn't rework this a little more. We basically parse the DAG and JobAd files to get the right parameters to call the job run script. Couldn't we instead put this break right after DataDiscovery?

Before:

DataDiscovery -> Splitter -> DagmanCreator -> DryrunUploader (-> parsing on client -> execution)

After:

DataDiscovery -> DryrunUploader (-> unpacking on client -> execution)

For very fine splitting, the current setup will respect the lumi-mask (that's how I read the code at least), and would not even attempt to read more than the files passed along. By intercepting after data discovery, we could make up our own generous lumi-mask and input file list, and could even change the splitting settings after the dry run completes. I think that most of the other parameters can also be passed back to the client without going through the JobAd/DAG files.

mmascher commented 8 years ago

Hi @matz-e, not sure I got everything you said, and what I am gonna say is related. IIRC the dryrun on the client is parsing the output of this line (I don't remember exactly how it works, but it is taking it from 'Job.submit' or 'RunJobs.dag'): https://github.com/dmwm/CRABServer/blob/master/src/python/TaskWorker/Actions/DagmanCreator.py#L127

What would be great, imho, is if we had a JSON file like job_argument.json that looks like:

{
  "1" : {
    "-a" : "sandbox.tar.gz",
    "--sourceURL" : "https://cmsweb.cern.ch/crabcache",
    "--jobNumber" : "1",
    "--cmsswVersion" : "CMSSW_7_4_7",
    "--scramArch" : "slc6_amd64_gcc491",
    "--inputFile" : "job_input_file_list_1.txt",
    "--runAndLumis" : "job_lumis_1.json",
    "--lheInputFiles" : "False",
    "--firstEvent" : "1",
    "--firstLumi" : "1",
    "--lastEvent" : "2",
    "--firstRun" : "1",
    "--seeding" : "AutomaticSeeding",
    "--scriptExe" : "crab_SVFitCentralFull.sh",
    "--eventsPerLumi" : "100",
    "--scriptArgs" : "[]",
    "-o" : "{}"
  },

  "2" : {
    "..." : "..."
  }
}

This would avoid having to parse the output of DagmanCreator (which could maybe also use this file). The archives containing job_lumis_1.json and job_input_file_list_1.txt should also be uploaded to the crabcache. Another consequence of this is that users could use this file (and the other files) to execute the whole task on another batch system (e.g. on LSF).
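A hypothetical client-side consumer of such a file, just to show the idea: read the per-job arguments back and assemble the command line for the job wrapper. The wrapper name and the file layout are assumptions for this sketch, not the current CRAB code.

```python
import json

def build_job_command(argument_file, job_id, wrapper="CMSRunAnalysis.py"):
    # Read the per-job arguments from a job_argument.json-style file
    # and assemble the wrapper command line. The wrapper name is just
    # a placeholder here.
    with open(argument_file) as handle:
        per_job = json.load(handle)[str(job_id)]
    command = [wrapper]
    for flag in sorted(per_job):  # stable order, for readability only
        command.extend([flag, per_job[flag]])
    return command

# Write a small two-flag sample file and rebuild job 1's command line.
with open("job_argument.json", "w") as handle:
    json.dump({"1": {"--jobNumber": "1", "-a": "sandbox.tar.gz"}}, handle)
print(build_job_command("job_argument.json", 1))
```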

matz-e commented 8 years ago

I'm still saying we should move the dryrun to before splitting. The lumi-mask and the file-list are just getting in the way. Just for my own curiosity, I tried the dry-run with EventAwareLumiBased splitting with 50 events, and got pretty nonsensical results. I assume that in this case the lumi-mask is limiting the dry-run to only ever look at one lumi?

Getting most of those settings to the Client should not have to wait until after the DAG creation… this way, one could also let the user change the units per job, since the splitting hasn't been run yet – which would avoid unnecessary overhead (deleting the current crab area/changing the splitting in the config) on the user side.

matz-e commented 8 years ago

I reworked the parsing in the PR to the CRABClient, but I would like to see it work better still. Maybe I should try to code up a PR to show the changes I had in mind?

mmascher commented 8 years ago

Probably what I have in mind is a different use case compared to the dryrun (a use case with a similar requirement of being able to isolate what we need to execute a job: files + arguments).

So yes, go ahead!