wrong SURL in bl_runningJob if local_stage_out is set

ericvaandering commented 10 years ago

Original Savannah ticket 73851 reported by fanzago on Mon Oct 11 06:21:20 2010.

Hi, if the local stage-out backup system (local_stage_out) is set, the SURL saved in the BOSS DB table bl_runningJob as storage is incomplete: it contains only the path and lacks the SE. As a result, copying the output from the temporary directory to any destination fails.

It can be fixed by changing pfns.append(os.path.dirname(f['PFN'])+'/')

to pfns.append("srm://"+f['SEName']+":"+os.path.dirname(f['PFN'])+'/')

in lines 263 and 276 of CopyData.py
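
As a minimal sketch of the proposed change (not the actual CopyData.py code), the helper below builds the directory SURL in the same way. The function name storage_dir_surl, the host name and the path are made up for illustration; only the 'SEName' and 'PFN' keys come from the line above.

```python
import os


def storage_dir_surl(file_info):
    """Return the SURL of the directory that holds one output file.

    file_info is assumed to be a dict with 'SEName' and 'PFN' keys,
    mirroring the f['SEName'] and f['PFN'] lookups in the proposed fix.
    """
    # Keep only the directory part of the PFN and prefix it with the SE name.
    return "srm://" + file_info['SEName'] + ":" + os.path.dirname(file_info['PFN']) + '/'


# Hypothetical entry: host and path are placeholders, not real site values.
example = {'SEName': 'se.example.org',
           'PFN': '/pnfs/example.org/data/user/out_1.root'}
print(storage_dir_surl(example))
```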

cheers, stefano

ericvaandering commented 10 years ago

Comment by fanzago on Mon Oct 11 07:48:47 2010

Hi Stefano, as far as I know there is a problem with local_stage_out and CopyData only if the protocol used for the local copy was hadoop or rfio; otherwise the endpoint stored in the bossDB and in the fjr should be correct. Which protocol is used for your local stage-out? Federica

ericvaandering commented 10 years ago

Comment by fanzago on Thu Oct 14 12:11:26 2010

The problem with a fallback copy that uses a "local" protocol such as rfio or hadoop is due to the endpoint reported in the fjr (in the pfn tag) and in the bossDB (in the storage field). It doesn't contain the storage name and protocol info but only the PFN, because the endpoint used for the copy from the WN to the fallback SE is simply the physical path of the file. So copyData doesn't work, because the info reported in the bossDB is the PFN without any info about the SE name.

The fix can be applied at getoutput level (as you suggested), but only if the stored PFN isn't already an endpoint (that depends on the protocol used).

Or we can replace the PFN with the endpoint (with SE name and protocol) during ModifyJobReport, while the job is running on the WN.
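
The "only if the stored PFN isn't already an endpoint" condition could be sketched roughly as follows. This is an illustration under assumptions, not code from CRAB: the helper names, host and paths are hypothetical, and whether the path part is valid for SRM access is a separate question.

```python
import os
from urllib.parse import urlparse


def is_endpoint(pfn):
    """True if the PFN already carries a protocol and a host, e.g. srm://host/..."""
    parsed = urlparse(pfn)
    return bool(parsed.scheme) and bool(parsed.netloc)


def storage_dir(pfn, se_name):
    """Return the storage directory, adding the SE only when it is missing."""
    directory = os.path.dirname(pfn) + '/'
    if is_endpoint(pfn):
        # Already a full endpoint: leave it untouched.
        return directory
    # Bare physical path (as stored after an rfio/hadoop fallback copy).
    return "srm://" + se_name + ":" + directory


# Hypothetical values, for illustration only.
print(storage_dir('/hadoop/store/user/out_1.root', 'se.example.org'))
print(storage_dir('srm://se.example.org:8443/srm/v2/server?SFN=/store/user/out_1.root',
                  'se.example.org'))
```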

ericvaandering commented 10 years ago

Comment by fanzago on Mon Oct 18 12:17:30 2010

Hi all, the solution at getoutput level doesn't work, because the PFN reported in the fjr is the PFN for the local protocol, and it may not be correct for accessing the data with other protocols such as srm. That means we cannot "create" the endpoint starting from the SE name and the PFN reported in the fjr. The same problem applies during the Modify step.

One solution could be to query the trivialfilecatalog during the fallback copy in order to get the endpoint for the srmv2 protocol, and to add this value to the fjr. This needs some changes in the code (fallback, ModifyJobReport, getoutput).
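
To make the idea concrete, here is a simplified sketch of such a lookup. It assumes the catalog is an XML file with lfn-to-pfn rules carrying protocol, path-match and result attributes, and it ignores rule chaining and other details a real catalog may use. The XML content, host name and function name are invented for the example; this is not the code used by CRAB.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified catalog with one rule per protocol.
TFC_XML = """
<storage-mapping>
  <lfn-to-pfn protocol="direct" path-match="/+store/(.*)"
              result="/hadoop/store/$1"/>
  <lfn-to-pfn protocol="srmv2" path-match="/+store/(.*)"
              result="srm://se.example.org:8443/srm/v2/server?SFN=/hadoop/store/$1"/>
</storage-mapping>
"""


def lfn_to_pfn(tfc_xml, protocol, lfn):
    """Return the PFN for lfn under the first matching rule of protocol, or None."""
    root = ET.fromstring(tfc_xml)
    for rule in root.findall('lfn-to-pfn'):
        if rule.get('protocol') != protocol:
            continue
        match = re.match(rule.get('path-match'), lfn)
        if match:
            result = rule.get('result')
            # Substitute $1, $2, ... with the captured groups.
            for index, group in enumerate(match.groups(), start=1):
                result = result.replace('$%d' % index, group)
            return result
    return None


lfn = '/store/user/someone/out_1.root'
print(lfn_to_pfn(TFC_XML, 'direct', lfn))  # local physical path
print(lfn_to_pfn(TFC_XML, 'srmv2', lfn))   # endpoint usable for grid access
```

The two results differ in more than a prefix, which is why the grid endpoint cannot always be rebuilt from the SE name and the local PFN alone.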

Alternatively, the code that copies remote files to the local UI directory could be added to the retry_stageout script distributed with CRAB. That script copies files from the fallback SE to the official one, so it already knows the correct fallback URL.

Federica

ericvaandering commented 10 years ago

Comment by fanzago on Thu Oct 21 06:19:57 2010

My solution: I added code to the retry_stageout script that copies the fallback output to the local UI, because this script already provides the correct endpoint of the fallback site.

So you can use the command: $CRABPATH/retry_stageout.py -c your_crab_dir --copy-to-ui (or -l)

and the outputs of jobs that ended with exit code 60308 will be copied to crab_dir/res

ericvaandering commented 10 years ago

Comment by slacapra on Thu Oct 21 06:48:20 2010

I'd like to issue the same command (crab -copyData) no matter which local backup protocol was used, rather than using a different script with a different syntax.

So, I would propose one of the following:

1) on the WN, save in the fjr not only the PFN for the protocol actually used, but also the PFN for grid access. Then, at client level, use the latter PFN inside copyData

2) from the client, query the trivialfilecatalog of the SE (if this is possible, of course), and build the correct SURL from that.

Personally, I don't much like 2), since I'd rather have all the information needed to retrieve my output locally, but it's probably just a matter of taste.

Stefano

ericvaandering commented 10 years ago

Comment by fanzago on Thu Oct 21 08:01:43 2010

Yes, I understand, but making copyData work for the hadoop and rfio protocols as well requires a lot of changes. Your first solution implies adding a tag to the fjr and querying the trivialFileCatalog for the srmv2 protocol, which is not always declared under that name (it could be written srm-lcg, srmv2, srm), so getting the correct protocol at WN level requires a lot of changes in fallback. Then we have to change the Modify script and the FJR API to add the new tag, and we also have to modify getoutput to replace the pfn with the endpoint, but only when the middleware is glite or osg.
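
The naming issue could be handled along these lines. This is only a sketch: the three names come from the comment above, while the preference order and the function name are assumptions.

```python
# Names under which the grid protocol may be declared in a site catalog.
# The order of preference here is an assumption, not taken from CRAB.
GRID_PROTOCOL_NAMES = ('srmv2', 'srm-lcg', 'srm')


def pick_grid_protocol(declared_protocols, candidates=GRID_PROTOCOL_NAMES):
    """Return the first candidate name the site declares, or None if there is none."""
    for name in candidates:
        if name in declared_protocols:
            return name
    return None


# Hypothetical sets of protocols declared by two sites.
print(pick_grid_protocol({'direct', 'srm-lcg'}))
print(pick_grid_protocol({'direct', 'rfio'}))
```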

The second solution can be implemented only as done in retry_stageout, because the trivialfilecatalog cannot be queried from the UI. For this reason I added the code directly to retry_stageout, which is also the script that copies fallback files from the fallback storage to the "official storage" chosen by the user in the crab.cfg file and updates the fjr accordingly.

I will try to put in place all the steps needed for the first solution, but it will not happen in the short term.

Federica

ericvaandering commented 10 years ago

Comment by fanzago on Mon Nov 8 04:42:50 2010

The first solution is implemented. A new tag "surlforgrid" has been added to the fjr. It contains the correct grid SURL of the file in the case of a fallback copy with the hadoop or rfio protocol. This value will be stored in the bossDB as the PFN of "fallback" files, to allow copyData from the fallback SE to the local UI or to other remote storage elements. The code is in CVS. Modified files: CRAB/python/cmscp.py CRAB/python/GetOutput.py ProdCommon/FwkJobRep/FileInfo.py ProdCommon/FwkJobRep/ModifyJobReport.py
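
Roughly, the selection on the client side can be pictured like this. It is a sketch, not the code committed to CVS: the dict representation of a file entry and the function name are hypothetical; only the tag name "surlforgrid" comes from the description above.

```python
def pfn_for_bossdb(file_entry):
    """Pick the value to store as PFN: the grid SURL if present, else the plain PFN."""
    # 'surlforgrid' is expected only after a fallback copy with hadoop or rfio.
    surl = file_entry.get('surlforgrid')
    return surl if surl else file_entry['PFN']


# Hypothetical entries; hosts and paths are placeholders.
fallback_entry = {
    'PFN': '/hadoop/store/user/out_1.root',
    'surlforgrid': 'srm://se.example.org:8443/srm/v2/server?SFN=/hadoop/store/user/out_1.root',
}
regular_entry = {'PFN': 'srm://se.example.org:8443/srm/v2/server?SFN=/store/user/out_2.root'}

print(pfn_for_bossdb(fallback_entry))
print(pfn_for_bossdb(regular_entry))
```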

Cheers Federica

ericvaandering commented 10 years ago

Closed by spiga on Wed Jan 26 12:41:35 2011

dmwm / CRAB2

wrong SURL in bl_runningJob if local_stage_out is set #635