DIRACGrid / DIRAC


Problem in InputDataResolution #409

Closed phicharp closed 12 years ago

phicharp commented 12 years ago

As a result of investigations on the crashes of merging jobs that have one file present in two SEs at a site (-DST and -FAILOVER), I found out that the JobWrapper is not looking for active replicas. In addition there is a fix to be made in InputDataResolution in the case a file is on two disk SEs at the same site.

The InputDataResolution class is called with the replicas already resolved in arg['FileCatalog']. This resolution is done in... JobWrapper.py (!):

def __getReplicaMetadata( self, lfns ):
  """ Wrapper function to consult catalog for all necessary file metadata
      and check the result.
  """
  start = time.time()
  repsResult = self.rm.getReplicas( lfns )

This is incorrect: it should be getActiveReplicas(). Then the logic is thrown off by the fact that there are two replicas on disk at the same site, which is clearly not foreseen, and the bug is at line 142 of InputDataByProtocol:

for se, lfnDict in seFilesDict.items():
  pfnList = lfnDict.values()
  result = self.rm.getStorageFileAccessUrl( pfnList, se, protocol = requestedProtocol )

where, for se='xxx-FAILOVER', lfnDict is empty, so getStorageFileAccessUrl is called with an empty list. Protecting that call is however not sufficient, as the SE that is kept is essentially arbitrary (it is the one that holds the largest number of replicas of the input data...). If only active replicas are selected in the JobWrapper, an arbitrary selection is probably no worse than any other...
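For illustration, here is a minimal sketch of the two changes being asked for, paraphrasing the snippets above; it is not the actual patch, and only getActiveReplicas and getStorageFileAccessUrl are taken from the ReplicaManager calls already mentioned in this thread:

# In JobWrapper.__getReplicaMetadata: consult the catalog for active replicas only,
# so that banned or otherwise unusable SEs never reach the input data resolution.
repsResult = self.rm.getActiveReplicas( lfns )  # instead of self.rm.getReplicas( lfns )

# In InputDataByProtocol: never call getStorageFileAccessUrl with an empty list.
for se, lfnDict in seFilesDict.items():
  if not lfnDict:
    continue
  pfnList = lfnDict.values()
  result = self.rm.getStorageFileAccessUrl( pfnList, se, protocol = requestedProtocol )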

atsareg commented 12 years ago

Any progress with this issue ?

atsareg commented 12 years ago

From Ricardo: I did some checks on this, but I could not see the problem that Philippe reports. FAILOVER replicas are active replicas, just like any other replicas. The only thing that can happen is that a job is scheduled based on the presence of a replica at a FAILOVER SE, and when the job arrives at the WN the replica is no longer there. But the same thing will happen to any job that is scheduled and whose input-data replicas are removed afterwards: it might fail, and it will be rescheduled.

phicharp commented 12 years ago

Obviously getActiveReplicas should be used in order to avoid getting a file from a banned or ARCHIVE SE. This seems obvious, right? After 2 months I don't fully recall the details of the debugging I did, but for sure when a file was available at two SEs (-DST and -FAILOVER) at the same site, the FAILOVER replica was not properly handled and pfnList was empty. Apologies for not being able to give more information now... but you can of course do the debugging yourself: when we are in this situation, all jobs fail!

phicharp commented 12 years ago

Why did you close the issue? Up to you but this is a real issue that should be fixed...

graciani commented 12 years ago

Hi Philippe, jobs will fail if they are scheduled to a site based on the presence of a replica in a given SE and the file is no longer there when the job starts to execute. This has nothing to do with FAILOVER or Active SEs. It is just that replicas are resolved in the WMS when the job is submitted, and nothing guarantees that the files are still there when the job executes.

A different issue is that when the InputDataResolution is done, only Active Replicas should be considered. But this has nothing to do with JobWrapper or InputDataResolution.

Finally, I will check that the ReplicaManager is protected against an empty list in the argument.

phicharp commented 12 years ago

I feel guilty for not having put down all the information I had at the time, but the problem was obviously not that one. That would be a normal case... If you want I can redo the debugging analysis and add more explanation. It is a pity that no one looked at this issue for 2 months...


graciani commented 12 years ago

I did look and could not see any problem related to the fact that there are 2 replicas at the site, unless replicas are removed. Active replicas are only meaningful to query in the Optimizer, not at the level of the JobWrapper; at that point it is already too late. The only remaining problem I could see was the empty list of LFNs causing an exception in the ReplicaManager.

phicharp commented 12 years ago

OK, I have now found the problem again... The first one is that only active replicas should be used; this is a minor problem, but it should nevertheless be fixed in the JobWrapper.

Now, the most worrying one is when an LFN is available at 2 disk SEs: for example, 3 LFNs are at DST and one of them is also at FAILOVER. Following the logic in this piece of code:

trackLFNs = {}
for lfns, se in reversed( sortedSEs ):
  for lfn, pfn in seFilesDict[se].items():
    if lfn not in trackLFNs:
      if 'Size' in replicas[lfn] and 'GUID' in replicas[lfn]:
        trackLFNs[lfn] = { 'pfn': pfn, 'se': se, 'size': replicas[lfn]['Size'], 'guid': replicas[lfn]['GUID'] }
    else:
      # Remove the lfn from those SEs with less lfns
      del seFilesDict[se][lfn]

self.log.verbose( 'Files grouped by LocalSE are:' )
self.log.verbose( seFilesDict )
for se, pfnList in seFilesDict.items():
  seTotal = len( pfnList )
  self.log.info( ' %s SURLs found from catalog for LocalSE %s' % ( seTotal, se ) )
  for pfn in pfnList:
    self.log.info( '%s %s' % ( se, pfn ) )

#Can now start to obtain TURLs for files grouped by localSE
#for requested input data
requestedProtocol = self.configuration.get( 'Protocol', '' )
for se, lfnDict in seFilesDict.items():
  pfnList = lfnDict.values()
  result = self.rm.getStorageFileAccessUrl( pfnList, se, protocol = requestedProtocol )

After the first loop, seFilesDict looks like this:

seFilesDict = { "DST" : { lfna:pfna, lfnb:pfnb }, "FAILOVER" : {} }

because lfnb, which was in FAILOVER, has been removed.

Then self.rm.getStorageFileAccessUrl is called with pfnList == [] because lfnDict == {} , and this crashes the job.
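
To make the failure mode concrete, here is a small self-contained reproduction of the pruning step (the LFN, PFN and SE names are invented, and the ReplicaManager is not needed):

# Stand-alone reproduction of the replica pruning described above; all names are invented.
replicas = { 'lfna' : { 'Size' : 1, 'GUID' : 'A' },
             'lfnb' : { 'Size' : 2, 'GUID' : 'B' },
             'lfnc' : { 'Size' : 3, 'GUID' : 'C' } }
seFilesDict = { 'DST'      : { 'lfna' : 'pfna', 'lfnb' : 'pfnb', 'lfnc' : 'pfnc' },
                'FAILOVER' : { 'lfnb' : 'pfnb_failover' } }

# SEs ordered by how many LFNs they hold, as in InputDataByProtocol
sortedSEs = sorted( ( len( lfns ), se ) for se, lfns in seFilesDict.items() )

trackLFNs = {}
for _nLfns, se in reversed( sortedSEs ):
  for lfn, pfn in list( seFilesDict[se].items() ):
    if lfn not in trackLFNs:
      trackLFNs[lfn] = { 'pfn' : pfn, 'se' : se,
                         'size' : replicas[lfn]['Size'], 'guid' : replicas[lfn]['GUID'] }
    else:
      del seFilesDict[se][lfn]

print( seFilesDict )
# DST keeps all three LFNs, FAILOVER is left with an empty dict, and the
# subsequent TURL loop would call getStorageFileAccessUrl with an empty pfnList.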

I hope it is now clear that one must protect the call, for example with

if not lfnDict: continue

or, alternatively, at the end of the first loop one can check whether seFilesDict[se] is empty and delete it if it is (see the sketch below).
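
For illustration only, the second option could look like this (a sketch, not the fix that was actually merged):

# Prune SEs left with no LFNs once the selection loop is done,
# so the TURL loop never sees an empty lfnDict.
for se in list( seFilesDict ):
  if not seFilesDict[se]:
    del seFilesDict[se]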

graciani commented 12 years ago

Will be fixed by #601