phicharp closed this issue 12 years ago
Any progress with this issue?
From Ricardo: I did some checks on this, but I could not see the problem that Philippe reports. FAILOVER replicas are active replicas, just like any other replicas. The only thing that can happen is that a job is scheduled based on the presence of a replica at a FAILOVER SE, and when the job arrives at the WN, the replica is no longer there. But the same thing will happen to any job whose input data replicas are removed after it has been scheduled: it might fail, and it will be rescheduled.
Obviously getActiveReplicas should be used in order to avoid getting a file from a banned or ARCHIVE SE. This seems obvious, right? After 2 months I don't fully recall the details of the debugging I did, but for sure when a file was available at two SEs at the site (-DST and -FAILOVER), the FAILOVER replica was not properly handled and pfnList was empty. Apologies for not being able to give more information now... but you can of course do the debugging yourself: when we are in this situation, all jobs fail!
Why did you close the issue? Up to you, but this is a real issue that should be fixed...
Hi Philippe, jobs will fail if they are scheduled to a site based on the presence of a replica at a given SE and, by the time the job starts to execute, the file is no longer there. This has nothing to do with FAILOVER or Active SEs. Replicas are simply resolved in the WMS when the job is submitted, and nothing guarantees that the files are still there when the job executes.
A different issue is that when the InputDataResolution is done, only Active Replicas should be considered. But this has nothing to do with JobWrapper or InputDataResolution.
Finally, I will check that the ReplicaManager is protected against an empty list in the argument.
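Such a guard could look like the following; this is a minimal sketch only, with the S_OK/S_ERROR helpers inlined and the actual URL resolution stubbed out, so the real ReplicaManager method body and return structure may well differ:

```python
# Hypothetical sketch of protecting an SE access call against an empty
# input list. Names follow DIRAC conventions (S_OK/S_ERROR result dicts),
# but this is NOT the actual ReplicaManager implementation.

def S_OK(value=None):
    return {'OK': True, 'Value': value}

def S_ERROR(message):
    return {'OK': False, 'Message': message}

def getStorageFileAccessUrl(pfnList, se, protocol=''):
    if not pfnList:
        # Returning an error (or an empty result) instead of handing an
        # empty list to the storage plugin avoids the crash discussed here.
        return S_ERROR('No PFNs supplied for SE %s' % se)
    # ... real resolution of PFNs to access URLs would go here ...
    return S_OK({'Successful': dict((pfn, 'turl-for-%s' % pfn) for pfn in pfnList),
                 'Failed': {}})
```

With this guard, a caller that accidentally passes an empty list gets a clean S_ERROR back instead of an exception.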
I feel guilty for not having put down all the information I had at the time, but the problem was obviously not that one; that would be the normal case... If you want, I can redo the debugging analysis and add more explanation. It is a pity that no one looked at this issue for 2 months...
I did look and could not see any problem related to the fact that there are 2 replicas at the site, unless replicas are removed. Active replicas are only meaningful to query in the Optimizer, not at the level of the JobWrapper; at that point it is already too late. The only remaining problem I could see was the empty list of LFNs causing an exception in the ReplicaManager.
OK, I have tracked down the problem now... The first issue is that only active replicas should be used; this is a minor problem, but it should nevertheless be fixed in the JobWrapper.
Now, the most worrying one is when an LFN is available at 2 disk SEs; for example, 3 LFNs are at DST and one is at both DST and FAILOVER. Follow the logic in this piece of code:
```python
trackLFNs = {}
for lfns, se in reversed( sortedSEs ):
  for lfn, pfn in seFilesDict[se].items():
    if lfn not in trackLFNs:
      if 'Size' in replicas[lfn] and 'GUID' in replicas[lfn]:
        trackLFNs[lfn] = { 'pfn': pfn, 'se': se,
                           'size': replicas[lfn]['Size'],
                           'guid': replicas[lfn]['GUID'] }
    else:
      # Remove the lfn from those SEs with less lfns
      del seFilesDict[se][lfn]

self.log.verbose( 'Files grouped by LocalSE are:' )
self.log.verbose( seFilesDict )
for se, pfnList in seFilesDict.items():
  seTotal = len( pfnList )
  self.log.info( ' %s SURLs found from catalog for LocalSE %s' % ( seTotal, se ) )
  for pfn in pfnList:
    self.log.info( '%s %s' % ( se, pfn ) )

# Can now start to obtain TURLs for files grouped by localSE
# for requested input data
requestedProtocol = self.configuration.get( 'Protocol', '' )
for se, lfnDict in seFilesDict.items():
  pfnList = lfnDict.values()
  result = self.rm.getStorageFileAccessUrl( pfnList, se, protocol = requestedProtocol )
```
After the first loop, seFilesDict looks like this:

```python
seFilesDict = { "DST" : { lfna : pfna, lfnb : pfnb }, "FAILOVER" : {} }
```

because lfnb, which was also at FAILOVER, has been removed from there.
Then self.rm.getStorageFileAccessUrl is called with pfnList == [] because lfnDict == {}, and this crashes the job.
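The behaviour can be reproduced standalone. The following sketch mimics the grouping loop above with two made-up LFNs (lfna only at DST, lfnb at both DST and FAILOVER); the sortedSEs, replicas and seFilesDict structures are hand-built for the example, assuming sortedSEs is ordered by ascending number of LFNs per SE:

```python
# Standalone reproduction of the grouping loop: the FAILOVER entry ends
# up empty because its only LFN is already tracked at DST.
replicas = {
    'lfna': {'Size': 1, 'GUID': 'A'},
    'lfnb': {'Size': 2, 'GUID': 'B'},
}
seFilesDict = {
    'DST': {'lfna': 'pfna', 'lfnb': 'pfnb'},
    'FAILOVER': {'lfnb': 'pfnb-failover'},
}
sortedSEs = [(1, 'FAILOVER'), (2, 'DST')]  # ascending by replica count

trackLFNs = {}
for lfns, se in reversed(sortedSEs):  # DST (most LFNs) is visited first
    for lfn, pfn in list(seFilesDict[se].items()):  # copy: we delete while iterating
        if lfn not in trackLFNs:
            if 'Size' in replicas[lfn] and 'GUID' in replicas[lfn]:
                trackLFNs[lfn] = {'pfn': pfn, 'se': se,
                                  'size': replicas[lfn]['Size'],
                                  'guid': replicas[lfn]['GUID']}
        else:
            # lfnb is already tracked at DST, so it is dropped from FAILOVER
            del seFilesDict[se][lfn]

print(seFilesDict)  # {'DST': {'lfna': 'pfna', 'lfnb': 'pfnb'}, 'FAILOVER': {}}
```

The FAILOVER dictionary is now empty, which is exactly the state that feeds the empty pfnList into getStorageFileAccessUrl.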
I hope it is now clear that one must protect this, for example with:

```python
if not lfnDict:
  continue
```

Alternatively, at the end of the first loop one can check whether seFilesDict[se] is empty, and delete the entry if it is.
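A minimal sketch of that protection in the TURL loop follows; seFilesDict is hard-coded to the post-loop state above, and the self.rm call is replaced by a stand-in, since the real method needs a storage backend:

```python
# Sketch only: skip SEs whose lfnDict ended up empty, instead of passing
# an empty pfnList to getStorageFileAccessUrl.
seFilesDict = {'DST': {'lfna': 'pfna', 'lfnb': 'pfnb'}, 'FAILOVER': {}}

resolved = {}
for se, lfnDict in seFilesDict.items():
    if not lfnDict:
        continue  # all of this SE's LFNs were claimed by another SE
    pfnList = list(lfnDict.values())
    # result = self.rm.getStorageFileAccessUrl(pfnList, se, protocol=requestedProtocol)
    resolved[se] = pfnList  # stand-in for the real TURL resolution

print(resolved)  # only 'DST' remains; the empty 'FAILOVER' entry was skipped
```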
Will be fixed by #601
As a result of investigating the crashes of merging jobs that have one file present at two SEs at a site (-DST and -FAILOVER), I found out that the JobWrapper is not looking for active replicas. In addition, a fix is needed in InputDataResolution for the case where a file is at two disk SEs at the same site.
The InputDataResolution class is called with the replicas already resolved in arg['FileCatalog']. This resolution is done in... JobWrapper.py (!):
```python
def __getReplicaMetadata( self, lfns ):
  """ Wrapper function to consult catalog for all necessary file
      metadata and check the result.
  """
  start = time.time()
  repsResult = self.rm.getReplicas( lfns )
```
This is incorrect, as it should be getActiveReplicas(). Then the logic is broken by the fact that there are two replicas on disk at the site, which is clearly not foreseen, and the bug is at line 142 of InputDataByProtocol:
where, for se='xxx-FAILOVER', lfnDict is empty, and therefore getStorageFileAccessUrl is called with an empty list. Protecting against this is however not sufficient, as the SE that is kept is essentially random (it is the one that has the largest number of replicas of the input data...). If only active replicas are selected in the JobWrapper, a random selection is probably no worse than any other...
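For illustration, "using only active replicas" amounts to filtering the catalogue result by SE status before any resolution decision. getActiveReplicas is the real ReplicaManager method for this; the small helper and the status table below are made up for the sketch and do not reflect its actual internals:

```python
# Hypothetical sketch of active-replica filtering: drop replicas whose SE
# is not in 'Active' status (e.g. banned or ARCHIVE), per LFN.

def filter_active_replicas(replicas, se_status):
    """Keep only replicas whose SE status is 'Active'.

    replicas:  {lfn: {se: pfn}}   (shape of a getReplicas 'Successful' result)
    se_status: {se: status string} (made-up status table for the example)
    """
    active = {}
    for lfn, se_pfn in replicas.items():
        kept = dict((se, pfn) for se, pfn in se_pfn.items()
                    if se_status.get(se) == 'Active')
        if kept:  # drop LFNs with no active replica at all
            active[lfn] = kept
    return active

replicas = {'lfn1': {'CERN-DST': 'pfn1', 'CERN-ARCHIVE': 'pfn1a'}}
se_status = {'CERN-DST': 'Active', 'CERN-ARCHIVE': 'Banned'}
print(filter_active_replicas(replicas, se_status))
# {'lfn1': {'CERN-DST': 'pfn1'}}
```

Had the JobWrapper applied such a filter, the ARCHIVE and banned SEs would never reach the InputDataByProtocol grouping in the first place.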