dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

MSUnmerged: Unhandled gfal2 exception can interrupt RSE cleanup and jump to the next one #11793

Closed todor-ivanov closed 1 year ago

todor-ivanov commented 1 year ago

Impact of the bug: MSUnmerged

Describe the bug: I am not yet sure whether this is a real bug.

While redeploying the latest version of MSUnmerged I noticed one exception, [1], which was not supposed to be raised. It is an HTTP error raised by gfal2, related to a missing file or directory. If I am interpreting this error correctly, it comes from this line:

https://github.com/dmwm/WMCore/blob/21342f5af95dfaa6af30aa92a714739799ed1160/src/python/WMCore/MicroService/MSUnmerged/MSUnmerged.py#L456

It is raised while the service is trying to list a directory that is missing. The problem is that, just a few lines above, we have already called stat on this very same entry in order to figure out whether it is a file or a directory:

https://github.com/dmwm/WMCore/blob/21342f5af95dfaa6af30aa92a714739799ed1160/src/python/WMCore/MicroService/MSUnmerged/MSUnmerged.py#L439

That stat operation is already enclosed in a try/except block, and gfal2 did not raise any exception there (neither a gfal one nor a generic one).

As a result, the error is caught only at the pipeline level and is treated as a general pipeline error at this line: https://github.com/dmwm/WMCore/blob/21342f5af95dfaa6af30aa92a714739799ed1160/src/python/WMCore/MicroService/MSUnmerged/MSUnmerged.py#L295

This causes the cycle to drop the current RSE and jump to the next one.
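To illustrate the control flow, here is a minimal sketch of a recursive purge in which listdir is guarded the same way stat is. It is not the actual MSUnmerged code: `FakeGError`, `FakeContext` and `purge_tree` are hypothetical stand-ins so the example runs without gfal2 or a remote storage endpoint. The only assumption taken from this issue is that the error object carries a numeric `code` attribute.

```python
import errno


class FakeGError(Exception):
    """Hypothetical stand-in for gfal2.GError: a message plus a numeric code."""

    def __init__(self, message, code):
        super().__init__(message)
        self.message = message
        self.code = code


class FakeContext:
    """Hypothetical stand-in for a gfal2 context, backed by an in-memory tree.

    Paths in `vanishing` look like directories to stat() but fail in
    listdir(), mimicking the situation described in this issue.
    """

    def __init__(self, tree, vanishing=()):
        self.tree = tree                  # {directory path: [entry names]}
        self.vanishing = set(vanishing)

    def stat(self, path):
        # Anything that is not a known (or vanishing) directory counts as a file.
        return "dir" if path in self.tree or path in self.vanishing else "file"

    def listdir(self, path):
        if path in self.vanishing or path not in self.tree:
            raise FakeGError("HTTP 404 : File not found ", errno.ENOENT)
        return list(self.tree[path])


def purge_tree(ctx, base):
    """Return True if everything under `base` was handled or was already gone."""
    try:
        entries = ctx.listdir(base)
    except FakeGError as exc:
        if exc.code == errno.ENOENT:
            # The directory is missing: there is nothing left to purge here,
            # so report success instead of aborting the whole RSE.
            return True
        raise  # any other error still propagates to the caller
    results = []
    for name in entries:
        child = base + "/" + name
        if ctx.stat(child) == "dir":
            results.append(purge_tree(ctx, child))
        else:
            results.append(True)  # a real implementation would delete the file here
    return all(results)


ctx = FakeContext({"/a": ["b", "f1"]}, vanishing={"/a/b"})
print(purge_tree(ctx, "/a"))
```

With the guard in place, a missing subdirectory only ends that branch of the recursion; errors with any other code are re-raised, so they would still reach the pipeline-level handler.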

How to reproduce it: No idea yet.

Expected behavior: A missing file or directory entry should raise an exception consistently for all remote operations (e.g. both stat and listdir).

Additional context and error message: [1]

2023-11-08 20:57:19,815:ERROR:MSUnmerged: plineUnmerged: General error from pipeline. RSE: T2_ES_CIEMAT. Error: HTTP 404 : File not found  Will retry again in the next cycle.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/WMCore/MicroService/MSUnmerged/MSUnmerged.py", line 284, in _execute
    pline.run(MSUnmergedRSE(rseName))
  File "/usr/local/lib/python3.8/site-packages/Utils/Pipeline.py", line 137, in run
    return reduce(lambda obj, functor: functor(obj), self.funcLine, obj)
  File "/usr/local/lib/python3.8/site-packages/Utils/Pipeline.py", line 137, in <lambda>
    return reduce(lambda obj, functor: functor(obj), self.funcLine, obj)
  File "/usr/local/lib/python3.8/site-packages/Utils/Pipeline.py", line 69, in __call__
    return self.run(obj)
  File "/usr/local/lib/python3.8/site-packages/Utils/Pipeline.py", line 72, in run
    return self.func(obj, *self.args, **self.kwargs)
  File "/usr/local/lib/python3.8/site-packages/WMCore/MicroService/MSUnmerged/MSUnmerged.py", line 401, in cleanRSE
    purgeSuccess = self._purgeTree(ctx, dirPfn)
  File "/usr/local/lib/python3.8/site-packages/WMCore/MicroService/MSUnmerged/MSUnmerged.py", line 471, in _purgeTree
    successList.append(self._purgeTree(ctx, dirEntryPfn))
  File "/usr/local/lib/python3.8/site-packages/WMCore/MicroService/MSUnmerged/MSUnmerged.py", line 456, in _purgeTree
    for dirEntry in ctx.listdir(baseDirPfn):
gfal2.GError: HTTP 404 : File not found 

todor-ivanov commented 1 year ago

In the course of extra tests done while working on this bug fix, I explicitly checked whether gfal interprets the two protocol-specific errors for missing files (from the WebDAV and SRMv2 protocols) in the same way, and whether it returns the same error code for both [1]. As [1] shows, the error code is always 2 and only the error message changes, so gfal is not simply propagating the HTTP 404 code in the exception. This is very good for us, because we do not have to add extra error handling for differences in the underlying protocol. Of course, this was the original justification for choosing gfal in the first place, instead of a plethora of different clients for every protocol we might find at a site.

One more thing I have noticed: the two protocols differ in the access control attributes they return for the directory entry [2]. This is just a side note.

If we need more information on WebDAV internals and how it handles access control, etc., here is the full documentation:

@amaltaro @vkuznet @khurtado

[1]

In [58]: try:
    ...:     ctx.listdir('davs://t2-xrdcms.lnl.infn.it:2880/pnfs/lnl.infn.it/data/cms/store/unmerged/Run2016F/DoubleMuonLowMass/NANOAOD/UL2016_MiniAODv1_NanoAODv2-v1/2800000')
    ...: except Exception as ex:
    ...:     excWebDav = ex
    ...: 

In [59]: try:
    ...:     ctx.listdir('srm://t2-srm-02.lnl.infn.it:8443/srm/managerv2?SFN=/pnfs/lnl.infn.it/data/cms/store/unmerged/Run2016F/DoubleMuonLowMass/NANOAOD/UL2016_MiniAODv1_NanoAODv2-v1/2800000')
    ...: except Exception as ex:
    ...:     excSRMv2 = ex
    ...: 

In [60]: excWebDav
Out[60]: gfal2.GError('HTTP 404 : File not found ', 2)

In [61]: excSRMv2
Out[61]: 
gfal2.GError('Error reported from srm_ifce : 2 [SE][Ls][SRM_INVALID_PATH] No such file or directory /pnfs/lnl.infn.it/data/cms/store/unmerged/Run2016F/DoubleMuonLowMass/NANOAOD/UL2016_MiniAODv1_NanoAODv2-v1/2800000',
             2)

In [63]: excSRMv2.code
Out[63]: 2

In [64]: excWebDav.code
Out[64]: 2
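
The session above suggests that a missing entry is best detected through the numeric code rather than the message text. A minimal sketch of that comparison follows, assuming only that the exception exposes `message` and `code`; `FakeGError`, `is_missing_by_code` and `is_missing_by_message` are hypothetical names, and the two messages are copied from the outputs above.

```python
import errno


class FakeGError(Exception):
    """Hypothetical stand-in for gfal2.GError: a message plus a numeric code."""

    def __init__(self, message, code):
        super().__init__(message)
        self.message = message
        self.code = code


def is_missing_by_code(exc):
    """Protocol-independent check: code 2 is ENOENT on POSIX systems."""
    return getattr(exc, "code", None) == errno.ENOENT


def is_missing_by_message(exc):
    """Fragile check that only recognises the WebDAV wording."""
    return "HTTP 404" in getattr(exc, "message", "")


exc_webdav = FakeGError("HTTP 404 : File not found ", 2)
exc_srm = FakeGError(
    "Error reported from srm_ifce : 2 [SE][Ls][SRM_INVALID_PATH] "
    "No such file or directory /pnfs/lnl.infn.it/data/cms/store/unmerged/"
    "Run2016F/DoubleMuonLowMass/NANOAOD/UL2016_MiniAODv1_NanoAODv2-v1/2800000",
    2,
)

for label, exc in (("WebDAV", exc_webdav), ("SRMv2", exc_srm)):
    print(label, is_missing_by_code(exc), is_missing_by_message(exc))
```

The code-based check treats both protocols alike, while the message-based one misses the SRMv2 case.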

[2]

In [29]: ctx.stat('srm://t2-srm-02.lnl.infn.it:8443/srm/managerv2?SFN=/pnfs/lnl.infn.it/data/cms/store/unmerged/Run2016F/DoubleMuonLowMass/NANOAOD/UL2016_MiniAODv1_NanoAODv2-v1/280000')
Out[29]: 
uid: 52
gid: 53
mode: 40755
size: 512
nlink: 1
ino: 0
ctime: 1623223126
atime: 0
mtime: 1623267391

In [35]: ctx.stat('davs://t2-xrdcms.lnl.infn.it:2880/pnfs/lnl.infn.it/data/cms/store/unmerged/Run2016F/DoubleMuonLowMass/NANOAOD/UL2016_MiniAODv1_NanoAODv2-v1/280000/')
Out[35]: 
uid: 0
gid: 0
mode: 40777
size: 0
nlink: 0
ino: 0
ctime: 1623223126
atime: 0
mtime: 1623267391
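
To read the two mode values side by side, they can be decoded with the Python standard library. This is a small sketch that assumes the `mode` field in the outputs above is printed in octal; `describe_mode` is a hypothetical helper name.

```python
import stat


def describe_mode(mode_text):
    """Decode an octal mode string into (is_directory, permission string)."""
    mode = int(mode_text, 8)  # assumption: the printed mode is octal
    return stat.S_ISDIR(mode), stat.filemode(mode)


print("SRMv2 :", describe_mode("40755"))
print("WebDAV:", describe_mode("40777"))
```

Under that assumption both entries decode as directories, with `drwxr-xr-x` reported over SRMv2 and `drwxrwxrwx` over WebDAV, which matches the access control difference noted above.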